Excerpt from article published by Forbes
It’s the largest information leak of all time: 2.6 terabytes, representing 11.5 million documents, all of them from the Panama-based law firm Mossack Fonseca, known for its work helping clients establish offshore corporations in tax havens.
It’s a setup worthy of the finest spy thriller, and one day that dialogue will surely play out on the big screen. What happens next is another matter.
The material that John Doe leaked to Süddeutsche Zeitung included nearly five million emails, three million database records (each often called a “row” of data: a small grouping of related facts, such as the details of a transaction or an individual’s contact information) and 3.5 million image, PDF or other files, most of them scanned paper documents. What’s the reporter going to do now? He can’t read and make sense of 11.5 million documents.
Volume is not the only issue. The database records mean little without their schema, the guide that defines how the material is organized, and the schema was not included with the leak. The documents are in many languages. It will require a team of technical experts equipped with significant software and computer hardware resources just to make the data accessible.
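To make the schema problem concrete, here is a minimal Python sketch. The record, the column names and the types are all invented for illustration and are not taken from the leaked data. It shows how the same row is opaque as a bare list of values and becomes readable only once a schema names and types each position.

```python
# Hypothetical record; every value and column name here is invented.
raw_row = ("10023", "2009-03-14", "7", "A. Example", "1")

# Without a schema, the fields are just anonymous positions.
# A schema names each position and says how to interpret its value.
schema = [
    ("entity_id", int),
    ("incorporated_on", str),
    ("jurisdiction_code", int),
    ("officer_name", str),
    ("is_active", lambda v: v == "1"),
]


def apply_schema(row, schema):
    """Pair each raw value with its column name and convert its type."""
    if len(row) != len(schema):
        raise ValueError("row and schema have different lengths")
    return {name: convert(value) for (name, convert), value in zip(schema, row)}


record = apply_schema(raw_row, schema)
print(record)
```

Without the schema, an analyst would have to infer from the values alone that the third field is a jurisdiction code rather than, say, a count of shareholders.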
Real-life investigation is not for loners. No single person or media outlet could handle this leak alone, so Süddeutsche Zeitung turned to the International Consortium of Investigative Journalists (ICIJ) to provide resources for data sharing and analysis and to coordinate a team that grew to involve hundreds of journalists over the course of the past year. ICIJ, which has a network of journalists spanning 65 countries, has coordinated investigations of four major leaks involving offshore corporations over the past few years. Recognizing the growing technical complexity of reporting, it has been developing expertise and resources to handle complex data investigations.
Tools used in connection with the Panama Papers data analysis work:
- Apache Tika – data and metadata extraction
- Apache Solr – indexing
- Blacklight (https://projectblacklight.org/) – user interface
- Amazon Web Services Cloud – virtual servers
- Tesseract – optical character recognition (OCR)
- VeraCrypt – encryption for hard drives
- Talend – data extraction, transformation, and loading (ETL)
- Neo4j – data storage
- Nuix – OCR, data indexing, visualization
- Linkurious – user interface/visualization
- Oxwall – social network development
- PGP – secure communication
- Hushmail – secure communication
- Threema – secure communication
- Signal – secure communication
- Internally created tools, including customization of other tools to enhance security
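The indexing step in the list above is what makes millions of extracted documents searchable. As a rough illustration of the idea only, the following Python sketch builds a toy inverted index over a few invented document texts. It is not Solr's API and not ICIJ's code; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Invented sample texts standing in for output of an extraction/OCR step.
documents = {
    "doc1": "Share transfer agreement for Example Holdings Ltd",
    "doc2": "Email regarding Example Holdings bank account",
    "doc3": "Scanned passport page",
}
index = build_index(documents)
hits = search(index, "example holdings")
print(sorted(hits))
```

A production search engine adds much more (ranking, language-aware analysis, distributed storage), but the lookup from term to matching documents is the core that lets a reporter type a name and get back every file that mentions it.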