Neo4j Enables Pulitzer Prize-Winning Investigation into Global Tax Evasion
The Challenge
The Panama Papers investigation grew out of the biggest data leak in the 21-year history of the International Consortium of Investigative Journalists (ICIJ) – and the biggest data leak of all time. In 2015, an anonymous leak of 11.5 million documents from Panama-based law firm Mossack Fonseca revealed the illicit use of offshore bank accounts by the world’s rich and famous.
The material comprised 40 years’ worth of confidential documents relating to over 200,000 companies in 21 tax havens, ranging from Switzerland and Hong Kong to Nevada in the US. These hideaways are used by individuals to conceal their true wealth from the tax authorities, behind a web of shell companies and accounts registered to front men or close relatives.
Initially, though, the ICIJ’s journalists struggled to sift through this trove of emails, financial spreadsheets, passports and corporate records, written in English, French, Spanish, Russian, Mandarin and Arabic.
“It was a shock at first,” said the ICIJ’s Data Editor Mar Cabra.
She recognized the ICIJ needed accessible technology to analyze this interconnected dataset and uncover the complex web of connections. And Cabra’s past experience suggested graph technology could be the answer.
The Solution
The ICIJ had already deployed a graph system in 2013 to publicly present the findings from its Offshore Leaks inquiry.
“This graph was the most successful product the ICIJ had ever used,” Cabra said. “You could enter a name and just double click, and the networks would expand. Millions of people had gone into it. So in the Panama Papers investigations, we knew that we needed graphs to understand the data better.”
To tame the 2.6 terabytes of Panama Papers data, the ICIJ extracted the document metadata using Apache Solr and Tika, then connected all the information together in a Neo4j graph database, accessed by the Linkurious data visualization tool. Alongside this, its member journalists used the OXWALL open source social platform to share their findings, tips, leads – and threats – relating to the investigation.
The ICIJ’s developers built the Neo4j graph around the leaked data’s key entities such as companies, their clients and officers. This enabled the journalists to uncover relationships between these core nodes – matching, say, bank accounts to people who had the same address, family ties or business links, or who regularly emailed each other.
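To make the idea of matching people through a shared address concrete, here is a minimal sketch in plain Python. It is not the ICIJ's schema or code: the labels (Officer, Address, Company), the relationship names (REGISTERED_ADDRESS, OFFICER_OF), the helper officers_sharing_address and all sample records are hypothetical, and an in-memory dictionary stands in for the Neo4j database.

```python
from collections import defaultdict

# Hypothetical sample data: every name and address here is invented.
nodes = {
    "o1": {"label": "Officer", "name": "Alice Example"},
    "o2": {"label": "Officer", "name": "Bob Example"},
    "o3": {"label": "Officer", "name": "Carol Example"},
    "a1": {"label": "Address", "name": "1 Sample Street"},
    "a2": {"label": "Address", "name": "2 Sample Avenue"},
    "c1": {"label": "Company", "name": "Sample Holdings Ltd"},
}

# Each relationship is (source node id, relationship type, target node id).
relationships = [
    ("o1", "REGISTERED_ADDRESS", "a1"),
    ("o2", "REGISTERED_ADDRESS", "a1"),
    ("o3", "REGISTERED_ADDRESS", "a2"),
    ("o1", "OFFICER_OF", "c1"),
]


def officers_sharing_address(nodes, relationships):
    """Return {address name: sorted officer names} for addresses used by 2+ officers."""
    by_address = defaultdict(list)
    for source, rel_type, target in relationships:
        if rel_type == "REGISTERED_ADDRESS":
            by_address[target].append(nodes[source]["name"])
    return {
        nodes[address_id]["name"]: sorted(names)
        for address_id, names in by_address.items()
        if len(names) > 1
    }


shared = officers_sharing_address(nodes, relationships)
print(shared)
```

In this toy data only the two officers registered at the same address are grouped together. Each additional relationship type, such as family ties or email contact, would be another kind of edge in the same structure.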
The graph comprised 840,000 nodes and 1.3 million relationships, but the reporters could simply type in an individual’s name and instantly reveal their web of connections. They could also dig deeper into the data through advanced Cypher queries.
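The "type a name and expand the network" interaction can be approximated by a breadth-first traversal outward from a starting node. The sketch below is an illustration under assumptions, not the Linkurious or Neo4j implementation: the functions build_adjacency and expand, the relationship names and the sample records are all hypothetical.

```python
from collections import deque

# Hypothetical sample relationships: (source, relationship type, target).
edges = [
    ("Alice Example", "OFFICER_OF", "Sample Holdings Ltd"),
    ("Bob Example", "OFFICER_OF", "Sample Holdings Ltd"),
    ("Bob Example", "REGISTERED_ADDRESS", "1 Sample Street"),
]


def build_adjacency(relationships):
    """Index relationships so each node maps to its neighbours, ignoring direction."""
    adjacency = {}
    for source, _rel_type, target in relationships:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def expand(adjacency, start, max_hops):
    """Breadth-first search: return {node: hop count} for nodes within max_hops of start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(current, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    return hops


adjacency = build_adjacency(edges)
network = expand(adjacency, "Alice Example", max_hops=2)
print(network)
```

Raising max_hops widens the network one step at a time, much as each further double click does in a visual graph explorer.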
The result was a huge leap forward from previously used technologies.
“The graph allowed you to explore these networks in a very, very easy way that anybody could understand,” Cabra said. “My journalists were amazed. We felt like we had superpowers, because the reaction was, ‘Oh my God, I did not see these connections before by looking through documents, I’m finding more stories.’ To them, this was magic. With graph databases you’re basically able to find connections that you couldn’t see before when working in an SQL database.”