In an investigation led by the International Consortium of Investigative Journalists (ICIJ), this financial data leak exposed the highly interconnected, multi-layered offshore tax structures that the Panamanian law firm Mossack Fonseca created over the past 40 years for some of the world’s highest-ranking public officials and celebrities.
At 2.6 terabytes and 11.5 million documents, the Panama Papers dwarf every data leak that’s dominated headlines in the last decade. Oh, did I say “decade”? I meant ever.
The comparative size of the Panama Papers data leak. (Image source: USA Today)
With a leak this big (and this global), there are a lot of legal and political implications to be worked out, but I’d like to double-click on one aspect in particular: how all this data was analyzed in the first place.
The Role of Graph Technology in the Panama Papers
This isn’t the first time the ICIJ has pulled off an investigation at this level; it’s just the latest in a series of data journalism wins. Only last year, the ICIJ published the Swiss Leaks story, exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland.
What ties these two data leak stories together is how the ICIJ team worked with their data: as a graph.
Mar Cabra, the ICIJ’s Data and Research Unit Editor, has said that when the Swiss Leaks material crossed her desk she knew she needed a different kind of tool to analyze such a complex and interconnected dataset, one that could process such a large volume of connections quickly and efficiently.
She’s also said she wanted an easy-to-use and intuitive solution that didn’t require the intervention of a data scientist or developer. For Cabra and the ICIJ, the data discovery and analysis process had to be accessible to investigative journalists around the globe – regardless of their technical background.
As the robustness and depth of the Panama Papers investigation have clearly shown, Cabra’s decision to use a graph-based approach, specifically Linkurious and Neo4j, was the right choice.
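To make “working with data as a graph” a little more concrete, here’s a minimal sketch in plain Python. Everything in it is an assumption for illustration: the people, companies and relationship names are made up, and this is not the ICIJ’s actual data model or how Neo4j works internally. It only shows the basic idea that once records are stored as nodes and relationships, a question like “how is this person connected to that one?” becomes a simple traversal.

```python
from collections import deque

# Hypothetical, made-up records in the form (source, relationship, target).
# None of these names come from the actual leak.
EDGES = [
    ("Person A", "SHAREHOLDER_OF", "Shell Co 1"),
    ("Shell Co 1", "OWNS", "Shell Co 2"),
    ("Law Firm X", "REGISTERED", "Shell Co 2"),
    ("Person B", "DIRECTOR_OF", "Shell Co 2"),
    ("Person C", "DIRECTOR_OF", "Shell Co 3"),
]


def build_graph(edges):
    """Index every node by its neighbours, ignoring edge direction."""
    graph = {}
    for source, relationship, target in edges:
        graph.setdefault(source, []).append((relationship, target))
        graph.setdefault(target, []).append((relationship, source))
    return graph


def find_connection(graph, start, goal):
    """Breadth-first search for the shortest chain of nodes from start to goal.

    Returns the chain as a list of node names, or None if no chain exists.
    """
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _, neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(EDGES)
print(find_connection(graph, "Person A", "Person B"))
```

In this toy dataset, Person A and Person B never appear in the same record, yet the traversal links them through two layers of shell companies, while Person C stays unconnected. A graph database is built to answer this kind of connection question across millions of records rather than five.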
Big Data Analysis: No Longer the Domain of the Ivory Tower
Let’s take a step back.
For over a decade, big web firms like the Googles and the Facebooks of the world have built up a serious array of skills and tools that allow them to derive insight and value from massive amounts of data. Data is their core differentiation and their business models depend on increasingly sophisticated ways of working with information.
At the same time, Snowden and others have revealed to us that over the last several years, big governments have also built up the capability to process huge datasets that track our digital lives. Both these governments and big web companies have vast resources in terms of time, money and PhDs to devote to this level of data processing and analysis.
But outside these two groups, this capability has been sorely lacking. If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of a dataset at this scale.
If nothing else, this data leak makes it strikingly clear how important it is that highly scalable data analysis be made available to everyone: whether that’s a startup trying to disrupt a long-established incumbent or a small gang of investigative journalists who need to make sense of the biggest data leak in history.
So while global organizations have been amassing these proprietary processing capabilities, there’s been a parallel movement towards an open technology stack for working with connected data of this magnitude. And at the center of that stack is Neo4j.
The democratization of technologies to make sense of data at scale is an important part of a free and open society, and I’m proud of the role we play in that evolving landscape – not only in the case of Swiss Leaks and the Panama Papers, but in solving future problems we can’t even yet imagine.