How the ICIJ Used Graph Database Technology to Uncover the Swiss Leaks Story
Traditionally, reporters have had to spot relationships in data spread across Excel files, run manual Internet searches and sometimes physically draw out the connections between people and entities to get the facts right for their stories.
However, Davet and Lhomme recognised that the Swiss Leaks dataset was simply too complex to analyse manually or on their own. So they turned to the International Consortium of Investigative Journalists (ICIJ), which launched one of the biggest journalistic collaborations of all time.
Mar Cabra, editor of the Data and Research Unit at the ICIJ, knew they would need a tool that could better analyse the relationships in the data for both this and future investigations.
The Swiss Leaks data covered HSBC account holders in more than 200 countries, who collectively held sums in excess of $100 billion. But that information was scattered across thousands of files with no straightforward connections between them. The complexity of the data meant Cabra and the ICIJ needed a way to analyse vast amounts of unstructured data and make sense of it quickly and easily.
“While working on stories like Offshore Leaks, I learned how important graph analysis is when investigating financial corruption,” Cabra said. “Connections are the key to understanding what the real story is: they show who’s doing business with whom. We decided early on that we needed to use a graph-database approach for the HSBC Leaks.”
The Data and Research Unit’s first move was to recreate the HSBC client database from the plain Excel files it had been given. Next, they connected every name to one or more countries; both the names and the countries are ‘nodes’ in the graph database. Finally, they turned the data into a graph format to explore the connections between nodes.
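To make the modelling step concrete, here is a minimal sketch in Python of how tabular records could be turned into nodes and relationships. It is an illustration only, not the ICIJ's actual pipeline: the rows, the `build_graph` helper, the `Client` and `Country` labels and the `LINKED_TO` relationship name are all assumptions invented for this example.

```python
# Hypothetical rows standing in for spreadsheet records; the client names
# and countries are invented for illustration only.
rows = [
    {"client": "Client A", "country": "France"},
    {"client": "Client A", "country": "Switzerland"},
    {"client": "Client B", "country": "Switzerland"},
]


def build_graph(records):
    """Turn flat records into a set of nodes and a set of relationships.

    Each node is a (label, name) pair, so a client and a country that
    happened to share a name would still be distinct nodes.
    """
    nodes = set()
    edges = set()
    for record in records:
        client = ("Client", record["client"])
        country = ("Country", record["country"])
        # Sets deduplicate, so a client listed in several rows is one node.
        nodes.update((client, country))
        # "LINKED_TO" is an assumed relationship name, not the ICIJ's schema.
        edges.add((client, "LINKED_TO", country))
    return nodes, edges


nodes, edges = build_graph(rows)
print(len(nodes), "nodes,", len(edges), "relationships")
```

The point of the sketch is that a name appearing in many files collapses into a single node, which is what lets connections scattered across separate files show up in one place.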
In total, the leak held around 60,000 files containing information on more than 100,000 clients in 203 countries. The resulting graph database had more than 275,000 nodes with 400,000 relationships between them.
The ICIJ worked with open source integration software specialist Talend to transfer the original dataset into Neo Technology’s Neo4j graph database. Another Neo partner, Linkurious, provided a web app as a user interface, so that the graph database could be visualised and easily accessed by reporters.
The graph visualisation approach allowed ICIJ journalists to identify the connections between people and bank accounts, helping them ‘follow the money’ to identify dozens of instances of fraud, corruption and tax evasion.
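The idea of ‘following the money’ through a graph can be sketched as a shortest-path search between two nodes. The example below is a hedged illustration using a plain breadth-first search in Python; it does not reproduce how Neo4j or Linkurious work internally, and the people, accounts, company and `shortest_path` helper are hypothetical.

```python
from collections import deque

# Hypothetical adjacency list: each key is a node, each value lists the
# nodes it is directly connected to. All names are invented.
graph = {
    "Person A": ["Account 1"],
    "Account 1": ["Person A", "Company X"],
    "Company X": ["Account 1", "Account 2"],
    "Account 2": ["Company X", "Person B"],
    "Person B": ["Account 2"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search returning the shortest chain of connections,
    or None when the two nodes are not connected at all."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(graph, "Person A", "Person B")
print(" -> ".join(path))
```

In this toy graph the two people never appear in the same record, yet the search links them through two accounts and a shared company, which is the kind of indirect connection that is hard to spot in separate spreadsheets.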