ICIJ Empowers Investigative Journalists with Neo4j Graph Technology

Graph databases slash news investigation times from months to hours and enable reporters to unearth previously hidden facts and relationships.


Founded to expose wrongdoing and trigger positive change, the International Consortium of Investigative Journalists (ICIJ) brings together hundreds of investigative reporters and media organizations worldwide to untangle complex webs of corruption.

ICIJ’s investigative tools and expert journalists work across borders to enable news organizations to combat an onslaught of misinformation and break some of the world’s biggest stories. Whether partnering to uncover concealed information about shadow financial systems serving the rich, companies profiting by destroying ecosystems, or exposing other abuses, ICIJ stays true to its mission to foster positive change by ensuring people know what is going on.

Technology serves a critical role in supporting the organization’s accomplished team of almost 300 journalists as they partner with media organizations. “One of the main challenges facing any journalist is sifting through massive amounts of information to discover the truth and expose lies,” says Pierre Romera Zhang, Chief Technology Officer at ICIJ. “Even a story that at first appears local can reach across borders and involve thousands of complex interactions. That’s where Neo4j is critical.”

ICIJ has made a name for itself by helping surface the truth. The organization has received many of journalism’s most prestigious awards, including a Pulitzer Prize for Explanatory Reporting for the Panama Papers, which revealed vast financial corruption through offshore companies. For that investigation, ICIJ used a Neo4j graph database to identify hidden patterns and make breakthroughs in the data.

“Most journalists aren’t trained to be data scientists and news organizations aren’t set up to manage troves of disconnected information, yet that’s exactly what’s needed today to find and tell original, vital stories,” says Romera Zhang.

Transforming Journalism With Technology

Datashare, a secure document analysis platform built on open source tools, including Neo4j, is at the heart of ICIJ’s major global investigations. With Neo4j graph database technology, Datashare is optimized to scale horizontally and to quickly access data from multiple systems, uncovering relationships that were previously difficult or even impossible to see with common technologies like relational databases.

The solution brings together a decade of ICIJ’s investigative expertise so journalists can extract insights from massive datasets. “The platform levels the playing field for journalists, who now have access to powerful data and analytics tools traditionally used in other industries,” explains Romera Zhang.

Above: Datashare combines metadata extraction, search capabilities, and graph technology in a single package.

 

A single investigation can involve tens of millions of documents, making it virtually impossible for journalists to make connections manually. Before Datashare, journalists could spend months, or sometimes years, sifting through complicated webs of information to connect people and entities across countries.

ICIJ’s work on the Panama Papers involved getting through 2.6 terabytes of information in 11.5 million records. “With Neo4j graph technology, we made connections between activities and entities that otherwise would have been missed,” says Romera Zhang. “That’s when we had our idea for Datashare to give journalists a powerful tool to expose corruption.”

Above: A timeline of ICIJ’s investigations and Datashare development.

 

When collaborating with other journalists on the Panama Papers, the ICIJ team successfully mapped connections between people, governments, and corporations by extracting document metadata using Apache Solr and Tika and exporting the data to a Neo4j graph database.
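As a rough illustration of that export step, the sketch below turns extracted metadata records into parameterized Cypher MERGE statements. The record fields (doc_id, sender, recipients), the Person and Document labels, and the SENT and RECEIVED_BY relationship types are hypothetical choices made for this example, not ICIJ’s actual schema or pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata records, shaped like what a text-extraction step
# might emit for emails. Field names are assumptions for illustration.
records = [
    {"doc_id": "doc-1", "sender": "alice@example.com",
     "recipients": ["bob@example.com", "carol@example.com"]},
    {"doc_id": "doc-2", "sender": "bob@example.com",
     "recipients": ["alice@example.com"]},
]

Statement = Tuple[str, Dict[str, str]]


def to_cypher(record: dict) -> List[Statement]:
    """Build parameterized Cypher statements for one record.

    MERGE creates a node or relationship only if it does not already
    exist, so a person who appears in many documents stays one node.
    """
    statements: List[Statement] = [(
        "MERGE (p:Person {email: $email}) "
        "MERGE (d:Document {id: $doc_id}) "
        "MERGE (p)-[:SENT]->(d)",
        {"email": record["sender"], "doc_id": record["doc_id"]},
    )]
    for recipient in record["recipients"]:
        statements.append((
            "MERGE (p:Person {email: $email}) "
            "MERGE (d:Document {id: $doc_id}) "
            "MERGE (d)-[:RECEIVED_BY]->(p)",
            {"email": recipient, "doc_id": record["doc_id"]},
        ))
    return statements


# One statement per sender plus one per recipient, across all records.
all_statements = [s for r in records for s in to_cypher(r)]
for query, params in all_statements:
    print(query, params)
```

Each statement and its parameters could then be sent to a Neo4j instance through a database driver; that step is left out here so the sketch runs without a server.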

As the foundation for Datashare, Neo4j makes traversing connections between nodes fast by embedding relationships in the database structure. With traditional relational databases, journalists and data analysts have to infer connections by joining tables on foreign keys, which becomes inefficient as the number of hops between records grows.

Above: Foreign keys are columns in traditional relational databases that are linked to columns in different tables. In a graph structure, these relationships are quick to traverse. This efficiency is especially important for datasets with complex, interdependent relationships, such as those found in leaked documents.
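The contrast can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how Neo4j or any relational engine is implemented: the people, companies, and the officer_of link table are made up, and the "relational" version simply scans an unindexed table the way a naive self-join would.

```python
from collections import defaultdict, deque

# Relational style: a link table whose rows hold foreign keys pointing
# at people and companies. All names and IDs are invented.
people = {1: "Person A", 2: "Person B", 3: "Person C"}
companies = {10: "Shell Co X", 11: "Shell Co Y"}
officer_of = [(1, 10), (2, 10), (2, 11), (3, 11)]  # (person_id, company_id)


def co_officers_relational(person_id):
    """Find people who share a company by scanning the link table twice,
    roughly what a self-join on foreign keys does without an index."""
    my_companies = {c for p, c in officer_of if p == person_id}
    return sorted({people[p] for p, c in officer_of
                   if c in my_companies and p != person_id})


# Graph style: each node keeps direct references to its neighbours,
# so a hop is a lookup on the node itself rather than a table scan.
neighbours = defaultdict(set)
for p, c in officer_of:
    neighbours[("person", p)].add(("company", c))
    neighbours[("company", c)].add(("person", p))


def reachable_people(start_id, max_hops):
    """Breadth-first traversal following stored relationships hop by hop."""
    start = ("person", start_id)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return sorted(people[i] for kind, i in seen
                  if kind == "person" and i != start_id)


print(co_officers_relational(1))  # ['Person B']
print(reachable_people(1, 2))     # ['Person B']
print(reachable_people(1, 4))     # ['Person B', 'Person C']
```

In the relational version, reaching Person C from Person A would take another pass over the link table for every additional hop. In the graph version, going deeper only means raising max_hops.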

 

“Relational databases cannot efficiently analyze relationships within the large, densely interconnected datasets journalists encounter,” says Romera Zhang. “With Neo4j, creating a relationship map—the ontology—is easy, regardless of topic. The graph offers journalists a clear view of reality, so they can fact-check information and keep investigations on track.”

Above: Datashare visualizes employee relationships at Enron.

 

Datashare became ICIJ’s internal platform for analyzing large volumes of data in 2019, beginning with the Luanda Leaks. That reporting brought to light how decades of unchecked greed left an oil- and diamond-rich African country impoverished. Reporters used Datashare to analyze text messages, emails, PDFs, and other records to reconstruct the timelines of significant meetings and conversations between key figures to reveal the truth behind ‘what did they know and when did they know it.’

Above: Datashare automatically detects and filters data by person, organization, and location.

 

Datashare is one of the few data tools on the market that can ingest large, complex .pst files representing entire mailboxes from Microsoft Outlook—which sometimes reach several gigabytes—and instantly produce searchable results.

Zeroing In On ‘Smoking Guns’

ICIJ continues to use Datashare with Neo4j graph technology to cut through misinformation and surface truth where it is desperately needed.

The organization’s investigations into unlawfully obtained art and antiquities led to agreements with prominent museums and collectors to repatriate ancient statues and sculptures to Nepal, Cambodia, and Thailand. ICIJ’s “Deforestation Inc.” investigation illuminated forest destruction and human rights violations taking place under the guise of certified sustainability.

The consortium’s “Implant Files” investigation revealed how health authorities worldwide failed to protect millions of patients from poorly tested medical devices. More than 2.6 million people have since tapped into ICIJ’s Offshore Leaks Database to explore connections between world leaders, politicians, their family members, and other associates.

In 2020, ICIJ worked with over 100 media partners to publish the FinCEN Files. The work highlighted how global, U.S.-based banks initiated transactions to move more than $2 trillion and evade money laundering rules. U.S. lawmakers took action and passed the Corporate Transparency Act to stop the movement of dirty money and make company owners more accountable.

Expanding Access, Options for Hard-Hitting Reporting

Romera Zhang and his team are rolling out Datashare as a Service so media partners can index large volumes of files without requiring in-house computing resources. The team is also developing a sandbox environment to allow journalists to run their preferred AI algorithms on datasets directly within the Datashare platform, making analysis even more accessible.

Above: Neo4j Bloom is embedded into ICIJ Datashare, allowing journalists to search for patterns in the data.

 

ICIJ created a new Neo4j plug-in for Datashare to make the tool even easier for journalists to use. Designed with ICIJ investigations in mind, the plug-in streamlines creating graph databases so nontechnical journalists don’t need to master Cypher, Neo4j’s graph query language. The plug-in allows users to access graph statistics and explore connections between nodes using Neo4j’s visualization tool, Neo4j Bloom.

“Neo4j is the most important graph database on the market, and we are confident in the company and its technology,” Romera Zhang says. “That’s a major advantage as we democratize data access and analysis so investigative journalists can tell more stories that change the world.”