Here’s what we talked about:
Q: Talk to me about how you guys used Neo4j in the Panama Papers investigation.
Mar Cabra: At the International Consortium of Investigative Journalists, we have a team of programmers and journalists that work together to facilitate data analysis and data research, and to facilitate hundreds of journalists to work together. In the Panama Papers investigation, we gathered almost 400 journalists from around 80 countries to work together on 11.5 million files. There were also database files that then made these documents into a database, which we then transformed into Neo4j to explore the data in graph format.
What was that data about? It was about offshore companies and companies created in tax havens in places like the British Virgin Islands and Panama, and the people behind them. That’s why graphs were so great to see, “This is a shell company, but who’s really behind it?”
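The "who's really behind it?" question is a graph traversal. The sketch below is a minimal illustration in plain Python, not the ICIJ's actual code or data: the names, relationship types and the `people_behind` helper are all hypothetical. It walks ownership and officer links backwards from a shell company until it reaches people.

```python
from collections import deque

# Hypothetical toy data: (source, relationship, target) triples.
EDGES = [
    ("Alice Example", "SHAREHOLDER_OF", "Holding Co A"),
    ("Holding Co A", "SHAREHOLDER_OF", "Shell Co X"),
    ("Nominee Services Ltd", "DIRECTOR_OF", "Shell Co X"),
    ("Bob Example", "OFFICER_OF", "Nominee Services Ltd"),
]
PEOPLE = {"Alice Example", "Bob Example"}


def people_behind(company, edges, people):
    """Walk incoming edges breadth-first and collect every person reached."""
    # Index edges by target so we can ask "who points at this entity?"
    incoming = {}
    for source, _rel, target in edges:
        incoming.setdefault(target, []).append(source)

    found, seen, queue = [], {company}, deque([company])
    while queue:
        node = queue.popleft()
        for source in incoming.get(node, []):
            if source in seen:
                continue
            seen.add(source)
            if source in people:
                found.append(source)
            queue.append(source)
    return found


print(people_behind("Shell Co X", EDGES, PEOPLE))
```

A graph database makes this kind of multi-hop question a single query instead of a chain of table joins, which is the appeal described above.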
Q: How did you guys handle scale or performance for the 2.6 TB of data in the Panama Papers?
Mar: For us, the most important thing was that all my users – all my 400 journalists in 80 countries – could get to the research using the same tool, and sometimes very intensely, because they’re researching all the time in different time zones. For us, it was very important to have a tool that allowed us to visualize that data easily, and Linkurious was a great piece of software that we could place on our servers and have in our own environment without having to give the data to anybody.
It also allowed us to have a private environment where our reporters would log in with a username and password, and work on the data and even share the graphs. The good thing is that Linkurious could be set up very quickly. Once you have everything set up, then you just put the Neo4j data behind it, and immediately, you’re working with it. So that was the most important thing: that it was an easy deploy, and that it would not fail with those users doing such intensive work.
Q: Why did you guys choose Neo4j in the first place?
Mar: Because of Linkurious. I think that many clients have the same problem, which is, “I need to visualize my data.” Because we are visual individuals and visual animals. Everything comes through the eyes, so we need visuals to understand things that are complex.
We were looking for a graph visualization tool, and then we found Linkurious, and Linkurious was using Neo4j. That’s how we became interested in Neo4j. The good thing is that even if we had our data in SQL, we could easily transform it with an ETL tool – extract, transform and load – into Neo4j, plug it into Linkurious and that’s it.
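To make the extract, transform and load steps concrete, here is a rough sketch of the idea in Python. It is an assumption-laden illustration, not the ETL tool the ICIJ used: the `officers` table, its columns and the sample rows are invented, and the "load" step only builds parameterized Cypher `MERGE` statements rather than sending them to a database.

```python
import sqlite3


def extract(conn):
    """Extract: read officer/company rows from a relational table."""
    return conn.execute(
        "SELECT officer_name, role, company_name FROM officers ORDER BY rowid"
    ).fetchall()


def transform(rows):
    """Transform: turn each row into two nodes and one relationship."""
    nodes, rels = set(), []
    for officer, role, company in rows:
        nodes.add(("Officer", officer))
        nodes.add(("Company", company))
        rels.append((officer, role.upper().replace(" ", "_") + "_OF", company))
    return sorted(nodes), rels


def load_statements(nodes, rels):
    """Load (sketch): build Cypher statements with their parameters."""
    statements = []
    for label, name in nodes:
        statements.append((f"MERGE (n:{label} {{name: $name}})", {"name": name}))
    for source, rel_type, target in rels:
        # Relationship types cannot be Cypher parameters, so the type is
        # interpolated; in real use it should come from a trusted list.
        statements.append((
            "MATCH (o:Officer {name: $source}), (c:Company {name: $target}) "
            f"MERGE (o)-[:{rel_type}]->(c)",
            {"source": source, "target": target},
        ))
    return statements


# Hypothetical in-memory SQL source standing in for the original database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE officers (officer_name TEXT, role TEXT, company_name TEXT)"
)
conn.executemany(
    "INSERT INTO officers VALUES (?, ?, ?)",
    [
        ("Alice Example", "shareholder", "Shell Co X"),
        ("Bob Example", "director", "Shell Co X"),
    ],
)

nodes, rels = transform(extract(conn))
statements = load_statements(nodes, rels)
for cypher, params in statements:
    print(cypher, params)
```

Using `MERGE` rather than `CREATE` means a company that appears in many rows still ends up as a single node, which is what lets the connections show up in the visualization.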
Q: Can you talk to me about the other technologies that you guys used to turn those 11.5 million documents into something you could digest?
Mar: For us, it’s very important that our reporters get to communicate between themselves. One thing is to have knowledge available to reporters, the second thing is to get that knowledge shared. So it was very important for us to have an intranet to communicate. A social network that was private for our group of reporters to share tips, to share ideas, to share questions and to exchange all that and have a conversation. So we built a platform that allowed us to do that.
Remember, the reporters we work with are not people that work for the same company. They work for different media organizations, from different countries, different cultures. So having a hub where they can connect and talk like they were in the same news room is very important.
Of course, we used Neo4j and Linkurious to visualize the company data. These are the companies in tax havens that were created by a Panamanian law firm called Mossack Fonseca, one of the leading firms in this business in the world.
But we also had 11.5 million documents that were unstructured data – PDFs, emails, incorporation documents – and we needed to have a place where reporters could search that data. So we also had an internal search engine where reporters could log in, search the documents and find stories.
Q: In terms of the technical side of things, what were some of the most interesting or surprising results you encountered?
Mar: My reporters are great. We have among the best investigative reporters in the world working with us, but the technological skills of my people are not always at the same level. On one hand, we have reporters that are very good investigative reporters, but they are not very technologically skilled. On the other hand, we have developers working with us who are very tech savvy.
I think that most of our reporters, however, were fascinated to find connections through the graphs that they did not see before. I remember the first reactions were, “Oh, my God. This is magic!” or “Oh, my God. I found this person, and this person was connected to these three other people that I didn’t know about.”
Like I said, the discovery tool was great because the world is interconnected. Everything happens in a global way, and everything happens in an interconnected way. Crime is interconnected. Corruption is interconnected. People are interconnected. That’s how the world works, and that’s how we should investigate it.
That’s why the feedback from the reporters working with the ICIJ was very, very good, because they really could apply it to their job, which is to investigate wrongdoing.
Q: If you could take all the knowledge you have of Neo4j and Linkurious, and you could go back to the time when you first started, what would you do differently?
Mar: I know what I would do now. I think our work is not over. I think that there are so many more ideas of things we want to do.
For example, we know that we have a lot of emails. We haven’t looked at extracting the metadata from the emails or at the patterns and the graphs of the emails themselves, and that’s something that I know we can do with Neo4j. We’re looking forward to doing that if we have some time in the second semester of 2016. I know that is something I would have liked to be doing a year ago. But of course, there’s no time for everything.
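As a rough idea of what extracting email metadata into a graph could look like, the sketch below uses Python's standard `email` module to turn message headers into weighted sender-to-recipient edges. The sample messages, addresses and the `email_edges` helper are invented for illustration; this is not something the ICIJ has described building.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical raw messages; real ones would be read from files.
RAW_EMAILS = [
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.com>\n"
    "Subject: Incorporation\n\nBody text.",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: dave@example.com\n"
    "Subject: Follow-up\n\nBody text.",
]


def email_edges(raw_messages):
    """Count how often each sender wrote to each recipient (To and Cc)."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [
            addr.lower()
            for _name, addr in getaddresses(msg.get_all("From", []))
        ]
        recipients = [
            addr.lower()
            for _name, addr in getaddresses(
                msg.get_all("To", []) + msg.get_all("Cc", [])
            )
        ]
        for sender in senders:
            for recipient in recipients:
                if sender and recipient:
                    edges[(sender, recipient)] += 1
    return edges


edges = email_edges(RAW_EMAILS)
for (sender, recipient), count in sorted(edges.items()):
    print(f"{sender} -> {recipient}: {count}")
```

Each counted pair could then become a relationship between two address nodes in a graph database, with the count stored as a property, so that the heaviest correspondence patterns stand out.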
Something else: I would try to get more data analysts to work with us to analyze the data. I think that we have a great team. We have three developers working on the team at the data unit, and two other journalists. So, we’re six people working together analyzing data, but that’s not enough. We have so many things to do – stories, platforms and asking questions of reporters. I would say, I would love to involve more data-savvy analysts that could be dedicated to looking at trends and patterns, so that we find stories that we may have missed.
The good news in all this is that we will be releasing the Mossack Fonseca internal database: the database of all these Panama Papers offshore companies, more than 200,000 of them. [Access the database download here.]
There will be an option to look at it on our website, but also to download it. I’m looking forward to all the ideas we’re going to be getting through that. At least we’re going to get some power of the people and crowdsourcing of stories that we may have missed.
Q: What are your thoughts on the future of data as it relates to journalism and media?
Mar: Let me give you a fact: Two years ago, the ICIJ did not have a data team in-house. And now we’ve grown from zero to six people, which is actually 50% of ICIJ permanent staff. Right now, the people we’re hiring are people that have a developer background, or a journalism background with data skills. We’re also talking to people that come from PhDs as data analysts.
I think that journalism is not a thing that we’re only going to be doing in isolation anymore. Journalists are going to be working with data. It’s not the future; it’s a reality of the fact that journalism is opening up to other professions, and some of those professions are developers and data analysts. At the ICIJ, I don’t think we will do any future investigation that doesn’t have a data component.
Q: How does it feel – both as the ICIJ corporately, and perhaps personally – to see the impact that this story has had?
Mar: Well, we’re journalists, and one of our goals is to put in the public interest the topics that we have uncovered that people didn’t know about. For us, the greatest impact of the Panama Papers investigation is the fact that the world is talking about it, and the fact that politicians and policy makers are talking about it, talking about taking action.
It’s not that tax havens were something that we didn’t know about. Everybody knew about tax havens. Everybody knows there are studies that show the impact that tax havens have in today’s economy and in terms of equality. But there was no political will, and I think that we’ve helped move that political will a little bit more into action.
I think our biggest success is that, not only did we put something in the public spotlight that was unknown, but also that we made such a big bang that politicians, policy makers and people that can take action are at least promising now to do things. The next thing is we need to follow up, because sometimes they just promise, and those promises vanish.
Q: Anything else you want to add or say? Any closing thoughts?
Mar: Data is everywhere, and data is part of our lives. We generate data when we move. I’ve been generating data all day just by Twitter, by geolocation. I think that we need to become more and more data savvy and tech savvy.
I think that regular individuals – us journalists, even my mother (well, okay, probably not my mother) – but everybody should have some data skills to understand the world. Because if not, in a few years, we’re going to get lost without it.