By Michael Hunger & William Lyon, Developer Relations | November 17, 2017
Today, the ICIJ has publicly released data from its most recent year-long investigation into the offshore industry, known as the Paradise Papers.
In the last few weeks since the ICIJ announced their investigation, we’ve seen many reports being published covering activities of companies like Nike, Apple, the Queen of England’s estate, connections of Russian investments to politicians like Wilbur Ross and companies like Facebook and Twitter.
More than 13 million leaked documents, emails and database records have been analyzed using text analysis, full-text- and faceted-search and most interestingly to us, graph visualization and graph-based search.
The International Consortium of Investigative Journalists (ICIJ) makes use of the Neo4j graph database internally to aid their investigations. As the ICIJ says on their website:
Graph databases are the best way to explore the relationships between these people and entities — it’s much more intuitive to use for this purpose than a SQL database or other types of NoSQL databases. In addition, graphs allow to understand these networks in a very intuitive way and easily discover connections.The ICIJ has built a powerful search engine that sits atop Neo4j that allows for searching the Paradise Papers dataset and has made this available to the public as a web application. However, releasing the data as a Neo4j database enables much more powerful analysis of the data. Since Neo4j is an open source database this means that everyone has access to the same powerful tools for making sense of the data.
In a previous post we showed how graph analysis and Cypher – the query language for graphs – can be used to query the data to find connections in the Paradise Papers data. In this post, we show some techniques for querying and analyzing the data in Neo4j, including how we can create data visualizations to help up draw insight, and how we can use graph analysis to learn more about the offshore finance industry.
For a more thorough overview of the data model and example queries, see our previous post here.
The Data Model
The Paradise Papers dataset uses the property graph data model to represent data about offshore legal entities, officers who may be beneficiaries or shareholders of the entities, and the intermediaries that acted to create the legal entities.
The nodes in the graph are the entities and relationships connect them. We also store key-value pair properties on both the nodes and relationships, such as names, addresses, and data provenance attributes.
Graph visualization is a powerful way to explore the data. For example, identifying highly connected clusters of nodes can be done by visually examining the graph.
We can also perform aggregations when we query for tabular data. Let’s examine the overall size and shape of the Paradise Papers dataset.
How many nodes are there in the Paradise Papers dataset?
MATCH (n) RETURN labels(n) AS labels, COUNT(*) AS count ORDER BY count DESC ╒════════════════╤═══════╕ │"labels" │"count"│ ╞════════════════╪═══════╡ │["Officer"] │77012 │ ├────────────────┼───────┤ │["Address"] │59228 │ ├────────────────┼───────┤ │["Entity"] │24957 │ ├────────────────┼───────┤ │["Intermediary"]│2031 │ ├────────────────┼───────┤ │["Other"] │186 │ └────────────────┴───────┘
We can see that the data consists of information on over 84,000 officers (these are people or companies who play a role in an offshore company) with connections to almost 25,000 offshore legal entities, across 63,000 addresses. The addresses will become important to us later as we made use of location data.
We can also count the number of the different types of relationships in the dataset:
MATCH ()-[r]->() RETURN type(r), COUNT(*) ORDER BY COUNT(*) DESC ╒════════════════════╤══════════╕ │"type(r)" │"COUNT(*)"│ ╞════════════════════╪══════════╡ │"OFFICER_OF" │221112 │ ├────────────────────┼──────────┤ │"REGISTERED_ADDRESS"│128311 │ ├────────────────────┼──────────┤ │"CONNECTED_TO" │10552 │ ├────────────────────┼──────────┤ │"INTERMEDIARY_OF" │4063 │ ├────────────────────┼──────────┤ │"SAME_NAME_AS" │416 │ ├────────────────────┼──────────┤ │"SAME_ID_AS" │2 │ └────────────────────┴──────────┘
And compute degree distribution, to give us an idea of how connected different pieces of the graph are, on average:
MATCH (n) WITH labels(n) AS type, SIZE( (n)--() ) AS degree RETURN type, MAX(degree) AS max, ROUND(AVG(degree)) AS avg, ROUND(STDEV(degree)) AS stdev ╒════════════════╤═════╤═════╤═══════╕ │"type" │"max"│"avg"│"stdev"│ ╞════════════════╪═════╪═════╪═══════╡ │["Other"] │2891 │44 │236 │ ├────────────────┼─────┼─────┼───────┤ │["Address"] │9268 │2 │59 │ ├────────────────┼─────┼─────┼───────┤ │["Intermediary"]│115 │5 │8 │ ├────────────────┼─────┼─────┼───────┤ │["Officer"] │2726 │4 │20 │ ├────────────────┼─────┼─────┼───────┤ │["Entity"] │312 │11 │13 │ └────────────────┴─────┴─────┴───────┘
The Shortest Path from the Queen of England to Rex Tillerson
One powerful feature of a graph database like Neo4j is the ability to query for paths of arbitrary length. This allows us to find connections between nodes when we don’t know what the connections are, or even the length of the path.
I was curious to see if there were any indirect connections between two public figures who appear in the Paradise Papers dataset: Rex Tillerson, the U.S. Secretary of State who had connections to a Bermuda-based oil and gas company with operations in Yemen, and the Queen of England, whose estate, it was reported, was an investor in a Bermuda-based company. We can easily query for such a path using Cypher:
MATCH p=shortestPath((rex:Officer)-[*]-(queen:Officer)) WHERE rex.name = "Tillerson - Rex" AND queen.name = "The Duchy of Lancaster" RETURN p
This shows us a single shortest path connecting the Queen of England and Rex Tillerson. The path goes through several offshore entities and officers with connections to these entities. If we adjust our query slightly to include all shortest paths, we see that several of the officers in our path share connections with many legal entities.
MATCH p=allShortestPaths((rex:Officer)-[*]-(queen:Officer)) WHERE rex.name = "Tillerson - Rex" AND queen.name = "The Duchy of Lancaster" RETURN p
A quick Google search reveals that these individuals are corporate services managers: individuals who are paid to serve as directors of offshore entities to handle the administration of these entities.
Querying the data using Cypher is useful for exploring the graph and answering questions that we have, such as “What are all the offshore legal entities that Wilbur Ross is connected to?”. But what if we want to know who are the most influential nodes in the network? Or elements of the graph who have the highest transitive relevance?
We can easily run the PageRank centrality algorithm on the whole graph dataset using Cypher:
and then query for the
Entitynode with the highest PageRank score:
MATCH (e:Entity) WHERE exists(e.pagerank_g) RETURN e.name AS entity, e.jurisdiction_description AS jurisdiction, e.pagerank_g AS pagerank ORDER BY pagerank DESC LIMIT 15 ╒═════════════════════════════════════════════════╤════════════════╤══════════════════╕ │"entity" │"jurisdiction" │"pagerank" │ ╞═════════════════════════════════════════════════╪════════════════╪══════════════════╡ │"WORLDCARE LIMITED" │"Bermuda" │18.110508499999998│ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Ferrous Resources Limited" │"Isle of Man" │17.326935999999996│ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"American Contractors Insurance Group Ltd." │"Bermuda" │15.6201275 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Gulf Keystone Petroleum Limited" │"Bermuda" │12.81925 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Warburg Pincus (Bermuda) Private Equity X, L.P."│"Bermuda" │12.312412 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Madagascar Oil Limited" │"Bermuda" │11.611646499999999│ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Coller International Partners IV-D, L.P." │"Cayman Islands"│11.394854 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Milestone Insurance Co., Ltd." │"Bermuda" │11.224089 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"CL Acquisition Holdings Limited" │"Cayman Islands"│11.0752455 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Alpha and Omega Semiconductor Limited" │"Bermuda" │10.965910000000001│ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤ │"Coller International Partners V-A, L.P." │"Cayman Islands"│10.8205005 │ ├─────────────────────────────────────────────────┼────────────────┼──────────────────┤
The registered addresses of many of the officers and legal entities are available in the Paradise Papers data. Using a service such as the Nominatim API or Google’s geocoding API we can perform a lookup to turn these address strings into latitude and longitude points.
One visualization tool we can use is a heatmap, where observations are represented as colors. More intense colors mean more addresses in that area. Examining a heatmap of Paradise Papers addresses shows a high concentration of addresses in the Atlantic, just off the coast of North America. Many of these addresses are in Bermuda, a known offshore jurisdiction.
Heatmap of Paradise Papers geocoded addresses. Try it live.If we compare this heatmap with a heatmap of geocoded addresses from the Panama Papers dataset (an earlier leak investigated by ICIJ), we can see we have quite a different geographic distribution of addresses.
Instead of a large concentration in the Atlantic, we see a higher concentration in Asia and, to a lesser degree, Europe. The Panama Papers leak has a high number of addresses in Singapore and Kuala Lumpur.
Heatmap of Panama Papers geocoded addressesUsing the geocoded addresses, we can also interactively explore the Paradise Papers as a map. Clicking on an address marker of interest issues a Cypher query to find the
Entitynodes connected to this address.
Exploring the ritzy suburbs of Las Vegas, we can see many addresses that show up in the Paradise Papers. In fact, we easily stumble upon the casino magnate Sheldon G. Adelson who it was revealed has a connection to a Bermuda company he uses to register his casino’s private jets, transferring tens of millions of dollars to a tax-free jurisdiction.
Annotated map of geocoded addresses in Paradise Papers showing registered address of
Officer nodes and connected legal entities and jurisdictions. Try it live.
When looking at the implications of the structure of the offshore finance industry, one of the questions investigative journalists try to answer is “Who are the enablers?” One aspect of finding enablers is to look at the jurisdictions that make the offshore industry possible.
One can theorize about historical, legal, and economic reasons why some jurisdictions may be chosen for citizens of certain countries, but data like the Paradise Papers are so important for gaining insight into the offshore finance industry because much of this world is so secretive. Next, we examine some of the jurisdiction information in the data.
MATCH (e:Entity) WITH e.jurisdiction_description AS juris, COUNT(*) AS count WHERE count > 20 RETURN * ORDER BY count ASC
We can see that Bermuda and the Cayman Islands far outnumber the other jurisdictions. This makes sense given what we know about the main source of the data, which was a law firm with offices in Bermuda (and many other countries).
We can extend our analysis to begin to answer the question, “Are there certain jurisdictions that citizens of particular countries prefer?” or “What are the most popular offshore jurisdictions, by country of residence of the beneficiary or officer? We can begin to take a look at that answer by creating a bipartite graph of
Officercountry and entity jurisdiction. We can visualize this data in a chord diagram that shows us the relative distribution of flow through the bipartite graph.
MATCH (a:Address)--(o:Officer)--(e:Entity) WITH a.countries AS officer_country, e.jurisdiction_description AS juris, COUNT(*) AS num RETURN * ORDER BY num DESC
This diagram shows us that the United States is by far the most popular country for
Officerto give as their registered address. And of those officers with addresses in the US, Bermuda and Cayman Islands are the most popular offshore jurisdictions. This is not surprising as we saw earlier that those two jurisdictions are by far the most popular in the dataset.
What Can You Find?
This was an overview of the now-public Paradise Papers dataset released by ICIJ. ICIJ has released the leaked data packaged as a Neo4j database to enable everyone to use the same open source software they use for making sense of the complex web of the offshore finance industry.
You can find the Paradise Papers dataset available on the Neo4j Sandbox and soon available for download as a Neo4j database on the ICIJ website. We encourage you to explore the data and see what insights you can find about the offshore finance industry.
As you explore the data, be sure to check out some of the great resources for learning Cypher and graph databases. And if you like the work that the ICIJ is doing, remember that they are an independent media organization and rely on your generous donations to operate.
You can find the code for generating all visualizations in this post on GitHub.
Download My Free Copy
Editor’s note: ICIJ has published this data with the following note: “There are legitimate uses for offshore companies and trusts. We do not intend to suggest or imply that any people, companies or other entities included in the ICIJ Offshore Leaks Database have broken the law or otherwise acted improperly. Many people and entities have the same or similar names. We suggest you confirm the identities of any individuals or entities located in the database based on addresses or other identifiable information.”
About the Author
Michael Hunger & William Lyon, Developer Relations
Michael Hunger has been passionate about software development for a very long time. For the last few years he has been working with Neo Technology on the open source Neo4j graph database filling many roles. As caretaker of the Neo4j community and ecosystem he especially loves to work with graph-related projects, users and contributors. As a developer Michael enjoys many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects and contributing and writing software related books and articles.
William Lyon is a Developer Relations Engineer at Neo Technology. Prior to joining Neo, William worked as software developer for several startups in the real estate, quantitative finance and predictive API spaces. William holds a Masters degree in Computer Science from the University of Montana.
From the CEO
Have a Graph Question?
Reach out and connect with the Neo4j staff.Stackoverflow
Share your Graph Story?
Email us: firstname.lastname@example.org