Analyzing the BuzzFeed TrumpWorld Dataset with Neo4j


Update: The Google Doc source and structure changed, we updated the import statements and queries on the trumpworld-graph GitHub repository.

I came across this tweet by Sanchez Castro which I am more than happy to support.


As part of the Buzzfeed article Help Us Map TrumpWorld, the four investigative journalists, John Templon, Alex Campbell, Anthony Cormier, and Jeremy Singer-Vine asked the public to help them map and analyze the data that they investigated, confirmed and published surrounding the business connections of Donald J. Trump:

Now we are asking the public to use our data to find connections we may have missed, and to give us context we don’t currently understand. We hope you will help us — and the public — learn more about TrumpWorld and how this unprecedented array of businesses might affect public policy.
As you probably know, Neo4j has been used in investigative journalism before. Most notably in the collaborative work of the International Consortium of Investigative Journalists (ICIJ) on the Panama Papers investigations, but also by other investigative organizations and journalists. To support this kind of work, we launched the Data Journalism Accelerator Program last year to help investigative data journalists better leverage graph database technology.

The ICIJ data journalism investigation of the Panama Papers, powered by Neo4j


So, this below is our small contribution toward mapping and analyzing the first part of the TrumpWorld dataset. Much more connected data analysis is possible, which I’ll outline at the end of the article.

Digging into the TrumpWorld Dataset


The Buzzfeed article points to a Google Spreadsheet with collected, verified and researched data of the connections of organizations and people related to Trump with each group listed on a separate tab.

The data is fortunately available as a publicly readable document, so we can use our old trick of downloading it as CSV/TSV and using the download URL with LOAD CSV to import the dataset into Neo4j.

If you want to, download and start Neo4j (it only takes a minute). Then you can run these queries in your local Neo4j instance, or have a look at the interactive variant of this post as a Graph Gist (our data science notebooks for graphs). All example queries in this blog post use the Cypher graph query language.

Neo4j GraphGist of the TrumpWorld dataset


If you just want to run Neo4j with this dataset using a Docker container, you can do so. My colleague Mark Needham prepared one for you to use
docker run --publish=7474:7474 --publish=7687:7687 markhneedham/neo4j-3.1-trumpworld
then you can open Neo4j as usual on https://localhost:7474 to access the data.

Let’s have a look at the data first. Here is a sample:

WITH
'https://docs.google.com/spreadsheets/u/1/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&gid=634968401' AS url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row.`Organization A`,row.`Organization B`,row.Connection, row.`Source(s)`
LIMIT 5

row.`Organization A` row.`Organization B` row.Connection row.`Source(s)`
4 SHADOW TREE LANE MEMBER CORP. 4 SHADOW TREE LANE LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL DEVELOPMENT ASSOCIATES LLC 40 WALL STREET LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL STREET LLC 40 WALL STREET COMMERCIAL LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL STREET MEMBER CORP. 40 WALL STREET LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
401 MEZZ VENTURE LLC 401 NORTH WABASH VENTURE LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html

You can see that it’s quite straightforward: two columns for the node-entities (i.e., organizations), one each for the type of connection, along with the source of the information which represent the relationships.

As the connection information is quite verbose and not regular enough to serve as relationship-type, we either have to simplify it, as is demonstrated in this more detailed TrumpWorld graph.

Or we can use one generic relationship-type per mapping, which we will do here for the sake of simplicity:
    • (:Organization)-[:CONNECTED_TO]->(:Organization)
    • (:Person)-[:INVOLVED_WITH]->(:Organization)
    • (:Person)-[:RELATED_TO]->(:Person)
We store the “connection” and “source” information on each relationship.

Donald TrumpWorld graph data model


With this simple model at hand, we can now go ahead and import the data into our graph, following the same basic pattern:
  1. We get the data from the CSV file
  2. Create or find (MERGE) the two nodes
  3. Connect them with a relationship

Data Import


Data Constraints

As we don’t want duplicate entries for organizations and people, we set up two constraints:

CREATE CONSTRAINT ON (o:Organization) ASSERT o.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;

Connect Organizations with Other organizations from the 1st Tab

WITH 'https://docs.google.com/spreadsheets/u/1/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&gid=634968401' AS url
LOAD CSV WITH HEADERS FROM url AS row
MERGE (o1:Organization {name:row.`Organization A`})
MERGE (o2:Organization {name:row.`Organization B`})
CREATE (o1)-[con:CONNECTED_TO]->(o2)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Connect People with Organizations from the 2nd Tab

WITH 'https://docs.google.com/spreadsheets/u/1/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&gid=1368567920' AS url
LOAD CSV WITH HEADERS FROM url AS row
MERGE (p:Person {name:row.Person})
MERGE (o:Organization {name:row.Organization})
CREATE (p)-[con:INVOLVED_WITH]->(o)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Connect People with other People from the 3rd Tab

WITH 'https://docs.google.com/spreadsheets/u/1/d/1Z5Vo5pbvxKJ5XpfALZXvCzW26Cl4we3OaN73K9Ae5Ss/export?format=csv&gid=905294723' AS url
LOAD CSV WITH HEADERS FROM url AS row
MERGE (p1:Person {name:row.`Person A`})
MERGE (p2:Person {name:row.`Person B`})
CREATE (p2)-[con:RELATED_TO]->(p1)
SET con.connection=row.Connection, con.source=row.`Source(s)`

Import Results


So, in total we imported 1514 nodes and 1857 relationships.

And this is the full TrumpWorld graph in all its “beauty,” after querying for all data (raise the view limits in the sidebar config), go to fullscreen and then zoom out.

Discover how to use Neo4j to analyze the TrumpWorld dataset provided by data journalists at BuzzFeed


Running Some Simple Graph Database Queries


Most-Connected Organizations

What are the organizations that have the most connections within the TrumpWorld dataset, and which are those mainly?

MATCH (o:Organization)-[r]-()
RETURN o.name, count(*), collect(distinct type(r)) AS types
ORDER BY count(*) DESC
LIMIT 5

╒═════════════════════════════════════╤═══════════╤═══════════════════════════════════╕
│"o.name"                          │"count(*)"│"types"                         │
╞═════════════════════════════════════╪═══════════╪═══════════════════════════════════╡
│"THRIVE CAPITAL"                  │86        │["CONNECTED_TO","INVOLVED_WITH"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"40 WALL STREET LLC"              │41        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"DJT HOLDINGS LLC"                │36        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"KUSHNER COMPANIES"               │32        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"TRUMP HOTELS & CASINO RESORTS"   │28        │["INVOLVED_WITH","CONNECTED_TO"]│
└─────────────────────────────────────┴───────────┴───────────────────────────────────┘

The Second-Degree Kushner Network

MATCH network = (:Person {name:"JARED KUSHNER"})-[*..2]-()
RETURN network

Data analysis of the TrumpWorld dataset looking at the second-degree Kushner network


Which Banks Are Involved with These Organizations and People?

MATCH (bank:Organization)--(other)
WHERE bank.name contains "BANK"
RETURN *

Data connections between banks and Donald Trump


Donald Trump’s Connections to Vladimir Putin

And finally let’s look at the shortest paths between Vladimir Putin and Donald Trump.

MATCH (vp:Person {name:"VLADIMIR PUTIN"}),(dt:Person {name:"DONALD J. TRUMP"})
MATCH path = allShortestPaths( (vp)-[*]-(dt) )
RETURN path

Shortest path between Donald Trump and Vladimir Putin in Neo4j


As expected, the shortest path between Putin and Trump officially goes through Rex Tillerson. But as you can also see, there is a missing direct connection between Trump and Tillerson, the current nominee for U.S. Secretary of State.

As we’ll see next, the other nominees are there, so this is one bit of missing information that I reported back to Buzzfeed.

Other Nominees of Trump’s Cabinet

MATCH (p:Person)-[con:RELATED_TO]->()
WHERE con.connection CONTAINS "Nominee"
RETURN p.name, con.connection
ORDER BY split(p.name," ")[-1] DESC LIMIT 5

p.name con.connection
RYAN ZINKE Nominee for Secretary of Interior
JEFF SESSIONS Nominee for Attorney General
WILBUR ROSS Nominee for Secretary of Commerce
ANDY PUZDER Nominee for Secretary of Labor
SCOTT PRUITT Nominee for Administrator of the Environmental Protection Administration (EPA)
You see that Rex Tillerson should be the second in this list, sorted reverse by last name.

Nominee Influence

And finally we can look at the combined influence that these nominees exert within the TrumpWorld graph – at least from what is in the data we have here:

MATCH (o:Organization)<-[:INVOLVED_WITH]-(p:Person)-[con:RELATED_TO]->()
WHERE con.connection CONTAINS "Nominee"
RETURN o,p

We see that these 18 individuals are involved with 299 organizations that they either work with, own or control, are indebted to or are otherwise involved with Donald Trump.

Influencer analysis of Donald Trump's cabinet nominee network


Fun Graph Query


To conclude, I want to demonstrate what graph databases are quite good at – quickly finding transitive relationships within data. First, I looked for the longest shortest paths in our graph, checking a cross product of roughly 17,000 pairs of people. This finished in little over three seconds on my laptop.

match (p1:Person), (p2:Person) where id(p1) < id(p2)
match p = shortestPath((p1)-[*]-(p2))
return p1,p2,length(p) order by length(p) desc limit 20

╒══════════════════════════════╤═══════════════════════════╤═════════════╕
│"p1"                        │"p2"                    │"length(p)" │
╞══════════════════════════════╪═══════════════════════════╪═════════════╡
│{"name":"VICTOR HEGGELMAN"} │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"VICTOR HEGGELMAN"} │{"name":"CHRIS CHRISTIE"}│10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"TIM MOL"}          │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"ROMA DOWNEY"}      │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"CHRIS CHRISTIE"}   │{"name":"ROMA DOWNEY"}   │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"LEV LEVIEV"}       │{"name":"JOSE GARCIA"}   │9          │
└──────────────────────────────┴───────────────────────────┴─────────────┘


If we pick two of those – Lev Leviev and Roma Downey – who have probably little to do with each other in real life except her wearing diamonds, we can see who connects them in our graph.

MATCH (p1:Person {name:"LEV LEVIEV"}), (p2:Person {name:"ROMA DOWNEY"})
MATCH path = allShortestPaths((p1)-[*]-(p2))
RETURN path

Shortest path between Lev Leviev and Roma Downey in the TrumpWorld dataset


Extending the TrumpWorld Graph


There are many ways this graph can be extended, and I'll work on some of them in the coming weeks. Here are some ideas:
    • Combine with related data from littlesis.org, "the involuntary registry of the 1%"
    • Add FEC campaign financing information
    • Add foundation detail data from citizenaudit.org
    • Add people involved with the election campaign
    • Add second-degree organizations related to the other people in the graph, not just Donald J. Trump
    • Add more detailed information on the organizations and their setup (location, leadership, deals), e.g. by querying APIs like OpenCorporates or by integrating with the data from Aleph the OCCRP data portal
If you want to collaborate or contribute to this, please get in touch with us.

Our partner Linkurious was inspired by the first part of my work to demonstrate how you'd visualize this data with their software.


New to the world of graph technology?
Click below to get your free copy the O’Reilly Graph Databases book and discover how to harness the power of graph databases for your next project.


Get My Free Copy