UK Companies Data
UK Companies Data
This Neo4j Browser guide will walk you through the process of importing and querying UK company registration, land ownership, and political donation data. We’ll learn Cypher along the way!
Be sure to refer to the Cypher Ref Card as you work through the examples.
Outline:
-
Data Import
-
Querying With
MATCH
-
Using Relationships
-
Fuzzy Matches
-
Working With Numbers
-
Dates
-
Location Data
-
Graph Algorithms
Import
Be sure to enable multi-statement query editor in Neo4j Browser, then run:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
//Constraint for a node key is a Neo4j Enterprise feature only - run on an instance with enterprise
//CREATE CONSTRAINT ON (p:Person) ASSERT (p.birthMonth, p.birthYear, p.name) IS NODE KEY
CREATE CONSTRAINT ON (p:Property) ASSERT p.titleNumber IS UNIQUE;
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/PSCAmericans.csv" AS row
MERGE (c:Company {companyNumber: row.company_number})
RETURN COUNT(*);
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/PSCAmericans.csv" AS row
MERGE (p:Person {name: row.`data.name`, birthYear: row.`data.date_of_birth.year`, birthMonth: row.`data.date_of_birth.month`})
ON CREATE SET p.nationality = row.`data.nationality`,
p.countryOfResidence = row.`data.country_of_residence`
// TODO: Address
RETURN COUNT(*);
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/PSCAmericans.csv" AS row
MATCH (c:Company {companyNumber: row.company_number})
MATCH (p:Person {name: row.`data.name`, birthYear: row.`data.date_of_birth.year`, birthMonth: row.`data.date_of_birth.month`})
MERGE (p)-[r:HAS_CONTROL]->(c)
SET r.nature = split(replace(replace(replace(row.`data.natures_of_control`, "[",""),"]",""), '"', ""), ",")
RETURN COUNT(*);
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/CompanyDataAmericans.csv" AS row
MATCH (c:Company {companyNumber: row.` CompanyNumber`})
SET c.name = row.CompanyName,
c.mortgagesOutstanding = toInteger(row.`Mortgages.NumMortOutstanding`),
c.incorporationDate = Date(Datetime({epochSeconds: apoc.date.parse(row.IncorporationDate,'s','dd/MM/yyyy')})),
c.SIC = row.`SICCode.SicText_1`,
c.countryOfOrigin = row.CountryOfOrigin,
c.status = row.CompanyStatus,
c.category = row.CompanyCategory;
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/ElectionDonationsAmericans.csv" AS row
MATCH (c:Company) WHERE c.companyNumber = row.CompanyRegistrationNumber
MERGE (p:Recipient {name: row.RegulatedEntityName})
SET p.entityType = row.RegulatedEntityType
MERGE (c)-[r:DONATED {ref: row.ECRef}]->(p)
SET r.date = Date(Datetime({epochSeconds: apoc.date.parse(row.ReceivedDate,'s','dd/MM/yyyy')})),
r.value = toFloat(replace(replace(row.Value, "£", ""), ",", ""));
LOAD CSV WITH HEADERS FROM "https://guides.neo4j.com/ukcompanies/data/LandOwnershipAmericans.csv" AS row
MATCH (c:Company {companyNumber: row.`Company Registration No. (1)`})
MERGE (p:Property {titleNumber: row.`Title Number`})
SET p.address = row.`Property Address`,
p.county = row.County,
p.price = toInteger(row.`Price Paid`),
p.district = row.District
MERGE (c)-[r:OWNS]->(p)
WITH row, c,r,p WHERE row.`Date Proprietor Added` IS NOT NULL
SET r.date = Date(Datetime({epochSeconds: apoc.date.parse(row.`Date Proprietor Added`,'s','dd-MM-yyyy')}));
CREATE INDEX ON :Company(incorporationDate);
Verify Data Import
Verify the imported data model by running:
CALL db.schema.visualization();
Person
-
name
-
nationality
-
countryOfResidence
-
birthYear
-
birthMonth
Relationships
(:Person)-[:HAS_CONTROL]→(:Company)
Company
-
companyNumber
-
name
-
status
-
SIC
-
countryOfOrigin
-
incorporationDate
-
mortgagesOutstanding
Relationships
-
(:Company)-[:OWNS]→(:Property)
-
(:Company)-[:DONATED]→(:Recipient)
Property
-
titleNumber
-
address
-
county
-
district
-
price
Recipient
-
name
-
entityType
Querying With MATCH
Now that we’ve imported our data, it’s time to query it!
In Neo4j, we use the MATCH
command to query data. The syntax is MATCH
, followed by a graph pattern. For example:
MATCH (p:Person {name: "Margery Kraus"})
RETURN p
-
The
()
represent a node -
:Person
is the node label -
{}
indicate properties -
{name: ""}
-
p
becomes a variable that is bound to pieces of the graph that match the pattern -
We use
RETURN
to return data and visualize the results
Querying With MATCH
- Exercise
Now it’s your turn!
-
Find the
Person
node with the nameMichael Rubens Bloomberg
-
Find the
Company
with the nameGRAPHIC PLC
Remember the basic format for MATCH
:
MATCH (variable:NodeLabel {property: "value"})
RETURN variable
Querying With MATCH
- Answers
Find the Person
node with the name Michael Rubens Bloomberg
MATCH (p:Person {name: "Michael Rubens Bloomberg"})
RETURN p
Find the Company
with the name GRAPHIC PLC
MATCH (c:Company {name: "GRAPHIC PLC"})
RETURN c
Using Relationships
Nodes are connected by relationships. We can define more complex graph patterns that include relationships in our MATCH
statements using square brackets []
to define the relationship. For example:
MATCH (p:Person {name: "Margery Kraus"})-[:HAS_CONTROL]->(c:Company)
RETURN p, c
-
Note the
-[:HAS_CONTROL]→
pattern
Using Relationships - Exercise
-
Can you find the companies connected to Michael Rubens Bloomberg?
-
Do these companies connected to Michael Ruben Bloomberg own any properties?
-
Did these companies connected to Michael Ruben Bloomberg make any political donations?
Using Relationships - Answers
Companies connected to Michael Rubens Bloomberg?
MATCH (p:Person {name: "Michael Rubens Bloomberg"})-[:HAS_CONTROL]->(c:Company)
RETURN p, c
Do these companies own any properties?
MATCH (p:Person {name: "Michael Rubens Bloomberg"})-[:HAS_CONTROL]->(c:Company)-[:OWNS]->(pr:Property)
RETURN p, c, pr
Did these companies make any political donations?
MATCH (p:Person {name: "Michael Rubens Bloomberg"})-[:HAS_CONTROL]->(c:Company)-[:DONATED]->(r:Recipient)
RETURN p, c, r
Fuzzy Matches
We’ve seen how to do exact comparisons, but what about "fuzzy" matches? For example, what if we didn’t know Michael Bloomberg’s middle name? Or wanted to take slight misspellings into account?
We have a few options for non-exact matches:
-
The
CONTAINS
string comparison operator -
Regular expressions
-
True fuzzy match with a full text index
Fuzzy Matches - CONTAINS
The CONTAINS
string comparison operator can be used to match on strings that contain sub-strings.
To take advantage of CONTAINS
, we need to introduce the WHERE
clause. We can use any boolean expression in a WHERE
clause to filter matches. For example:
MATCH (p:Person)
WHERE p.name CONTAINS "Bloomberg"
RETURN p
Fuzzy Matches - Regular Expression
We can also use regular expressions.
This is equivilent to using a CONTAINS
:
MATCH (p:Person)
WHERE p.name =~ ".*Bloomberg.*"
RETURN p
We can also do case-insensitive:
MATCH (c:Company)
WHERE c.name =~ "(?i)graphic.*"
RETURN c
See the Regular Expression docs for more examples.
Fuzzy Matches - Full Text Index
A full-text index can help us make true fuzzy comparisons - taking into account mispellings.
First, we must create the full text index:
CALL db.index.fulltext.createNodeIndex("nameIndex", ["Person"], ["name"])
Then we can query it:
CALL db.index.fulltext.queryNodes("nameIndex", "Peterson~")
Note the ~
in the name. This indicates we should match on slight misspellings of our search term. Read more about the query syntax for fuzzy matching here.
Fuzzy Matches - Exercise
1) Contains
Find Abigail Johnson and any companies she is connected to. Hint: The data might contain title prefixing names (Mr, Mrs, Ms, etc), so we’ll need to take that intro account.
2) Regular expression
We want to find all properties in London; however, we notice that the address
property on the Property
nodes has both "London" and "LONDON". Write a query using a regular expression to find all Property
nodes in London.
Fuzzy Matches - Answers
1) Contains
MATCH (p:Person)-[:HAS_CONTROL]->(c:Company)
WHERE p.name CONTAINS "Abigail Johnson"
RETURN p,c
2) Regular expression
MATCH (c:Company)
WHERE c.name =~ "(?i).*london.*"
RETURN c
Working With Numbers
Storing property values as numbers is useful for answering questions like:
Show me all political donations between 1,000 and 10,000 pounds
MATCH (c:Company)-[d:DONATED]->(r:Recipient)
WHERE 1000 < d.value < 10000
RETURN c,d,r
Note that here we are accessing a property on a relationship!
Show me all properties in London with a value over 10 million pounds that are owned by a company controlled by an American:
MATCH path=(prop:Property)<-[:OWNS]-(:Company)<-[:HAS_CONTROL]-(per:Person)
WHERE prop.price > 10000000 AND prop.address =~ "(?i).*London.*"
AND per.nationality = "American"
RETURN path
For a given individual, what is the total amount of political donations made by companies they control?
MATCH (p:Person {name: "Ms Abigail Johnson"})-[:HAS_CONTROL]->(c:Company)-[d:DONATED]->(:Recipient)
RETURN sum(d.value) AS totalDonations, p.name AS person, c.name AS company
Here we perform an aggregation, summing the value
property of all DONATED
relationships matched in our pattern. Read more about aggregation functions in Cypher here.
Working With Numbers - Exercise
-
Find the total value of campaign donations made by companies controlled by Michael Bloomberg.
-
What parties were those donations made to? How much in total to each party?
-
Of the companies connected to Bloomberg, which made the most campaign donations?
Working With Numbers - Answers
Find the total value of campaign donations made by companies controlled by Michael Bloomberg.
//Find the total value of campaign donations made by companies controlled by Michael Bloomberg.
// First be sure to find all Michael Bloombergs in the data
MATCH (p:Person)
WHERE p.name =~ "(?i).*Michael.*Bloomberg.*"
// Find all companies connected to Bloomberg and donations
MATCH (p)-[:HAS_CONTROL]->(c:Company)-[r:DONATED]->(party:Recipient)
// Aggregate the value property of all donations from these companies
RETURN sum(r.value) AS total
What parties were those donations made to? How much in total to each party?
MATCH (p:Person)
WHERE p.name =~ "(?i).*Michael.*Bloomberg.*"
MATCH (p)-[:HAS_CONTROL]->(c:Company)-[r:DONATED]->(party:Recipient)
// When we add party.name to the RETURN clause we group our sum aggregation by party.name
RETURN party.name, sum(r.value) AS total
ORDER BY total DESC
Of the companies connected to Bloomberg, which made the most campaign donations?
MATCH (p:Person)
WHERE p.name =~ "(?i).*Michael.*Bloomberg.*"
MATCH (p)-[:HAS_CONTROL]->(c:Company)-[r:DONATED]->(party:Recipient)
RETURN c.name, sum(r.value) AS total
ORDER BY total DESC
Dates
Dates are treated as a special type in Neo4j and have their own functions in Cypher.
For example, to construct a Date:
RETURN date("2019-03-06")
We can filter for events in a date range like this:
MATCH (c:Company)
WHERE date("2017-01-01") < c.incorporationDate < date("2017-01-15")
RETURN c
Dates - Answer
MATCH (c:Company)-[r:DONATED]->(party:Recipient)
WHERE Date("2016-01-01") < c.incorporationDate
RETURN c
Location Data
The source data has addresses. If we are able to convert these addresses to longitude/latitude, we could search for properties that are close together, within some range of a point, or polygon, or create interactive geographic data visualizations.
Fortunately, we can accomplish this with use of Neo4j’s geocoding procedures.
CALL apoc.spatial.geocodeOnce("6 Anchorage Terrace, Durham (DH1 3DL)") YIELD location, latitude, longitude
We can update the Property
nodes with a new property location
that is a Point type:
MATCH (p:Property) WITH p LIMIT 1
CALL apoc.spatial.geocodeOnce(p.address) YIELD location, latitude, longitude, description
SET p.location = Point({latitude: latitude, longitude: longitude})
RETURN p
Note: you’ll need to have write access to the database
Location Data - Search
Find properties in the dataset within 10km of Neo4j’s London office:
MATCH path=(p:Property)<-[:OWNS]-(:Company)<-[:HAS_CONTROL]-(:Person)
WHERE distance(p.location, Point({latitude:51.5122338, longitude:-0.1180369})) < 10000
RETURN path
Open Ended Exercise
Let’s say you are researching Stephen A. Schwarzman, the CEO of the Blackstone group. What can you find about him in the data?
Is this page helpful?