GraphGists

Working with Hierarchical Trees in Neo4j

hospitalmeta

Introduction

My name is Tomaz Bratanic. I want to demonstrate how you should approach hierarchical location trees in Neo4j. From what I have learned during importing/querying with them, I came up with a few ground rules one should follow to in order to get the correct query results.

Rules of location tree:

  • All relationships are directed from children to parents, going up the hiearchy.

  • We have a single type for all relationships. (PARENT;FROM;IS_IN)

  • Every node has a single outgoing relationship to its parent.

  • Every node can have one or multiple incoming relationships from its children.

Import

Let’s load some data into our graph to explore.

Add constraints and indexes

First, we need to add indexes and constraints, as they will optimize our queries. The first array in the procedure below sets the indexes, and the second array contains the unique constraints. You will need to have the APOC library installed.

CREATE INDEX ON :County(name);
CREATE INDEX ON :City(name);
CREATE INDEX ON :ZipCode(name);
CREATE INDEX ON :Address(name);

CREATE CONSTRAINT ON (h:Hospital) ASSERT h.id IS UNIQUE;
CREATE CONSTRAINT ON (s:State) ASSERT s.name IS UNIQUE;

Location hierarchical tree import

You can notice that we do not take the standard approach, where we merge each node separately, but we merge them in pattern with their parent in a hierarchical tree because some counties/cities/addresses share the same name.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
// state name is unique
MERGE (state:State{name:row.State})
// merge by pattern with their parents
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})
// for entities, it is best to have an id system
MERGE (h:Hospital{id:row.`Provider ID`})
MERGE (h)-[:IS_IN]->(address)

Additional hospital information

We will also import some additional information about the hospitals such as their ratings, ownership, and more.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (h:Hospital{id:row.`Provider ID`})
SET h.phone=row.`Phone Number`,
    h.emergency_services = row.`Emergency Services`,
    h.name= row.`Hospital Name`,
    h.mortality = row.`Mortality national comparison`,
    h.safety = row.`Safety of care national comparison`,
    h.timeliness = row.`Timeliness of care national comparison`,
    h.experience = row.`Patient experience national comparison`,
    h.effectiveness = row.`Effectiveness of care national comparison`
MERGE (type:HospitalType{name:row.`Hospital Type`})
MERGE (h)-[:HAS_TYPE]->(type)
MERGE (ownership:Ownership{name: row.`Hospital Ownership`})
MERGE (h)-[:HAS_OWNERSHIP]->(ownership)
MERGE (rating:Rating{name:row.`Hospital overall rating`})
MERGE (h)-[:HAS_RATING]->(rating)

Geospatial import

The last thing to import is the geospatial information of hospitals.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/gpsinfo.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (hospital:Hospital {id:row.id})
SET hospital.latitude = toFloat(row.latitude),
    hospital.longitude = toFloat(row.longitude)

Spatial query example

Let’s say you get lost on Liberty Island and want to find the nearest 10 hospitals. Distance is in meters. Note: does not work in Neo4j Sandbox.

WITH "Liberty Island, Manhattan" as myLocation
CALL apoc.spatial.geocodeOnce(myLocation) YIELD location
WITH point({longitude: location.longitude, latitude: location.latitude}) as myPosition,100 as distanceInKm
MATCH (h:Hospital)-->(rating:Rating)
WHERE exists(h.latitude) and
  distance(myPosition, point({longitude:h.longitude,latitude:h.latitude})) < (distanceInKm * 100)
RETURN h.name as hospital,rating.name as rating,distance(myPosition,
       point({longitude:h.longitude,latitude:h.latitude})) as distance
ORDER BY distance LIMIT 10

Data Validation

Validation #1

We can check if any :Address has more than one relationship going up the hierarchy. Every node has a single outgoing relationship to its parent rule.

MATCH (a:Address)
WHERE size((a)-[:IS_IN]->()) > 1
RETURN a

Validation #2

We can also check the length of all the paths in location tree. Because of our rules we placed, every hospital must have exactly one location path because every hospital have exactly one address.

MATCH path=(h:Hospital)-[:IS_IN*..10]->(location)
WHERE NOT (location)-[:IS_IN]->()
RETURN distinct(length(path)) as length,
       count(*) as numberOfPaths,
       count(distinct(h)) as numberOfHospitals

Validation #3

Check how many labels each node has. This is useful when learning. You do not wish to have nodes without labels.

MATCH (n)
RETURN size(labels(n)) as size,count(*) as count

Queries

Lets run a few queries and learn about our data.

Average rating by ownership

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:HAS_OWNERSHIP]->(o)
RETURN o.name as ownership,avg(toInteger(r.name)) as averageRating
ORDER BY averageRating DESC LIMIT 15

Number of hospitals per city

MATCH (h:Hospital)-[:IS_IN*3..3]->(city)
RETURN city.name as city,count(h) as NumberOfHospitals
ORDER BY NumberOfHospitals DESC LIMIT 15

Top 10 states by rating

MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:IS_IN*5..5]->(state)
WHERE NOT r.name="Not Available"
RETURN state.name as state,avg(toInteger(r.name)) as averageRating,count(h) as numberOfHospitals
ORDER BY averageRating DESC LIMIT 15

Which states have the most above-average hospitals in effectivness

MATCH (h:Hospital)-[:IS_IN*5..5]->(state)
WHERE h.effectiveness = "Above the National average"
RETURN state.name as state,h.effectiveness,count(h) as numberOfHospitals
ORDER BY numberOfHospitals DESC LIMIT 15

Which states have the most below-average hospitals in mortality

MATCH (h:Hospital)-[:IS_IN*5..5]->(state)
WHERE h.mortality = "Below the National average"
RETURN state.name as state,h.mortality,count(h) as numberOfHospitals
ORDER BY numberOfHospitals DESC LIMIT 15