Exploring the U.S. National Bridge Inventory with Neo4j — Part 4


Jared Erondu on Unsplash

Exploring the U.S. National Bridge Inventory with Neo4j — Part 4: Cleaning up Counties, Places, Bridges

In part 3, we began building a hierarchical structure making connections from bridges to places to counties to states. We ended with more :County nodes in the graph than anticipated. In part 4, we will identify and cleanup “invalid” nodes. There is much more to this than we can cover in a series like this. We will focus on some of the basic assumption we made regarding the data itself.

“Invalid” Nodes
In the context of our exploration, the term “invalid” simply refers to nodes that are incorrect due to our interpretations or assumption about the data. It doesn’t mean the raw data itself is incorrect (that’s an entirely different discussion).

We made two assumptions when we first began.

  1. Counties never change. This isn’t something one generally considers. In fact, I only learned of this when trying to figure out what was wrong. This is beyond the scope of what we’ll be reviewing.
  2. Raw data is reliable. This one hits closer to home and is what most the readers will be more familiar with. Let’s focus here.

According to the encoding document the STATE_CODE_001, COUNTY_CODE_003, PLACE_CODE_004, and STRUCTURE_NUMBER_008 raw data fields should consist of two, three, five, and fifteen digits, respectively. Armed with that clarification, we will add :Invalid labels to the (:Bridge) , (:Place) , and (:County) nodes that violate these string restrictions.

// Set Invalid Bridges
MATCH (b:Bridge)
WHERE size(b.countyCode) <> 3
OR size(b.placeCode) <> 5
OR size(b.code) <> 15
SET b:Invalid;
// Set Invalid Places
MATCH (p:Place)
WHERE size(p.countyCode) <> 3
OR size(p.code) <> 5
SET p:Invalid;
// Set Invalid Counties
MATCH (c:County)
WHERE size(c.code) <> 3
SET c:Invalid;

Now we can query the path (:State)-[:HAS_COUNTY]->(:County)-[:HAS_PLACE]->(:Place)-[:HAS_BRIDGE]->(:Bridge) to get a breakdown and count of the nodes with the :Invalid label:

MATCH (s:State)-[:HAS_COUNTY]->(c:County)-[:HAS_PLACE]->(p:Place)-[:HAS_BRIDGE]->(b:Bridge)
WITH s, c, p, b
RETURN s.code AS sCode,
NOT c:Invalid AS validCounty,
NOT p:Invalid AS validPlace,
NOT b:Invalid AS validBridge,
count(*) AS count
ORDER BY sCode, validCounty, validPlace, validBridge
“Valid” node counts

Based on how we built the hierarchical tree structure starting with the (:Bridge) , if a (:County) is :Invalid , then every connected (:Place) and (:Bridge) are also :Invalid . Similarly, if a (:Place) is :Invalid , then every connected (:Bridge) are also :Invalid ; however, the connected (:County) may or may not be :Invalid . It is impossible to have an :Invalid parent node with a valid child node along the (:State)-[:HAS_COUNTY]->(:County)-[:HAS_PLACE]->(:Place)-[:HAS_BRIDGE]->(:Bridge) .

Reprocessing :Invalid Nodes
With the :Invalid label added to the graph we can reprocess the data with our improved logic.

  1. Add processing label to the (:Row) nodes connected to the (:Bridge:Invalid) nodes:
// Query `(:Bridge:Invalid)` and set a processing label on the connected `(:Row)`:
CALL apoc.periodic.iterate(

MATCH (b:Bridge:Invalid)
RETURN b
‘,’
MATCH (b)<-[:DATA_FOR]-(row)
SET row:ConnectRowToBridge
‘,
{batchSize:1000,parallel:true})

2. Reprocess (:Row:ConnectRowToBridge) :

//Reprocess (:Row:ConnectRowToBridge) nodes with updated logic: 
CALL apoc.periodic.iterate(

WITH 3 AS cCodeSize,
5 AS pCodeSize,
15 AS bCodeSize
MATCH (row:ConnectRowToBridge)
WITH row,
cCodeSize,
pCodeSize,
bCodeSize,
cCodeSize — size(row.COUNTY_CODE_003) AS cDiff,
pCodeSize — size(row.PLACE_CODE_004) AS pDiff,
bCodeSize — size(row.STRUCTURE_NUMBER_008) AS bDiff
RETURN row.STATE_CODE_001 AS sCode,
apoc.text.repeat(“0”,cDiff) + row.COUNTY_CODE_003 AS cCode,
apoc.text.repeat(“0”,pDiff) + row.PLACE_CODE_004 AS pCode,
apoc.text.repeat(“ “,bDiff) + row.STRUCTURE_NUMBER_008 AS bCode
‘,’
MERGE (bridge:Bridge {stateCode: sCode,
countyCode: cCode,
placeCode: pCode,
code: bCode})
ON CREATE SET bridge:ConnectToPlace
‘,
{batchSize:10000, parallel:false})

We pad the COUNTY_CODE_003 and PLACE_CODE_004 with 0’s, and the STRUCTURE_NUMBER_008 with spaces. This is just to address the string length requirement and does not address any other value validation step.

With the new (:Bridge) nodes created from the updated logic we can continue our processing by creating (:Bridge)<-[:DATA_FOR]-(:Row) :

CALL apoc.periodic.iterate(

WITH 3 AS cCodeSize,
5 AS pCodeSize,
15 AS bCodeSize
MATCH (row:ConnectRowToBridge)
WITH row,
cCodeSize,
pCodeSize,
bCodeSize,
cCodeSize — size(row.COUNTY_CODE_003) AS cDiff,
pCodeSize — size(row.PLACE_CODE_004) AS pDiff,
bCodeSize — size(row.STRUCTURE_NUMBER_008) AS bDiff
RETURN row,
row.STATE_CODE_001 AS sCode,
apoc.text.repeat(“0”,cDiff) + row.COUNTY_CODE_003 AS cCode,
apoc.text.repeat(“0”,pDiff) + row.PLACE_CODE_004 AS pCode,
apoc.text.repeat(“ “,bDiff) + row.STRUCTURE_NUMBER_008 AS bCode
‘,’
WITH row, sCode, cCode, pCode, bCode
MATCH (bridge:Bridge {stateCode: sCode,
countyCode: cCode,
placeCode: pCode,
code: bCode})
CREATE (row)-[:DATA_FOR]->(bridge)
WITH row
REMOVE row:ConnectRowToBridge
‘,
{batchSize:1000,parallel:false})

At this stage, we can rerun the previous queries building and connecting the appropriate (:Place) and (:County) nodes. Let’s note that we do not need to update the logic for those steps since we have rectified the errors and assumptions made at the (:Bridge) level.

Now we can delete all of the (:Invalid) nodes from the graph:

CALL apoc.periodic.iterate(

MATCH (inv:Invalid)
RETURN inv
‘,’
DETACH DELETE inv
‘,
{batchSize:1000,parallel:false})

Lessons Learned

  • Use your resources! Lucky for us we have the encoding document readily available for reference. That isn’t always the case. You will likely have to revisit logic the correct mistakes you have made. That’s one of the great things about graph databases and Neo4j, you can do that pretty darn easily.
  • Validation checks are nice. It isn’t always easy to do depending on the data you are working with, but finding a way to add validation check steps into your ETL process can help save you a lot of time before running long import queries. It can also be a good way to incorporate someone with better expertise on the project to help review data as well.
  • UI is great. Seeing data through a different lens can also help provide improved understanding to the data you are looking at. Being able to review the data in a way that relates to what is being represented by the data can help provide additional direction. This is something we’ll encounter in a later post.

We now have an improved (:State)-[:HAS_COUNTY]->(:County)-[:HAS_PLACE]->(:Place)-[:HAS_BRIDGE]->(:Bridge) hierarchical structure, but a bridge is a thing that exists in a location. In part 5 of the series we look at a critical piece of the bridge data, its longitude and latitude.


Exploring the U.S. Nation Bridge Inventory with Neo4j — Part 4 was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.