Run Cypher to Analyze Neo4j Graph Database Inconsistencies


Learn how to run Cypher to analyze your data inconsistencies in Neo4j.


Have you ever wanted to check for data inconsistencies in your Neo4j graph database?

Perhaps you are merging data from different sources, or over the course of a project have changed your data model without reloading data. Or alternatively, you are asked to review someone else’s Neo4j graph database and look for issues. This blog post will look into one approach to check for data inconsistencies.

Neo4j is very flexible and forgiving when it comes to entering data. The property graph model Neo4j uses enables nodes with the same label to have different sets of properties. Let me use a short example to further explain.

Let’s say you were given a project to document the history of famous actors and actresses. You immediately think to yourself – I can just run :play movies in my Neo4j Browser and load the Cypher to get a quick start.

Here is a snippet from the movies Cypher:

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
  (Keanu)-[:ACTED_IN {roles:['Neo']}]-> (TheMatrix),
  (Carrie)-[:ACTED_IN {roles:['Trinity']}]-> (TheMatrix),
  (Laurence)-[:ACTED_IN {roles:['Morpheus']}]-> (TheMatrix),
  (Hugo)-[:ACTED_IN {roles:['Agent Smith']}]-> (TheMatrix),
  (LillyW)-[:DIRECTED]-> (TheMatrix),
  (LanaW)-[:DIRECTED]-> (TheMatrix),
  (JoelS)-[:PRODUCED]-> (TheMatrix)

You can see from the statements that the current data model is this:



Your data model has two Node Labels: Person and Movie. There are two relationships between Person and Movie, DIRECTED and ACTED_IN, but for the purpose of this blog post we will only focus on Node Labels.

For the Person node label, there are two properties: born and name. For the Movie node label there are three properties: released,tagline,title.

Now your team lead comes to you and says that people are also interested when actors and actresses got their first starring role. Also, the client has requested to know if a given Person has won an Oscar. And finally, you are told to drop the tagline from Movie.

To implement these requests, you modify your Cypher for new films:

CREATE (ToyStory4:Movie {title:'Toy Story 4', released:2019})
MERGE (Keanu:Person {name:'Keanu Reeves', born:1964})
SET Keanu.wonOscar = false, Keanu.filmDebut = 1985
MERGE (TomH:Person {name:'Tom Hanks', born:1956}) 
SET TomH.wonOscar = true, TomH.filmDebut = 1980
MERGE (TimA:Person {name:'Tim Allen', born:1953})
SET TimA.wonOscar = false, TimA.filmDebut = '1988 maybe?' 
MERGE (AnnieP:Person {name:'Annie Potts', born:1952})
SET AnnieP.wonOscar = false, AnnieP.filmDebut = 1978
CREATE
  (Keanu)-[:ACTED_IN {roles:['Duke Caboom (voice)']}]-> (ToyStory4),
  (TomH)-[:ACTED_IN {roles:['Woody (voice)']}]-> (ToyStory4),
  (TimA)-[:ACTED_IN {roles:['Buzz Lightyear (voice)']}]-> (ToyStory4),
  (AnnieP)-[:ACTED_IN {roles:['Bo Peep (voice)']}]-> (ToyStory4)

In the new Cypher, the ‘tagline’ on Movie has been dropped and two new properties on Person have been added: wonOscar, filmDebut.

Additionally, note that we have used MERGE here instead of CREATE to match and update existing data instead of creating duplicate nodes.

Our new model now looks like this:



By comparing pictures of the old and new data models you can see the differences in the data model. But what you cannot see is what data loaded in your database conforms to which model, and how many nodes may need to be updated.

Cypher to Look for Property Key Variations


We can actually write Cypher to examine our data for inconsistencies.

Here is one query that will look through nodes by Node Label in a Neo4j graph database and print out Node Labels having different property key sets.

/* Looks for Node Labels that have different sets of property keys */
WITH "MATCH (n:`$nodeLabel`)
WITH n LIMIT 10000
UNWIND keys(n) as key
WITH n, key
ORDER BY key
WITH n, collect(key) as keys
RETURN '$nodeLabel' as nodeLabel, count(n) as nodeCount, length(keys) as keyLen, keys
ORDER BY keyLen" as cypherQuery
CALL db.labels() YIELD label AS nodeLabel
WITH replace(cypherQuery, '$nodeLabel', nodeLabel) as newCypherQuery
CALL apoc.cypher.run(newCypherQuery, {}) YIELD value
WITH value.nodeLabel as nodeLabel, collect({
 nodeCount: value.nodeCount,
 keyLen: value.keyLen,
 keys: value.keys}) as nodeInfoList
WHERE size(nodeInfoList) > 1
UNWIND nodeInfoList as nodeInfo
RETURN nodeLabel, nodeInfo.nodeCount as nodeCount, nodeInfo.keyLen as keyLen, nodeInfo.keys as keys
ORDER BY nodeLabel, keyLen

If you had loaded the Movie data from the Neo4j sample dataset in :play movies, then ran the ToyStory4 Cypher statement, running the Cypher statement above will produce the following result:



You will see that for both Movie and Person nodes, the property keys are not the same for all instances. It is up to you to determine whether the difference in property keys is okay or not.

Sometimes it means that some nodes just didn’t have data for that property so the property wasn’t created. This is usually okay. Other times, it might mean old data is loaded that does not conform to newer changes in the data model. Running this script will at least indicate to you that there may be a problem.

Building the Query


The Cypher statement used is fairly complex. The best way to explain how it works is to recreate the Cypher statement, one step at a time, so you can see what is going on.

To start, let’s focus only on nodes with the Movie label.

We want to look in the database at all Movie nodes, and list the property keys that they have:

MATCH (n:Movie) RETURN keys(n)

This will return something like this:



We use the keys() function to examine the property keys for each node. Notice that some nodes have different sets of keys, but the property names are unordered within each set. We first need to order the keys to ensure that [title,tagline,released] is the same as [released,tagline,title]:

MATCH (n:Movie)
UNWIND keys(n) as key
RETURN key
ORDER BY key

We use UNWIND keys(n) to take each member of the set and return it as its own row. We use ORDER BY to order the keys alphabetically. A snippet of the output is below. Examining the output shows us we must figure out a way to eliminate the duplicate key names.



To eliminate the duplicate keys, we use a couple of WITH statements. The first WITH enables us to perform an ORDER BY in the middle of the statement, so we can pass an ordered set of keys to the second WITH. The second WITH uses collect(key) to collect the keys back into a list.

MATCH (n:Movie)
UNWIND keys(n) as key
WITH n, key	// need this to run ORDER BY
ORDER BY key
WITH n, collect(key) as keys
RETURN keys

Note that we must include WITH n in each of the WITH statements. In Cypher, WITH is used to perform intermediate results processing, and when used, it creates a new variable scope. Any variables you wish to preserve must be declared in the WITH so later statements can use them.

The first WITH n is just used to pass n through to the next part, to ensure the variable is still in scope.

The second WITH n is used as grouping criteria, so for each node n the keys are collected back into a list. This results in a single row for n, you will get a single row with all of the keys on one big list.

Some results from running this new statement are:



Now we can see that the keys are in order and we are getting a set of keys for every Movie node. The last thing we must do, is count up how many nodes there are for each unique key set.

The following query does this in the RETURN clause. We implicitly group on keys and keyLen (since this is just a metric derived from keys), and use count(n) to count the nodes for each unique key set.

MATCH (n:Movie)
UNWIND keys(n) as key
WITH n, key
ORDER BY key
WITH n, collect(key) as keys
RETURN count(n) as nodeCount, length(keys) as keyLen, keys
ORDER BY keyLen

Finally, we have our results and can see two nodes with [‘released’,’title’] and 37 nodes with [‘released’,’tagline’,’title’].



Running the Query for All Node Labels


Now we have a query that works for a single Node Label Movie. But the goal is to have it work for all Node Labels present in a database, and to do it without knowing what Node Labels are already present.

The first step towards this goal is to run:

CALL db.labels() YIELD label AS nodeLabel
RETURN nodeLabel

This will list every Node Label in the database.



The next step is to use the Node Labels to create individual Cypher statements for each Node Label. We can use the nodeLabel returned from db.labels() to create a Node Label specific Cypher statement.

WITH "some cypher statement" as cypherQuery
CALL db.labels() YIELD label AS nodeLabel
RETURN cypherQuery + " for " + nodeLabel

Now we must substitute in our Movie Cypher for “some cypher statement” and make the following changes:

    • Replace Movie to $nodeLabel
    • Add $nodeLabel as nodeLabel to RETURN
    • Add WITH replace(cypherQuery, '$nodeLabel', nodeLabel) as newCypherQuery after we call db.labels()
        • Making these substitutions produces the query below.

          Running the query produces individual Cypher statements for each NodeLabel. Note that even though I am using $nodeLabel, this is not an actual parameterized Cypher call, because currently you can not parameterize Node Labels.

          Instead, $nodeLabel is being used as a placeholder for a string substitution. Calling replace(…) changes our $nodeLabel placeholder with the actual nodeLabel value returned from db.labels().

          WITH "MATCH (n:`$nodeLabel`)
          UNWIND keys(n) as key
          WITH n, key
          ORDER BY key
          WITH n, collect(key) as keys
          RETURN '$nodeLabel' as nodeLabel, count(n) as nodeCount, length(keys) as keyLen, keys
          ORDER BY keyLen" as cypherQuery
          CALL db.labels() YIELD label AS nodeLabel
          WITH replace(cypherQuery, '$nodeLabel', nodeLabel) as newCypherQuery
          RETURN newCypherQuery
          

          Running this produces the following result:



          Now that we have a Cypher query for each Node Label, we can use apoc.cypher.run() to run each query. This will require APOC to be installed in your database. Please read these instructions to install APOC if it is not already configured to run in your database.

          CALL apoc.cypher.run(newCypherQuery, {}) YIELD value
          

          The value that is returned by apoc.cypher.run contains the results of the query that was executed.

          For each row returned in your Cypher query, value will produce a map where the map keys are the return variables names, and the map values are the return values. Here is an example return:

          {
            "nodeCount": 2,
            "keyLen": 2,
            "nodeLabel": "Movie",
            "keys": [
              "released",
              "title"
            ]
          }
          

          To complete our query, we must process these results to determine which Node Labels may have different property keys.

          First, we use nodeLabel as our grouping key and collect() up the other returned values. By using collect() on nodeLabel, we end up returning a single row for each Cypher query at this point. The nodeInfoList variable contains all of the other rows returned from the Cypher query.

          WITH value.nodeLabel as nodeLabel, collect({
           nodeCount: value.nodeCount,
           keyLen: value.keyLen,
           keys: value.keys}) as nodeInfoList
          WHERE size(nodeInfoList) > 1
          

          Next, we use size (nodeInfoList) > 1 in the WHERE clause to check and see if there is more than 1 row from each Cypher query. If there is only 1 row, we don’t want to return. A single row means for every node with that node label, all nodes have the same set of property keys. This indicates good data, and we only want to return node labels where there are different sets of property keys.

          The very last piece of the query converts the nodeInfoList collection back into individual rows using UNWIND. We also ORDER BY nodeLabel, and keyLen to sort alphabetically on node labels and to show smaller key sets first.

          UNWIND nodeInfoList as nodeInfo
          RETURN nodeLabel, nodeInfo.nodeCount as nodeCount, nodeInfo.keyLen as keyLen, nodeInfo.keys as keys
          ORDER BY nodeLabel, keyLen
          

          Running the completed query, produces these results (as shown previously):



          The very last piece I put in the overall query is a small addition to our embedded Cypher query. On the second line I added this:

          WITH n LIMIT 10000
          

          This provides a safeguard when running this query against large databases.

          For a given Node Label, it will only look at the first 10,000 rows. Without this safeguard, for large databases it could easily run out of memory. Feel free to adjust this limit, add a SKIP, or try some different sampling techniques if you don’t want to look at only the first 10,000 rows.

          Conclusion


          We were able to run a Cypher query to check for data inconsistencies in the Neo4j database. We looked only at differences in property keys for the same Node Label. Depending on your specific data and data model, this may or may not be okay. You will have to make that determination based on your project requirements.

          You can create a few other variants of this query: one variant can also pull back the specific rows that may have inconsistent data, and another can check for inconsistent data values within a specific property key. Try to see if you can produce the variants on your own using the power of Neo4j stored procedures, plus the execution of dynamically generated Cypher queries. These variants use the exact same techniques described in the blog post with only a little extra logic.

          In future blog posts, I’ll be visiting other queries used to check different data health aspects of your Neo4j database.


          Think you have what it takes to be Neo4j certified?
          Show off your graph database skills to the community and employers with the official Neo4j Certification. Click below to get started and you could be done in less than an hour.


          Get Certified