Knowledge Base

How do I compare two graphs for equality

If you are looking to compare 2 graphs (or sub-graphs) to determine if they are equivalent, the following Cypher will produce a md5sum of the nodes and properties to make that comparison. For example, you may wish to compare a test/QA instance with a production instance.

Neo4j 3.1 forward

MATCH (n:Movie)
WITH n
ORDER BY n.title
WITH collect(properties(n)) AS propresult
RETURN apoc.util.md5(propresult);

pre 3.1

MATCH (n:Movie)
WITH n
ORDER BY n.title
WITH collect(properties(n)) AS propresult
CALL apoc.util.md5(propresult) YIELD value AS md5_property
RETURN md5_property

and when run against the default Movie Graph which includes 38 nodes with a label of Movie, this returns:

md5_property
3f8d4737d078783e12f7cf57a207dd67

The above Cypher requires the installation of the apoc stored procedures set.

In the above example, we are examining all nodes with the label :Movie and producing a md5sum of all properties those nodes, using that sum to produce a md5sum hash.

To get correct results we need to order the nodes by a property value that is both defined for each node and unqiue. For this reason you might want to use a property that is defined as a property existence constraint and unique property constraint.

For example if the :Movie nodes had multiple nodes with the same title property, and since the Cypher above is ordering by n.title, then the results are passed to the md5 stored procedure in the order they are found. This is typically based upon the order the nodes were created. If you had two :Movie nodes with title='The Matrix' created with the following Cypher:

CREATE (n:Movie {title:'The Matrix', genre:'Sci-Fi'})
CREATE (n1:Movie {title:'The Matrix', genre:'Action'})

then simply running the Cypher to produce the md5 hash will produce a md5_property of:

md5_property
5bc18a680ef59ba09466da4217166d30

However, if you reversed the order of the CREATE statements, like this:

CREATE (n1:Movie {title:'The Matrix', genre:'Action'})
CREATE (n:Movie {title:'The Matrix', genre:'Sci-Fi'})

the result of the same md5 hashing Cypher will yield a different md5_property:

md5_property
c3c565b45457d2182731050e0cbab221

In the above example, so as to get the correct md5 values, regardless of the order of the creates, we need to run Cypher which will return data in a guaranteed order, using an ORDER BY clause:

MATCH (n:Movie)
WITH n
ORDER BY n.name, n.genre
WITH collect(properties(n)) AS propresult
CALL apoc.util.md5(propresult) YIELD value AS md5_property
RETURN md5_property

which will always return:

md5_property
c3c565b45457d2182731050e0cbab221
Additionally, we cannot simply collect(n) (i.e. the entire node) for internally it includes the internal node id (a unique internal identifier).

If you run the same Cypher on two separate environments and get the same md5 sums, the nodes can be proven to be the same in terms of definition of labels and properties.