Knowledge Base

Understanding how MERGE works

What is MERGE, and how does it work?

The MERGE clause ensures that a pattern exists in the graph. Either the entire pattern already exists, or the entire pattern needs to be created.

In this way, it’s helpful to think of MERGE as attempting a MATCH on the pattern, and if no match is found, a CREATE of the pattern.

When the specified pattern is not present and needs to be created, any variables previously bound to existing graph elements will be reused in the pattern. All other elements of the pattern will be created.

It’s important to know which pattern elements will use existing graph elements and which will be created instead.

For the following examples, we’ll use a very simplistic graph of :Student, :Class, :ReportCard, and :Term nodes as follows.

(:Student)-[:ENROLLED_IN]->(:Class)
(:Student)-[:EARNED]->(:ReportCard)
(:Class)-[:FOR_TERM]->(:Term)

There will be references to "bound variables" in this article, and this refers to variables that have been bound to existing elements in the graph from earlier clauses (usually MATCH or MERGE) that are being reused in a MERGE pattern. This means that in the MERGE pattern, it refers to an element that was previously found and currently exists in the graph.

In contrast, a "new variable" in a pattern refers to a variable that is being introduced for the first time in the pattern. As such, it does not yet refer to any elements currently in the graph, but as a result of the MERGE (matching to existing elements or creating new ones) it will become bound to graph elements.

Any element in the pattern that isn’t associated with a bound variable (and this includes both elements using new variables, as well as elements without variables at all) will result in new created elements in the graph if the MERGE has to create the entire pattern.

A MERGE without bound variables can create duplicate elements

The most common MERGE mistake is attempting to MERGE a pattern with no bound variables when you want to use existing graph elements.

For example, attempting to enroll an existing student in an existing class.

MERGE (student:Student{id:123})-[:ENROLLED_IN]->(class:Class{name:'Cypher101'})

In the above query, student and class are new variables, they haven’t been previously bound to any nodes in the graph, this is their first use in the query.

If the entire pattern already exists (the given student is already enrolled in the given class), the variables will be bound to the existing nodes in the graph as expected.

However, if the pattern doesn’t already exist, all new elements of the pattern will be created. In this case, all of them; a new :Student node will be created with the given id, and a new :Class node will be created with the given name, and a new :ENROLLED_IN relationship will be created between these brand new nodes.

This may result in the creation of duplicate nodes, if such a student already exists, or such a class already exists. If there is a unique constraint for the student or class nodes for the given properties, then an error will be thrown. Otherwise, the duplicate nodes will be created, which might escape notice, especially for novice users.

A MERGE with bound variables reuses existing graph elements

To use the existing nodes and relationships in the graph, MATCH or MERGE on the nodes or relationships first, and then MERGE in the pattern using the bound variables.

A correct version of the enrollment query from above will MATCH on the student and class first, and then MERGE the relationship.

MATCH (student:Student{id:123})
MATCH (class:Class{name:'Cypher101'})
MERGE (student)-[:ENROLLED_IN]->(class)

Similarly, you could MERGE on the student and class prior to the MERGE on the relationship between them.

MERGE (student:Student{id:123})
MERGE (class:Class{name:'Cypher101'})
MERGE (student)-[:ENROLLED_IN]->(class)

This ensures that the student and class nodes exist (they will be created if they don’t exist already), and then the relationship is merged between them.

MERGE using combinations of bound and new elements for different use cases

While the above approach is correct for that particular use case, it’s not the right approach for all use cases. We may need to use combinations of bound and new elements in the MERGE for correct behavior.

Consider a query creating report cards for students.

If we reused the above approach, the query may look like this.

MATCH (student:Student{id:123})
MERGE (reportCard:ReportCard{term:'Spring2017'})
MERGE (student)-[:EARNED]->(reportCard)

The problem in this query is that the same :ReportCard node is being reused for all students. If the query also needed to add grades to the :ReportCard, each subsequent entry would overwrite what was added before. If not caught, this approach would end up with all students having the exact same report card node, and thus the exact same grades, the grades entered by the last student processed.

What we really need is a separate :ReportCard per student. We can achieve this by binding the :Student node, but not the :ReportCard node.

MATCH (student:Student{id:123})
MERGE (student)-[:EARNED]->(reportCard:ReportCard{term:'Spring2017'})

Since the student variable is bound to a node, that node will be used if the pattern needs to be created, and the :ReportCard will be created just for this student, not shared among all students.

Note that we get the exact same behavior if we omit the reportCard variable entirely, since an element without a variable and an element with a newly-introduced variable are handled the exact same way, when the pattern needs to be created:

MATCH (student:Student{id:123})
MERGE (student)-[:EARNED]->(:ReportCard{term:'Spring2017'})

Remember new relationships will be created too

The above examples should be easy to understand for nodes, but remember they apply to relationships too. New relationships in a pattern will be created if the entire pattern doesn’t exist. It’s easiest to see this in action when we use a larger pattern.

Consider if we needed to enroll a :Student in a :Class for a specific :Term. Let us assume the student, term, and class exist in the graph already.

MATCH (student:Student{id:123})
MATCH (spring:Term{name:'Spring2017'})
MATCH (class:Class{name:'Cypher101'})
MERGE (student)-[:ENROLLED_IN]->(class)-[:FOR_TERM]->(spring)

This may look correct, and may behave correctly when neither of those relationships exist ahead of time, but if there is any imbalance here where only half of the pattern exists, then this will end up creating duplicate elements. This would be most apparent if the query was run for multiple students, instead of just one.

On the first run, for the first student, the relationships would be created as expected.

On the next run, for the next student, there is a :FOR_TERM relationship between the class and the term, but the student isn’t enrolled in the class. Since the entire pattern doesn’t exist, the entire pattern will be created (excluding bound nodes), so the student will be enrolled in the class, and new (duplicate) :FOR_TERM relationship will be created between the class and the term.

If the query was run 30 times to enroll 30 students for the same class and term, there would be 30 :FOR_TERM relationships between the class and the term by the time we finished.

To fix this, if it’s known that the :FOR_TERM relationship already exists between each :Class and its :Term, then move that part of the pattern into the MATCH.

MATCH (student:Student{id:123})
MATCH (class:Class{name:'Cypher101'})-[:FOR_TERM]->(spring:Term{name:'Spring2017'})
MERGE (student)-[:ENROLLED_IN]->(class)

If we don’t know if the :FOR_TERM relationship exists already, then we would have to break this down further, MATCHing on the student, class, and term, then doing the MERGE for :FOR_TERM, then the MERGE for :ENROLLED_IN.

The takeaway here is that the MERGE of longer patterns should generally be avoided. If parts of the pattern exist but not the entire pattern, you will likely see duplicates, so consider breaking down the MERGE of larger patterns into separate MERGEs on smaller patterns.

Use ON MERGE and ON CREATE after MERGE to SET properties according to MERGE behavior

Often after a MERGE we need to SET properties on elements of the pattern, but we may want to conditionally set these properties depending on if MERGE matched to an existing pattern, or had to create it. For example, if we have default property values we want to set on creation.

The ON MERGE and ON CREATE clauses give us the control we need. This also makes it possible to re-run queries and ensure we aren’t overriding existing properties with defaults.

Here’s an example when setting student report cards and grades, assuming {grades} is a map parameter of grades we want to set when creating a new :ReportCard node.

MATCH (student:Student{id:123})
MERGE (student)-[:EARNED]->(reportCard:ReportCard{term:'Spring2017'})
ON CREATE SET reportCard += {grades}

MERGE acquires locks on nodes and relationships in the pattern when resulting in pattern creation

When MERGE fails to find an existing pattern, it will acquire locks on all bound nodes and relationships in the pattern before creating the missing elements of the pattern.

This is to ensure a MERGE or CREATE can’t concurrently create the pattern (or alter the properties of an existing pattern to make it identical to the desired pattern) while the MERGE is executing, which would cause pattern duplication.

After locking on bound elements, MERGE performs another MATCH on the pattern to avoid race conditions where the pattern might get created in the time gap between when MERGE determined the pattern doesn’t exist, and when locks were acquired.

MERGE on a single node pattern may create duplicates unless there is a unique constraint

When performing MERGE on a single-node pattern when the node does not yet exist (and there is no unique constraint), there is nothing to lock on to avoid race conditions, so concurrent transactions may result in duplicate nodes.

For example:

MERGE (student:Student{id:123})

If there is no unique constraint, then concurrent executions of this query (or any other query that may be doing a MERGE or CREATE involving a :Student node with this id) may result in multiple nodes for the same student.

By adding a unique constraint on :Student(id), schema locks will ensure that the student nodes are unique, and no duplicates will occur.

MERGE on a larger pattern not using bound nodes may also create duplicates

In the above case of a single node (when the node doesn’t yet exist), we don’t have any bound nodes to use for locking. Similarly, if the MERGE is on a pattern larger than a single node, where the entire pattern doesn’t exist, and no bound nodes are used in the pattern, there is nothing to lock on to avoid race conditions, so it faces the same risk of duplication with concurrent writes.

MERGE (class:Class{name:'Cypher101'})-[:FOR_TERM]->(spring:Term{name:'Spring2017'})

Note for Neo4j < 3.0.9 for 3.0.x versions, and < 3.1.2 for 3.1.x versions

A bug in the cost planner for affected versions prevented double-checked locking on MERGE. This allowed race conditions which could result in duplicate patterns being created by concurrent MERGE operations, or other write operation which caused the previously non-existent pattern to exist.

We recommend upgrading to a version which includes this bug fix.