Cypher is a declarative query language created for querying Neo4j‘s graph data.

In Cypher, as in SQL, queries are built up using various clauses. One such clause is the UNION clause that we will discuss in this post.

The UNION clause is used to combine the similarly structured results of multiple queries and return them as one output. By default it returns the distinct values but with an UNION ALL you get all of them.

We are often asked how to do UNION post processing – like sorting, paginating or post-filtering – with the UNION results. Currently in order to do sorting or pagination, we have to apply these to all the partial queries in turn as a pre-processing, which of course takes you only halfway there, but not to the actual post-processing.

In this post we will discuss how we can do UNION post processing by rewriting the query using the COLLECT function and the UNWIND clause.

We will consider the following data model:

Learn How to Rewrite a UNION Cypher Query Using COLLECT and UNWIND Clauses


Problem Description:


We would like to get name of Person who works in the department that is led by “Scott” or name of the Person who works at the “Operations” department.

One of the ways to write a Cypher query for this problem statement is to use the UNION clause. Consider this example below:

match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
where m.name = "Scott"
return n.name as name, r.hoursPerWeek as hourPerWeek
UNION
match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = "Operations"
return n.name as name, r.hoursPerWeek as hourPerWeek

Output of the query:

The Cypher Query Output for a UNION Query


Now, since I can’t do UNION post processing such as sorting the data by name, I have to apply the ORDER BY clause as pre-processing to both queries that are part of the UNION clause. My query would change as follows:

match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
where m.name = "Scott"
return n.name as name, r.hoursPerWeek as hourPerWeek
order by name
UNION
match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = "Operations"
return n.name as name, r.hoursPerWeek as hourPerWeek
order by name

The Cypher Query Output for a COLLECT Query


As you can see in the output, sorting is not done correctly. “Ashton” comes before “Amber” in the output. That is because it sorts the first query and then sorts the second query and combines the results.

How can we resolve this issue? By using COLLECT and UNWIND as a power-combo. COLLECT collects values into a list (a real list that you can run list operations on). UNWIND transforms the list back into individual rows.

First we turn the columns of a result into a map (struct, hash, dictionary), to retain its structure. For each partial query we use the COLLECT to aggregate these maps into a list, which also reduces our row count (cardinality) to one (1) for the following MATCH. Combining the lists is a simple list concatenation with the “+” operator.

Once we have the complete list, we use UNWIND to transform it back into rows of maps. After this, we use the WITH clause to deconstruct the maps into columns again and perform operations like sorting, pagination, filtering or any other aggregation or operation.

Here is the rewrite of the UNION query:

match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
where m.name = "Scott"

with collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as rows

match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = "Operations"

with rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows
UNWIND allRows as row
with row.name as name, row.hoursPerWeek as hoursPerWeek
return name, hoursPerWeek

The first MATCH:
match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
where m.name = "Scott"
with collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as rows

This query gets the Person’s data who works at department led by “Scott,” creates a map from the name and hoursPerWeek values and then collects them into a list called “rows.”

Then the second MATCH collects the information of the Person who works at “Operations” department and combines the list “rows” with the new list and adds them in “allRows.”

match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = "Operations"
with rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows

Now we UNWIND that large list as individual “row”s and return the data. (If we only want to return the data, we can combine the last WITH and RETURN into one.)

UNWIND allRows as row
with row.name as name, row.hoursPerWeek as hoursPerWeek
return name, hoursPerWeek

After the UNWIND and WITH clause we can also add all the UNION processing operations like sorting or any other aggregate functions.

So if we want to sort the data and aggregate the data for the rewritten query we can rewrite the same query but adding ORDER BY and an aggregation clause after the UNWIND and WITH as shown:

match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
where m.name = "Scott"
with collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as rows
match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = "Operations"
with rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows
UNWIND allRows as row
with row.name as name, row.hoursPerWeek as hoursPerWeek
return name, sum(hoursPerWeek) 
order by name

A short variant of this query would look like this:

match (n:Person {name:"Scott"})-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person) 
with collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as rows
match (n:Person)-[r:WORKS_AT]->(d:Dept {name:"Operations"})
WITH rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows
UNWIND allRows as row
RETURN row.name as name, sum(row.hoursPerWeek) as hoursPerWeek
order by name




Want to learn more about graph databases and Neo4j? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your mission-critical application today.

Download My Free Copy

 

Keywords:  


About the Author

Rohan Kharwar & Michael Hunger, Neo4j Team

Rohan Kharwar & Michael Hunger Image

Rohan Kharwar is a Field Engineering Consultant at Neo Technology. Rohan brings 15 years of customer technical engagement experience to the Field team with significant emphasis on database and data warehouse development, design and implementation.

Michael Hunger has been passionate about software development for a very long time. For the last few years he has been working with Neo Technology on the open source Neo4j graph database filling many roles. As caretaker of the Neo4j community and ecosystem he especially loves to work with graph-related projects, users and contributors. As a developer Michael enjoys many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects and contributing and writing software related books and articles.


10 Comments

Shatadru Saha says:

I believe the last bit of code is slightly wrong. The code under the title “A short variant of this query would look like this:” is as follows:
match (n:Person {name:”Scott”})-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person)

I believe the code should be as follow:
match (n:Person)-[r:WORKS_AT]->(d:Dept)-[:LEAD_BY]->(m:Person{name:”Scott”})

This is because we want to get the name of people who are lead by “Scott”.

Liliana says:

That’s a very interesting way of approaching this problem. Can you comment on any negative side effects, like is there a potential performance hit, or higher memory consumption or anything like that? Of course readability is a bit better with union, but other than that are there any other things to be careful about when using the collect+ unwind?

Graham says:

Is there any problem if the first collection row has no content? I have simulated this code on 2.0 but seem to get a no results if no results are returned in rows.

I am going back to UNION for now

jarellm says:

@Graham try OPTIONAL MATCH

prasanna says:

Awesome way to handle this scenario. Thank you for the blog post.

Somesh says:

There is a problem here. If the result of the second query doesn’t return any data

match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = “Operations”

with rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows

this part is not executed and as a result allRows contains nothing. Hence you only get the results if the second query matches
We used optional match to solve this

OPTIONAL match (n:Person)-[r:WORKS_AT]->(d:Dept)
where d.name = “Operations”

with rows + collect({name:n.name, hoursPerWeek: r.hoursPerWeek}) as allRows

This ensures the second with statement is executed no matter what.

me says:

Nice, I just applied this method and it works great, thanks for the post!

Arushi Jain says:

If first query returns no result, second query is not executed at all. Is there a way to solve this?

Vishal says:

MATCH (user:User{id:{userId}})-[r:PUBLISHED{type:{type}}]->(content:Content)

WITH COLLECT({user:user, r:r, content:content}) as rows
OPTIONAL MATCH (:User{id:{userId}})(content:Content)

WITH rows + COLLECT({user:user, r:r, content:content}) as allRows
UNWIND allRows as row
WITH row.user as user, row.r as r, row.content as content
RETURN user, r, content
ORDER BY content.publishedOn DESC”;
What is wrong in this query?
thanks in advance

Vishal says:

correction in line number 3
OPTIONAL MATCH (:User{id:{userId}})(content:Content)

Leave a Reply

Your email address will not be published. Required fields are marked *

Subscribe

Upcoming Event

 

Neo4j GraphTour Register Now

From the CEO

Emil's Blog


Have a Graph Question?

Stackoverflow
Slack
Contact Us

Share your Graph Story?

Email us: content@neotechnology.com


Popular Graph Topics