Comparing Cypher with SQL

Introduction

This guide shows anyone familiar with SQL how to understand Cypher® and write the equivalent Cypher queries. The Northwind database, used for a long time to demonstrate relational database concepts, helps to illustrate how to translate SQL queries into Cypher, showcasing the similarities and differences between the two languages.

Before diving into the details, it is essential to understand that while both SQL and Cypher are declarative query languages, Cypher is specifically designed for querying graphs. Like in SQL, important concepts of Cypher include clauses, keywords, expressions, operators, and functions. Unlike SQL, which deals with relational databases, Cypher is centered around expressing graph patterns. Graph pattern matching is the core technique in Cypher, enabling creation, navigation, description, and data extraction from a graph by applying declarative patterns.

Cypher allows you to chain multiple query parts together, enabling the composition of complex queries. You can mix read and write operations within a single statement. The read operation can provide inputs for the subsequent write operation or, following after the write operation, return relevant subsets to the caller.

You can use Cypher to update the graph structure and data and even to ingest large amounts of CSV data.

With user-defined procedures and functions, you can extend the language with functionality that you need but is not yet available.

For more information about the Cypher query language, refer to the Neo4j Cypher Manual and the Cypher Cheat Sheet.

Northwind example model

The Northwind database represents the data storage of a retail application. It contains customers, products, orders, employees, shippers, and categories as well as their interactions.

Relational databases store data in tables with a fixed structure (schema), each column having a name, type, length, constraints, etc. References between tables are represented by repeating the primary key of one table as a column in another as a foreign key. For many-to-many references, JOIN-tables (or link-tables) are needed as an artificial construct between the connected tables.

Graph databases store data in a graph with nodes and relationships between them (refer to the property graph model for more information).

For a sensible relational model, the transformation into a graph is not hard, as rows of entity tables are converted into nodes and foreign-key relationships and JOIN-tables into relationships. Nodes and relationships can hold arbitrary properties (key-value pairs) without the need for a fixed schema.

Refer to the relational and graph models below when considering the data structures in the following queries.

Northwind diagram
Figure 1. Relational model
northwind graph workspace arr
Figure 2. Graph model

It is recommended to familiarize yourself with the guides on data modeling and data importing into Neo4j.

Explore Cypher queries: from basic to advanced

This guide intends to introduce Cypher queries by comparing them with the equivalent SQL statements.

You can run all the Cypher queries below using an AuraDB Free instance and the Neo4j Workspace guide Query fundamentals.

  1. Load the Northwind dataset by following instructions in the guide.

    northwind dataset load
    Figure 3. Guide Query fundamentals in Neo4j Workspace
  2. After importing the data, run the command call db.schema visualization in Query to check the imported data vs the model. Your output should be the following:

    northwind schema workspace
    Figure 4. Model of the Nortwind dataset in Workspace Query

Graph data modeling is a flexible task. You can create your own model of the Northwind dataset. Use these .zip files and follow instructions in the tutorial on how to import data from a relational database into a graph database.

The Cypher queries below end with the semicolon ;, which is a requirement in Cypher Shell — the command-line tool that comes with the Neo4j distribution. In Neo4j Browser or Query in Workspace, you don’t have to end Cypher queries with semicolon.

Find all products

Select and return records

To select and return records in SQL, select everything from the products table.

SELECT p.*
FROM products as p;

Similarly in Cypher, you MATCH a simple pattern: all nodes with the label :Product and RETURN them.

MATCH (p:Product)
RETURN p;

Field access, ordering, and paging

It is more efficient to return only a subset of attributes, like ProductName and UnitPrice. You can also order by price and only return the 10 most expensive items.

SELECT p.ProductName, p.UnitPrice
FROM products as p
ORDER BY p.UnitPrice DESC
LIMIT 10;

Since the statements are similar except for the pattern matching part, you can copy and paste the changes from SQL to Cypher. But remember that labels, relationship types, and property names are case sensitive in Neo4j. For more details on naming rules, see the Cypher Manual → Naming rules and recommendations.

MATCH (p:Product)
RETURN p.productName, p.unitPrice
ORDER BY p.unitPrice DESC
LIMIT 10;

Find a single product by name

Filter by equality

If you only want to look at a single product, for example, Chocolade, you can filter data in SQL using the WHERE clause.

SELECT p.ProductName, p.UnitPrice
FROM products AS p
WHERE p.ProductName = 'Chocolade';

In Cypher, the WHERE clause belongs to the MATCH statement.

MATCH (p:Product)
WHERE p.productName = 'Chocolade'
RETURN p.productName, p.unitPrice;

There is a shortcut in Cypher if you MATCH for a labeled node with a certain property attribute.

MATCH (p:Product {productName:'Chocolade'})
RETURN p.productName, p.unitPrice;

Indexing

If you want to match quickly by a specific node label and attribute combination, it makes sense to create an index for that pair, if you haven’t already done that during the import. For more details on how to create indexes, visit the Cypher manual → Indexes for search performance. Indexes in Cypher are only used for finding the starting points of a query, all subsequent pattern matching is done through the graph structure. Cypher supports range, text, point, lookup, full-text, and vector indexes.

CREATE INDEX Product_productName IF NOT EXISTS FOR (p:Product) ON p.productName;
CREATE INDEX Product_unitPrice IF NOT EXISTS FOR (p:Product) ON p.unitPrice;

Filter products

Filter by list/range

You can filter by multiple values in SQL.

SELECT p.ProductName, p.UnitPrice
FROM products as p
WHERE p.ProductName IN ('Chocolade','Chai');

This can also be achieved in Cypher, which has full collection support, including not only the IN operator but also collection functions, predicates, and transformations.

MATCH (p:Product)
WHERE p.productName IN ['Chocolade','Chai']
RETURN p.productName, p.unitPrice;

Filter by multiple numeric and textual predicates

Filtering can go further. For example, the below query tries to find expensive products starting with "C".

SELECT p.ProductName, p.UnitPrice
FROM products AS p
WHERE p.ProductName LIKE 'C%' AND p.UnitPrice > 100;

In Cypher, the LIKE operator is replaced by the STARTS WITH, CONTAINS, and ENDS WITH operators (all three of which are index-supported):

MATCH (p:Product)
WHERE p.productName STARTS WITH 'C' AND p.unitPrice > 100
RETURN p.productName, p.unitPrice;

You can also use a regular expression, like p.productName =~ '^C.*'.

Joining products with customers

Join records, distinct results

In SQL, if you want to see who bought Chocolade, you can join the four tables together. Refer to the model (ER-diagram) to recall what the model looks like.

SELECT DISTINCT c.CompanyName
FROM customers AS c
JOIN orders AS o ON (c.CustomerID = o.CustomerID)
JOIN order_details AS od ON (o.OrderID = od.OrderID)
JOIN products AS p ON (od.ProductID = p.ProductID)
WHERE p.ProductName = 'Chocolade';

Using Cypher, the graph model is much simpler, as there is no need to JOIN tables. Expressing connections as graph patterns is easier to read too.

MATCH (p:Product {productName:'Chocolade'})<-[:ORDERS]-(:Order)<-[:PURCHASED]-(c:Customer)
RETURN DISTINCT c.companyName;

New customers without existing orders

OUTER JOINS, aggregation

In SQL, if you rephrase the question as "What have I bought and paid in total?", the JOIN stays the same; only the filter expression changes. However, the situation is different if you have customers without any orders and still want to return them. In that case, you have to use OUTER JOINS to make sure that results are returned even if there are no matching rows in other tables.

SELECT p.ProductName, sum(od.UnitPrice * od.Quantity) AS Volume
FROM customers AS c
LEFT OUTER JOIN orders AS o ON (c.CustomerID = o.CustomerID)
LEFT OUTER JOIN order_details AS od ON (o.OrderID = od.OrderID)
LEFT OUTER JOIN products AS p ON (od.ProductID = p.ProductID)
WHERE c.CompanyName = 'Drachenblut Delikatessen'
GROUP BY p.ProductName
ORDER BY Volume DESC;

In your Cypher query, the MATCH between customer and order becomes an OPTIONAL MATCH, which is the equivalent of an OUTER JOIN. Non existing nodes and relationships will then have a null value, which will result in attributes being null and not being aggregated by sum.

MATCH (c:Customer {companyName:'Drachenblut Delikatessen'})
OPTIONAL MATCH (p:Product)<-[o:ORDERS]-(:Order)<-[:PURCHASED]-(c)
RETURN p.productName, toInteger(sum(o.unitPrice * o.quantity)) AS volume
ORDER BY volume DESC;

Top-selling employees

Aggregation, grouping

The previous example mentioned aggregation. By summing up product prices and ordered quantities, an aggregated view per product for the customer was provided.

You can use aggregation functions like sum, count, avg, max both in SQL and Cypher. In SQL, aggregation is explicit so you have to provide all grouping keys again in the GROUP BY clause. If you want to see the top-selling employees, run the following query:

SELECT e.EmployeeID, e.FirstName, e.LastName, COUNT(*) AS Count
FROM Employee AS e
JOIN Orders AS o ON (o.EmployeeID = e.EmployeeID)
GROUP BY e.EmployeeID, e.FirstName, e.LastName
ORDER BY Count DESC
LIMIT 10;

In Cypher grouping for aggregation is implicit. As soon as you use the first aggregation function, all non-aggregated columns automatically become grouping keys. Also, additional aggregation functions like collect, percentileCont, stdDev are available.

MATCH (:Order)<-[:SOLD]-(e:Employee)
WITH e, count(*) as cnt
ORDER BY cnt DESC LIMIT 10
RETURN e.employeeID, e.firstName, e.lastName, cnt

Employee territories

Collecting master-detail queries

In SQL, one of the challenging scenarios is dealing with master-detail information. You have one main entity (master, head, parent) and many dependent ones (detail, position, child). Usually you either query it by joining both and returning the master data multiple times (once for each detail) or by only fetching the primary key of the master and then pulling all detail rows via that foreign key.

For instance, if you look at the employees per territory, then the territory information is returned for each employee.

SELECT e.LastName, et.Description
FROM Employee AS e
JOIN EmployeeTerritory AS et ON (et.EmployeeID = e.EmployeeID)
JOIN Territory AS t ON (et.TerritoryID = t.TerritoryID);

In Cypher, you can either return the structure like in SQL or use the collect() aggregation function, which aggregates values into a collection (list, array). This way, only one row per parent, containing an inlined collection of child values, is returned. This also works for nested values.

MATCH (t:Territory)<-[:IN_TERRITORY]-(e:Employee)
RETURN t.territoryDescription, collect(e.lastName);

Product categories

Hierarchies and trees, variable length joins

If you have to express category-, territory- or organizational hierarchies in SQL, it is usually modeled with a self-join via a foreign key from child to parent. Adding data is not problematic, and neither are single-level queries. As soon as you get into multi-level queries, the number of joins drastically increases, especially if your level depth is not fixed.

Taking the example of the product categories, you have to decide upfront up to how many levels of categories you want to query. Only three potential levels are shown here (which means 1+2+3 = 6 self-joins of the ProductCategory table).

SELECT p.ProductName
FROM Product AS p
JOIN ProductCategory pc ON (p.CategoryID = pc.CategoryID AND pc.CategoryName = "Dairy Products")

JOIN ProductCategory pc1 ON (p.CategoryID = pc1.CategoryID
JOIN ProductCategory pc2 ON (pc2.ParentID = pc2.CategoryID AND pc2.CategoryName = "Dairy Products")

JOIN ProductCategory pc3 ON (p.CategoryID = pc3.CategoryID
JOIN ProductCategory pc4 ON (pc3.ParentID = pc4.CategoryID)
JOIN ProductCategory pc5 ON (pc4.ParentID = pc5.CategoryID AND pc5.CategoryName = "Dairy Products")
;

Cypher is able to express hierarchies of any depth using only the appropriate relationships. Variable levels are represented by variable length paths, which are denoted by a star * after the relationship type and optional limits (min..max).

MATCH (p:Product)-[:PART_OF]->(l:Category)-[:PARENT*0..]-(:Category {name:'Dairy Products'})
RETURN p.name;