GraphGists

Cypher vs. SQL

If you have used SQL and want to learn Cypher, this chapter is for you! We won’t dig very deep into either of the languages, but focus on bridging the gap.

Data Model

For our example, we will use data about persons who act in, direct, and produce movies.

Here’s an entity-relationship model for the example:

"Person" -> "Movie" [label="acted in" arrowhead="crow" arrowtail="crow" dir="both"]
"Person" -> "Movie" [label="directed" arrowhead="crow" arrowtail="crow" dir="both"]
"Person" -> "Movie" [label="produced" arrowhead="crow" arrowtail="crow" dir="both"]

We have Person and Movie entities, which are related in three different ways, each of which have many-to-many cardinality.

In a RDBMS we would use tables for the entities as well as for the associative entities (join tables) needed. In this case we decided to go with the following tables: movie, person, acted_in, directed, produced. You’ll find the SQL for this below.

In Neo4j, the basic data units are nodes and relationships. Both can have properties, which correspond to attributes in a RDBMS.

Nodes can be grouped by putting labels on them. In the example, we will use the labels Movie and Person.

When using Neo4j, related entities can be represented directly by using relationships. There’s no need to deal with foreign keys to handle the relationships, the database will take care of such mechanics. Also, the relationships always have full referential integrity. There are no constraints to enable for this, as it’s not optional; it’s really part of the underlying data model. Relationships always have a type, and we will differentiate the different kinds of relationships by using the types ACTED_IN, DIRECTED, PRODUCED.

Sample Data

First off, let’s see how to set up our example data in a RDBMS. We’ll start out creating a few tables and then go on to populate them.

CREATE TABLE movie (
  id INTEGER,
  title VARCHAR(100),
  released INTEGER,
  tagline VARCHAR(100)
);
CREATE TABLE person (
  id INTEGER,
  name VARCHAR(100),
  born INTEGER
);
CREATE TABLE acted_in (
  role varchar(100),
  person_id INTEGER,
  movie_id INTEGER
);
CREATE TABLE directed (
  person_id INTEGER,
  movie_id INTEGER
);
CREATE TABLE produced (
  person_id INTEGER,
  movie_id INTEGER
);

Populating with data:

INSERT INTO movie (id, title, released, tagline)
VALUES (
  (1, 'The Matrix', 1999, 'Welcome to the Real World'),
  (2, 'The Devil''s Advocate', 1997, 'Evil has its winning ways'),
  (3, 'Monster', 2003, 'The first female serial killer of America')
);
INSERT INTO person (id, name, born)
VALUES (
  (1, 'Keanu Reeves', 1964),
  (2, 'Carrie-Anne Moss', 1967),
  (3, 'Laurence Fishburne', 1961),
  (4, 'Hugo Weaving', 1960),
  (5, 'Andy Wachowski', 1967),
  (6, 'Lana Wachowski', 1965),
  (7, 'Joel Silver', 1952),
  (8, 'Charlize Theron', 1975),
  (9, 'Al Pacino', 1940),
  (10, 'Taylor Hackford', 1944)
);
INSERT INTO acted_in (role, person_id, movie_id)
VALUES (
  ('Neo', 1, 1),
  ('Trinity', 2, 1),
  ('Morpheus', 3, 1),
  ('Agent Smith', 4, 1),
  ('Kevin Lomax', 1, 2),
  ('Mary Ann Lomax', 8, 2),
  ('John Milton', 9, 2),
  ('Aileen', 8, 3)
);
INSERT INTO directed (person_id, movie_id)
VALUES (
  (5, 1),
  (6, 1),
  (10, 2)
);
INSERT INTO produced (person_id, movie_id)
VALUES (
  (7, 1),
  (8, 3)
);

Doing this in Neo4j will look quite different. To begin with, we won’t create any schema up front. We’ll come back to schema later, for now it’s enough to know that labels can be used right away without declaring them.

CREATE in Neo4j

In the CREATE statements below, we tell Neo4j what data we want to have in the graph. Simply put, the parentheses denote nodes, while the arrows (-→, or in our case with a relationship type included -[:DIRECTED]→) denote relationships. For the nodes we set identifiers like TheMatrix so we can easily refer to them later on in the statement. Note that the identifiers are scoped to the statement, and not visible to other Cypher statements. We could use identifiers for the relationships as well, but there’s no need for that in this case.

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person    {name:'Keanu Reeves',       born:1964})
CREATE (Carrie:Person   {name:'Carrie-Anne Moss',   born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person     {name:'Hugo Weaving',       born:1960})
CREATE (AndyW:Person    {name:'Andy Wachowski',     born:1967})
CREATE (LanaW:Person    {name:'Lana Wachowski',     born:1965})
CREATE (JoelS:Person    {name:'Joel Silver',        born:1952})
CREATE
  (Keanu)    -[:ACTED_IN {roles:['Neo']}]->         (TheMatrix),
  (Carrie)   -[:ACTED_IN {roles:['Trinity']}]->     (TheMatrix),
  (Laurence) -[:ACTED_IN {roles:['Morpheus']}]->    (TheMatrix),
  (Hugo)     -[:ACTED_IN {roles:['Agent Smith']}]-> (TheMatrix),
  (AndyW)    -[:DIRECTED]->    (TheMatrix),
  (LanaW)    -[:DIRECTED]->    (TheMatrix),
  (JoelS)    -[:PRODUCED]->    (TheMatrix)
CREATE (TheDevilsAdvocate:Movie {title:"The Devil's Advocate", released:1997, tagline: 'Evil has its winning ways'})
CREATE (Monster:Movie {title: 'Monster', released: 2003, tagline: 'The first female serial killer of America'})
CREATE (Charlize:Person {name:'Charlize Theron', born:1975})
CREATE (Al:Person       {name:'Al Pacino',       born:1940})
CREATE (Taylor:Person   {name:'Taylor Hackford', born:1944})
CREATE
  (Keanu)    -[:ACTED_IN {roles:['Kevin Lomax']}]->    (TheDevilsAdvocate),
  (Charlize) -[:ACTED_IN {roles:['Mary Ann Lomax']}]-> (TheDevilsAdvocate),
  (Al)       -[:ACTED_IN {roles:['John Milton']}]->    (TheDevilsAdvocate),
  (Taylor)   -[:DIRECTED]->                            (TheDevilsAdvocate),
  (Charlize) -[:ACTED_IN {roles:['Aileen']}]->         (Monster),
  (Charlize) -[:PRODUCED {roles:['Aileen']}]->         (Monster)

Simple read of data

Let’s find all entries in the movie table and output their title attribute in our RDBMS:

SELECT movie.title
FROM movie;

Using Neo4j, find all nodes labeled Movie and output their title property:

MATCH (movie:Movie)
RETURN movie.title;

MATCH tells Neo4j to match a pattern in the graph. In this case the pattern is very simple: any node with a Movie label on it. We bind the result of the pattern matching to the identifier movie, for use in the RETURN clause. And as you can see, the RETURN keyword of Cypher is similar to SELECT in SQL.

Now let’s get movies released after 1998.

SELECT movie.title
FROM movie
WHERE movie.released > 1998;

In this case the addition actually looks identical in Cypher.

MATCH (movie:Movie)
WHERE movie.released > 1998
RETURN movie.title;

Note however that the semantics of WHERE in Cypher is somewhat different, see [query-where] for more information.

Join

Let’s list all persons and the movies they acted in.

SELECT person.name, movie.title
FROM person
  JOIN acted_in AS acted_in ON acted_in.person_id = person.id
  JOIN movie ON acted_in.movie_id = movie.id;

The same using Cypher:

Here we match a Person and a Movie node, in case they are connected with an ACTED_IN relationship.

MATCH (person:Person)-[:ACTED_IN]->(movie:Movie)
RETURN person.name, movie.title;

Co-Actors of Keanu Reeves

To make things slightly more complex, let’s search for the co-actors of Keanu Reeves. In SQL we use a self join on the person table and join on the acted_in table once for Keanu, and once for the co-actors.

SELECT DISTINCT co_actor.name
FROM person AS keanu
  JOIN acted_in AS acted_in1 ON acted_in1.person_id = keanu.id
  JOIN acted_in AS acted_in2 ON acted_in2.movie_id = acted_in1.movie_id
  JOIN person AS co_actor
    ON acted_in2.person_id = co_actor.id AND co_actor.id <> keanu.id
WHERE keanu.name = 'Keanu Reeves';

In Cypher, we use a pattern with two paths that target the same Movie node.

MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie),
      (coActor:Person)-[:ACTED_IN]->(movie)
WHERE keanu.name = 'Keanu Reeves'
RETURN DISTINCT coActor.name;

You may have noticed that we used the co_actor.id <> keanu.id predicate in SQL only. This is because Neo4j will only match on the ACTED_IN relationship once in the same pattern. If this is not what we want, we can split the pattern up by using two MATCH clauses like this:

MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)
MATCH (coActor:Person)-[:ACTED_IN]->(movie)
WHERE keanu.name = 'Keanu Reeves'
RETURN DISTINCT coActor.name;

This time Keanu Reeves is included in the result as well

Who are the Actor/Producers?

Next, let’s find out who has both acted in and produced movies.

SELECT person.name
FROM person
WHERE person.id IN (SELECT person_id FROM acted_in)
  AND person.id IN (SELECT person_id FROM produced)

In Cypher, we use patterns as predicates in this case. That is, we require the relationships to exist, but don’t care about the connected nodes; thus the empty parentheses.

MATCH (person:Person)
WHERE (person)-[:ACTED_IN]->() AND (person)-[:PRODUCED]->()
RETURN person.name

Aggregation

Now let’s find out a bit about the directors in movies that Keanu Reeves acted in. We want to know how many of those movies each of them directed.

SELECT director.name, count(*)
FROM person keanu
  JOIN acted_in ON keanu.id = acted_in.person_id
  JOIN directed ON acted_in.movie_id = directed.movie_id
  JOIN person AS director ON directed.person_id = director.id
WHERE keanu.name = 'Keanu Reeves'
GROUP BY director.name
ORDER BY count(*) DESC

Here’s how we’ll do the same in Cypher:

MATCH (keanu:Person {name: 'Keanu Reeves'})-[:ACTED_IN]->(movie:Movie),
     (director:Person)-[:DIRECTED]->(movie)
RETURN director.name, count(*)
ORDER BY count(*) DESC

As you can see there is no GROUP BY in the Cypher equivalent. Instead, Neo4j will automatically figure out the grouping key.