Tutorial: Create a graph data model
This tutorial is designed to help you understand how to model your data based on what you intend to use it for. You will use the Movies example dataset as the main resource.
For an interactive course on the fundamentals of data modeling, see GraphAcademy. |
Define the domain
In this tutorial, you will use the Movies example dataset, the domain includes movies, people who acted or directed movies, and users who rated movies. It is in the connections (relationships) between these entities that you find insights about your domain.
Define the use case
With the domain defined, you need to identify your application use cases. In other words, what questions are you trying to answer?
You can make a list of questions to help you identify the application use cases. The questions will help you define what you need from the application, and what data must be included in the graph.
For this tutorial, your application should be able to answer these questions:
-
Which people acted in a movie?
-
Which person directed a movie?
-
Which movies did a person act in?
-
How many users rated a movie?
-
Who was the youngest person to act in a movie?
-
Which role did a person play in a movie?
-
Which is the highest rated movie in a particular year according to imDB?
-
Which drama movies did an actor act in?
-
Which users gave a movie a rating of 5?
Define the purpose
When designing a graph data model for an application, you may need both a data model and an instance model.
Data model
The data model describes the nodes and relationships in the domain and includes labels, types, and properties. It doesn’t contain any data but shows what information might be needed to answer the use cases.
At this stage, you can opt to use no-code tools to visualize your plan. With Arrows.app, for example, you can draft a data model that includes node labels, relationship types, and properties:
With your example domain and initial questions in mind, you can list the information you need to have in your first model:
-
Differentiation between a person who acted in a movie, who directed a movie, and who rated a movie.
-
What ratings were given, how many there are, and when they were submitted.
-
Which role an actor played in a movie and what their age is.
-
The genres of the movies.
-
Etc.
Note that in the model, labels, relationship types, and property keys follow a certain syntax. In Cypher®, these are called identifiers and they are case-sensitive, as are string values.
See the Cypher style guide for more information. While not mandatory, it is recommended that:
-
Labels are capitalized and should be CamelCase (e.g.,
Person
,Movie
,ImdbUser
). -
Relationship types are written with all capital letters and an underscore character as a separator (e.g.,
DIRECTED
,ACTED_IN
). -
Property keys for nodes or relationships are not capitalized and can be camelCase (e.g.
name
,userID
).
At this stage of creating your initial model, focus on the high-level design of your model, i.e. how your entities connect. The Define entities step describes a more detailed view on how to allocate certain information in a graph (e.g., as a node, relationship, property, etc). |
Instance model
An instance model is a representation of the data that is stored and processed in the actual model. You can use an instance model to test against your use cases.
To create an instance model, you need to have some sample data and load it to a deployment of your choice. The current example is a small but representative dataset:
CREATE (Apollo13:Movie {title: 'Apollo 13', tmdbID: 568, released: '1995-06-30', imdbRating: 7.6, genres: ['Drama', 'Adventure', 'IMAX']})
CREATE (TomH:Person {name: 'Tom Hanks', tmdbID: 31, born: '1956-07-09'})
CREATE (MegR:Person {name: 'Meg Ryan', tmdbID: 5344, born: '1961-11-19'})
CREATE (DannyD:Person {name: 'Danny DeVito', tmdbID: 518, born: '1944-11-17'})
CREATE (JackN:Person {name: 'Jack Nicholson', tmdbID: 514, born: '1937-04-22'})
CREATE (SleeplessInSeattle:Movie {title: 'Sleepless in Seattle', tmdbID: 858, released: '1993-06-25', imdbRating: 6.8, genres: ['Comedy', 'Drama', 'Romance']})
CREATE (Hoffa:Movie {title: 'Hoffa', tmdbID: 10410, released: '1992-12-25', imdbRating: 6.6, genres: ['Crime', 'Drama']})
The data used here can be found in the Movies example dataset, also available in Browser and Aura guides. However, in order to practice data modeling, it is recommended that you add the data manually using Cypher. |
Define entities
An instance model helps you preview how the data will be stored as nodes, relationships, and properties. The next step is to refine your model with more details.
Labels
The dominant nouns in your application use case are represented as nodes in your model and can be used as node labels. For example:
-
Which person acted in a movie?
-
How many users rated a movie?
The nodes in your initial model are thus Person, Movie, and User. Note that creating a model is an iterative process and, after refactoring, your model may look different.
Read more about labels in Graph database concepts. |
Node properties
You can use node properties to:
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]-(m:Movie)
RETURN m
MATCH (p:Person)-[:ACTED_IN]-(m:Movie {title: 'Apollo 13'})-[:RATED]-(u:User)
RETURN p,u
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]-(m:Movie)
RETURN m.title, m.released
With these properties, it is easier to visualize what you need from the graph to answer the use case questions. For example:
Use case | Steps required | Query example |
---|---|---|
Which people acted in a movie? |
|
|
Which person directed a movie? |
|
|
Which movies did a person act in? |
|
|
Who was the youngest person to act in a movie? |
|
|
What is the highest rated movie in a particular year according to imDB? |
|
|
Unique identifiers
In Cypher, it is possible to create two different nodes with the exact same data. However, from a data management and model perspective, different nodes should contain different data. You can use unique identifiers to make sure that every node is a separate and distinguished entity.
In the initial instance model, these are the properties set for the Movies
nodes:
-
Movie.title
(string) -
Movie.tmdbID
(integer) -
Movie.released
(date) -
Movie.imdbRating
(decimal between 0-10) -
Movie.genres
(list of strings)
And for the Person
nodes:
-
Person.name
(string) -
Person.tmdbID
(integer) -
Person.born
(date)
The Movie
node property tmdbID
is a good example of a unique identifier, as there might be different movies with the same title in the database, but the property will be different and thus function as a unique identifier.
It is strongly suggested that you enforce unique identifiers by using uniqueness constrains. Read more about this topic in Cypher → Create property uniqueness constraints. |
Relationships
Relationships are connections between nodes, and these connections are the verbs in your use cases:
-
Which person acted in a movie?
-
Which person directed a movie?
At a glance, connections seem straightforward, but their micro- and macro-design are arguably the most critical factors in graph performance. To get started, thinking of relationships from the perspective that “connections are verbs” works well, but there are other important considerations that you will learn as you advance with your model.
Naming
It is important to choose good names (types) for the relationships in the graph and be as specific as possible in order to allow Neo4j to traverse only relevant connections.
For example, instead of connecting two nodes with a generic relationship type (e.g. CONNECTED_TO
), prefer to be more specific and intuitive about the way those entities connect.
For this tutorial sample, you could define relationships as:
-
ACTED_IN
-
DIRECTED
With these options, you can already plan the direction of the relationships.
Relationship direction
All relationships must have a direction. When created, relationships need to specify their direction explicity or be inferred by the left-to-right order of the pattern.
In the example use cases, the ACTED_IN
relationship must be created to go from a Person
node to a Movie
node:
To add all ACTED_IN
and DIRECTED
relationships, you can use this statement:
MERGE (TomH)-[:ACTED_IN]->(Apollo13)
MERGE (TomH)-[:ACTED_IN]->(SleeplessInSeattle)
MERGE (MegR)-[:ACTED_IN]->(SleeplessInSeattle)
MERGE (DannyD)-[:ACTED_IN]->(Hoffa)
MERGE (DannyD)-[:DIRECTED]->(Hoffa)
MERGE (JackN)-[:ACTED_IN]->(Hoffa)
And your graph should now look like this:
You can always use the query |
Relationship properties
Properties for a relationship are used to enrich how two nodes are related. When you need to know how two nodes are related and not just that they are related, you can use relationship properties to further define the relationship.
The example question "Which role did a person play in a movie?" can be asked with the help of the property roles
in the ACTED_IN
relationship:
Note that the information about roles needs to be added to the graph before being retrieved. You can use this Cypher statement for that:
MERGE (TomH)-[:ACTED_IN {roles:'Jim Lovell'}]->(Apollo13)
MERGE (TomH)-[:ACTED_IN {roles:'Sam Baldwin'}]->(SleeplessInSeattle)
MERGE (MegR)-[:ACTED_IN {roles:'Annie Reed'}]->(SleeplessInSeattle)
MERGE (DannyD)-[:ACTED_IN {roles:'Robert "Bobby" Ciaro'}]->(Hoffa)
MERGE (JackN)-[:ACTED_IN {roles:'Hoffa'}]->(Hoffa)
Then, in order to find which role Tom Hanks played in Apollo 13, you use the following statement:
MATCH (p:Person {name:'Tom Hanks'})-[r:ACTED_IN]->(m:Movie {title:'Apollo 13'})
RETURN r.roles
With the addition of the new relationship property, your graph should now look like this:
Add more data
Now that you have created the first connections between the nodes, it’s time to add more information to the graph. This way, you can answer more questions, such as:
-
How many users rated a movie?
-
Which users gave a movie a rating of 5?
To answer these questions, you need information about users and their ratings in your graph, which means a change in your data model.
Note that, with the addition of new data such as the property roles
in the ACTED_IN
relationship, your initial data model has already been updated along the way:
You can start by adding the users to your graph:
MERGE (Sandy:User {name: 'Sandy Jones', userID: 1})
MERGE (Clinton:User {name: 'Clinton Spencer, userID: 2'})
While it is possible to add user information as a |
Then, connect the User
nodes to the Movie
nodes through a RATED
relationship which contains the rating
property:
MERGE (Sandy)-[:RATED {rating:5}]->(Apollo13)
MERGE (Sandy)-[:RATED {rating:4}]->(SleeplessInSeattle)
MERGE (Clinton)-[:RATED {rating:3}]->(Apollo13)
MERGE (Clinton)-[:RATED {rating:3}]->(SleeplessInSeattle)
MERGE (Clinton)-[:RATED {rating:3}]->(Hoffa)
Your graph should now look like this:
Test the model
After populating the graph to implement the data model with a small set of test data, you should now test it to ensure that it satisfies every use case.
For example, if you want to test the use case "Which people acted in a movie?", you can run the following query:
MATCH (p:Person)-[:ACTED_IN]-(m:Movie)
WHERE m.title = 'Sleepless in Seattle'
RETURN p.name
This is just a simple example of testing. As you go through the use cases, you may think of more data to be added to the graph in order to complete the testing.
Additionally, make sure that the Cypher statements used to test the use cases are correct. A query written incorrectly could lead to the assumption that the data model has failed.
For example, using an incorrect node label in a test may lead you to believe that the data doesn’t exist in the graph.
At this point, you can also start considering the scalability of your graph and how performant it would be if you write the same queries in a graph with millions of nodes and relationships.
Refactoring
The next step, refactoring, is about making adjustments after you are finished testing your graph. Refer to Tutorial: Refactoring for instructions.