2.1. The Neo4j Graph Database
A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.
For terminology around graph databases, see Terminology.
Here’s an example graph which we will approach step by step in the following sections:
A graph records data in nodes and relationships. Both can have properties. This is sometimes referred to as the Property Graph Model.
The fundamental units that form a graph are nodes and relationships. In Neo4j, both nodes and relationships can contain properties.
Nodes are often used to represent entities, but depending on the domain relationships may be used for that purpose as well.
Apart from properties and relationships, nodes can also be labeled with zero or more labels.
The simplest possible graph is a single Node.
A Node can have zero or more named values referred to as properties.
Let’s start out with one node that has a single property named
The next step is to have multiple nodes. Let’s add two more nodes and one more property on the node in the previous example:
Relationships organize the nodes by connecting them. A relationship connects two nodes — a start node and an end node. Just like nodes, relationships can have properties.
Relationships between nodes are a key part of a graph database. They allow for finding related data. Just like nodes, relationships can have properties.
A relationship connects two nodes, and is guaranteed to have valid start and end nodes.
Relationships organize nodes into arbitrary structures, allowing a graph to resemble a list, a tree, a map, or a compound entity — any of which can be combined into yet more complex, richly inter-connected structures.
Our example graph will make a lot more sense once we add relationships to it:
Our example uses
DIRECTED as relationship types.
roles property on the
ACTED_IN relationship has an array value with a single item in it.
Below is an
ACTED_IN relationship, with the
Tom Hanks node as start node and
Forrest Gump as end node.
You could also say that the
Tom Hanks node has an outgoing relationship, while the
Forrest Gump node has an incoming relationship.
|Relationships are equally well traversed in either direction.
This means that there is no need to add duplicate relationships in the opposite direction (with regard to traversal or performance).
While relationships always have a direction, you can ignore the direction where it is not useful in your application.
Note that a node can have relationships to itself as well:
The example above would mean that
To further enhance graph traversal all relationships have a relationship type.
Let’s have a look at what can be found by simply following the relationships of a node in our example graph:
Using relationship direction and type
|What we want to know||Start from||Relationship type||Direction|
get actors in movie
get movies with actor
get directors of movie
get movies directed by
Both nodes and relationships can have properties.
Properties are named values where the name is a string. The supported property values are:
- Numeric values,
- String values,
- Boolean values,
- Collections of any other type of value.
For further details on supported property values, see Section 35.3, “Property values”.
Labels assign roles or types to nodes.
A label is a named graph construct that is used to group nodes into sets; all nodes labeled with the same label belongs to the same set. Many database queries can work with these sets instead of the whole graph, making queries easier to write and more efficient to execute. A node may be labeled with any number of labels, including none, making labels an optional addition to the graph.
Labels are used when defining constraints and adding indexes for properties (see the section called “Schema”).
An example would be a label named
User that you label all your nodes representing users with.
With that in place, you can ask Neo4j to perform operations only on your user nodes, such as finding all users with a given name.
However, you can use labels for much more.
For instance, since labels can be added and removed during runtime, they can be used to mark temporary states for your nodes.
You might create an
Offline label for phones that are offline, a
Happy label for happy pets, and so on.
In our example, we’ll add
Movie labels to our graph:
A node can have multiple labels, let’s add an
Actor label to the
Tom Hanks node.
Any non-empty Unicode string can be used as a label name.
In Cypher, you may need to use the backtick (
`) syntax to avoid clashes with Cypher identifier rules or to allow non-alphanumeric characters in a label.
By convention, labels are written with CamelCase notation, with the first letter in upper case.
Labels have an id space of an int, meaning the maximum number of labels the database can contain is roughly 2 billion.
A traversal navigates through a graph to find paths.
A traversal is how you query a graph, navigating from starting nodes to related nodes, finding answers to questions like “what music do my friends like that I don’t yet own,” or “if this power supply goes down, what web services are affected?”
Traversing a graph means visiting its nodes, following relationships according to some rules. In most cases only a subgraph is visited, as you already know where in the graph the interesting nodes and relationships are found.
Cypher provides a declarative way to query the graph powered by traversals and other techniques. See Cypher Query Language for more information.
When writing server plugins or using Neo4j embedded, Neo4j provides a callback based traversal API which lets you specify the traversal rules. At a basic level there’s a choice between traversing breadth- or depth-first.
If we want to find out which movies Tom Hanks acted in according to our tiny example database the traversal would start from the
Tom Hanks node, follow any
ACTED_IN relationships connected to the node, and end up with
Forrest Gump as the result (see the dashed lines):
A path is one or more nodes with connecting relationships, typically retrieved as a query or traversal result.
In the previous example, the traversal result could be returned as a path:
The path above has length one.
The shortest possible path has length zero — that is it contains only a single node and no relationships — and can look like this:
This path has length one:
Neo4j is a schema-optional graph database.
You can use Neo4j without any schema. Optionally you can introduce it in order to gain performance or modeling benefits. This allows a way of working where the schema does not get in your way until you are at a stage where you want to reap the benefits of having one.
Schema commands can only be applied on the master machine in a Neo4j cluster (see Chapter 25, High Availability).
If you apply them on a slave you will receive a
Performance is gained by creating indexes, which improve the speed of looking up nodes in the database.
This feature was introduced in Neo4j 2.0, and is not the same as the legacy indexes (see Chapter 37, Legacy Indexing).
Once you’ve specified which properties to index, Neo4j will make sure your indexes are kept up to date as your graph evolves. Any operation that looks up nodes by the newly indexed properties will see a significant performance boost.
Indexes in Neo4j are eventually available. That means that when you first create an index the operation returns immediately. The index is populating in the background and so is not immediately available for querying. When the index has been fully populated it will eventually come online. That means that it is now ready to be used in queries.
If something should go wrong with the index, it can end up in a failed state. When it is failed, it will not be used to speed up queries. To rebuild it, you can drop and recreate the index. Look at logs for clues about the failure.
You can track the status of your index by asking for the index state through the API you are using. Note, however, that this is not yet possible through Cypher.
How to use indexes through the different APIs:
- Cypher: Section 14.1, “Indexes”
- REST API: Section 21.15, “Indexing”
- Listing Indexes via Shell: the section called “Listing Indexes and Constraints”
- Java Core API: Section 35.4, “User database with indexes”
This feature was introduced in Neo4j 2.0.
Neo4j can help you keep your data clean. It does so using constraints, that allow you to specify the rules for what your data should look like. Any changes that break these rules will be denied.
In this version, unique constraints is the only available constraint type.
How to use constraints through the different APIs:
- Cypher: Section 14.2, “Constraints”
- REST API: Section 21.16, “Constraints”
- Listing Constraints via Shell: the section called “Listing Indexes and Constraints”