Chapter 1. Introduction

This chapter introduces graph database concepts and Neo4j highlights.

1.1. Neo4j highlights

Connected data is all around us. Neo4j supports rapid development of graph-powered systems that take advantage of the rich connectedness of data.

A native graph database: Neo4j is built from the ground up to be a graph database. The architecture is designed for optimizing fast management, storage, and traversal of nodes and relationships. In Neo4j, relationships are first class citizens that represent pre-materialized connections between entities. An operation known in the relational database world as a join, whose performance degrades exponentially with the number of relationships, is performed by Neo4j as navigation from one node to another, whose performance is linear.

This different approach to storing and querying connections between entities provides traversal performance of up to 4 million hops per second per core. Because most graph searches are local to the larger neighborhood of a node, the total amount of data stored in a database does not affect the runtime of an operation. Dedicated memory management and highly scalable, memory-efficient operations contribute to these benefits.

Whiteboard friendly: The property graph approach allows consistent use of the same model throughout conception, design, implementation, storage, and visualization of any domain or use case. This allows all business stakeholders to participate throughout the development cycle. With the schema-optional model, the domain model can evolve continuously as requirements change, without the penalty of expensive schema changes and migrations.

Cypher, the declarative graph query language, is designed to visually represent graph patterns of nodes and relationships. This highly capable, yet easily readable, query language is centered around the patterns that express concepts or questions from a specific domain. Cypher can also be extended for narrow optimizations for specific use cases.

Supports rapid development: Neo4j supports fast development of graph-powered systems. Neo4j’s development stems from the need to run real-time queries on highly related information; something no other database can provide. These unique Neo4j features get you up and running quickly and sustain fast application development for highly scalable applications.

Provides true data safety through ACID transactions: Neo4j uses transactions to guarantee that data is persisted in the case of hardware failure or system crashes.

Designed for business-critical and high-performance operations: Neo4j uses a replicated master-slave cluster setup. It can store hundreds of trillions of entities for the largest datasets imaginable while being sensitive to compact storage. Neo4j can be deployed as a scalable, fault-tolerant cluster of machines. Due to its high scalability, Neo4j clusters require only tens of machines, not hundreds or thousands, saving on cost and operational complexity. Other features for production applications include hot-backups and extensive monitoring.

Neo4j’s application is only limited by your imagination.

1.2. Graph database concepts

This section contains an introduction to the graph data model.

1.2.1. The Neo4j graph database

A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.

For graph database terminology, see Terminology.

Here’s an example graph which we will approach step by step in the following sections:

[Figure: the example graph]

1.2.1.1. Nodes

A graph records data in nodes and relationships. Both can have properties. This is sometimes referred to as the "Property Graph Model".

The fundamental units that form a graph are nodes and relationships. In Neo4j, both nodes and relationships can contain properties.

Nodes are often used to represent entities, but depending on the domain, relationships may be used for that purpose as well.

In addition to having properties and relationships, nodes can also be labeled with one or more labels.

The simplest possible graph is a single node. A node can have zero or more named values referred to as properties. Let’s start out with one node that has a single property named title:

[Figure: a single node with a title property]

The next step is to have multiple nodes. Let’s add two more nodes and one more property on the node in the previous example:

[Figure: three nodes with properties]
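
The two steps above can be sketched in a few lines of Python. This is only an illustration of the data model, not Neo4j's API: the Node class is hypothetical, and the second property name (released), the name property, and the names and values of the extra nodes are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node holding zero or more named property values."""
    properties: dict = field(default_factory=dict)


# Step 1: one node with a single property named "title".
movie = Node({"title": "Forrest Gump"})

# Step 2: one more property on that node, plus two more nodes.
# "released", "name" and their values are illustrative assumptions.
movie.properties["released"] = 1994
actor = Node({"name": "Tom Hanks"})
director = Node({"name": "Robert Zemeckis"})

graph = [movie, actor, director]
```

At this point the graph is just three unconnected nodes.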

1.2.1.2. Relationships

Relationships organize the nodes by connecting them. A relationship connects two nodes — a start node and an end node. Just like nodes, relationships can have properties.

Relationships between nodes are a key part of a graph database. They allow for finding related data.

A relationship connects two nodes, and is guaranteed to have valid start and end nodes.

Relationships organize nodes into arbitrary structures, allowing a graph to resemble a list, a tree, a map, or a compound entity — any of which can be combined into yet more complex, richly inter-connected structures.

Our example graph will make a lot more sense once we add relationships to it:

[Figure: the example graph with relationships]

Our example uses ACTED_IN and DIRECTED as relationship types. The roles property on the ACTED_IN relationship has an array value with a single item in it.

Below is an ACTED_IN relationship, with the Tom Hanks node as start node and Forrest Gump as end node.

[Figure: an ACTED_IN relationship from the Tom Hanks node to the Forrest Gump node]

You could also say that the Tom Hanks node has an outgoing relationship, while the Forrest Gump node has an incoming relationship.

Relationships are traversed equally well in either direction. This means that there is no need to add duplicate relationships in the opposite direction (with regard to traversal or performance).

While relationships always have a direction, you can ignore the direction where it is not useful in your application.

Note that a node can have relationships to itself as well:

[Figure: the Tom Hanks node with a KNOWS relationship to itself]

The example above would mean that Tom Hanks KNOWS himself.
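
As a rough sketch of these ideas, the Python below models a relationship as a start node, an end node, a type, and optional properties. The classes and helper functions are hypothetical and are not part of any Neo4j API; the single role name inside the roles list is an assumed value.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    properties: dict = field(default_factory=dict)


@dataclass(eq=False)
class Relationship:
    start: Node   # start node
    end: Node     # end node
    type: str     # relationship type, for example "ACTED_IN"
    properties: dict = field(default_factory=dict)


def outgoing(node, rels):
    """Relationships that have `node` as their start node."""
    return [r for r in rels if r.start is node]


def incoming(node, rels):
    """Relationships that have `node` as their end node."""
    return [r for r in rels if r.end is node]


tom = Node({"name": "Tom Hanks"})
gump = Node({"title": "Forrest Gump"})

# The roles property holds an array with a single item (value assumed).
acted_in = Relationship(tom, gump, "ACTED_IN", {"roles": ["Forrest"]})
# A node can have a relationship to itself.
knows_self = Relationship(tom, tom, "KNOWS")
rels = [acted_in, knows_self]
```

The same ACTED_IN relationship shows up as outgoing from the Tom Hanks node and as incoming to the Forrest Gump node, so one stored relationship serves both directions.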

To further enhance graph traversal, all relationships have a relationship type.

Let’s have a look at what can be found by simply following the relationships of a node in our example graph:

[Figure: the example graph with relationship types and directions]

Table 1.1. Using relationship direction and type

What we want to know    Start from   Relationship type  Direction
get actors in movie     movie node   ACTED_IN           incoming
get movies with actor   person node  ACTED_IN           outgoing
get directors of movie  movie node   DIRECTED           incoming
get movies directed by  person node  DIRECTED           outgoing
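
The four lookups in Table 1.1 can be sketched as follows. This is a minimal illustration under stated assumptions: relationships are plain (start, type, end) tuples, nodes are identified by name, the neighbours function is hypothetical, and the director's name is an assumed value.

```python
# Each relationship is a (start, type, end) triple; nodes are plain names.
# The director's name is an illustrative assumption.
RELS = [
    ("Tom Hanks", "ACTED_IN", "Forrest Gump"),
    ("Robert Zemeckis", "DIRECTED", "Forrest Gump"),
]


def neighbours(node, rel_type, direction):
    """Return the nodes reached from `node` over `rel_type` in `direction`."""
    if direction == "outgoing":
        return [end for start, t, end in RELS
                if start == node and t == rel_type]
    if direction == "incoming":
        return [start for start, t, end in RELS
                if end == node and t == rel_type]
    raise ValueError(f"unknown direction: {direction}")


actors = neighbours("Forrest Gump", "ACTED_IN", "incoming")
movies_with_actor = neighbours("Tom Hanks", "ACTED_IN", "outgoing")
directors = neighbours("Forrest Gump", "DIRECTED", "incoming")
movies_directed = neighbours("Robert Zemeckis", "DIRECTED", "outgoing")
```

Each row of the table maps to one call: the start node, the relationship type, and the direction are the only inputs needed.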

1.2.1.3. Properties

Both nodes and relationships can have properties.

Properties are named values where the name is a string. The supported property values are:

  • Numeric values
  • String values
  • Boolean values
  • Lists of any other type of value

NULL is not a valid property value. Instead of storing it in the database, NULL can be modeled by the absence of a key.

Table 1.2. Property value types

Type     Description                                               Value range
boolean                                                            true/false
byte     8-bit integer                                             -128 to 127, inclusive
short    16-bit integer                                            -32768 to 32767, inclusive
int      32-bit integer                                            -2147483648 to 2147483647, inclusive
long     64-bit integer                                            -9223372036854775808 to 9223372036854775807, inclusive
float    32-bit IEEE 754 floating-point number
double   64-bit IEEE 754 floating-point number
char     16-bit unsigned integers representing Unicode characters  \u0000 to \uffff (0 to 65535)
String   sequence of Unicode characters

For further details on float/double values, see Java Language Specification.
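
To make the rules above concrete, here is a small Python sketch that picks the narrowest integer type from Table 1.2 for a value and checks whether a value is of a supported kind. Both functions are hypothetical helpers written for this illustration; they simplify the rules (for example, floats and chars are not range-checked) and are not part of Neo4j.

```python
# Inclusive ranges of the integer types listed in Table 1.2.
INTEGER_RANGES = {
    "byte": (-2**7, 2**7 - 1),
    "short": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}


def smallest_integer_type(value):
    """Return the name of the narrowest listed type that can hold `value`."""
    for name, (low, high) in INTEGER_RANGES.items():
        if low <= value <= high:
            return name
    raise OverflowError(f"{value} does not fit in a 64-bit integer")


def is_valid_property_value(value):
    """Accept booleans, numbers, strings, and flat lists of those."""
    if isinstance(value, bool) or isinstance(value, (float, str)):
        return True
    if isinstance(value, int):
        low, high = INTEGER_RANGES["long"]
        return low <= value <= high
    if isinstance(value, list):
        # Lists may hold any of the other value types, but not lists.
        return all(
            is_valid_property_value(v) and not isinstance(v, list)
            for v in value
        )
    # None (NULL) and anything else is rejected; model NULL by
    # leaving the key out instead.
    return False
```

In this sketch a missing value is never stored: the caller simply omits the key.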

1.2.1.4. Labels

Labels assign roles or types to nodes.

A label is a named graph construct that is used to group nodes into sets; all nodes labeled with the same label belong to the same set. Many database queries can work with these sets instead of the whole graph, making queries easier to write and more efficient to execute. A node may be labeled with any number of labels, including none, making labels an optional addition to the graph.

Labels are used when defining constraints and adding indexes for properties (see Section 1.2.1.7, “Schema”).

An example would be a label named User that you label all your nodes representing users with. With that in place, you can ask Neo4j to perform operations only on your user nodes, such as finding all users with a given name.

However, you can use labels for much more. For instance, since labels can be added and removed during runtime, they can be used to mark temporary states for your nodes. You might create an Offline label for phones that are offline, a Happy label for happy pets, and so on.
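
The two uses of labels described above, restricting an operation to a set of nodes and marking temporary state, might look like this in a minimal Python sketch. The Node class and the nodes_with_label helper are hypothetical, and the user names and the Phone label are made-up example values.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def nodes_with_label(nodes, label):
    """Return only the nodes that carry `label`."""
    return [n for n in nodes if label in n.labels]


# Names and the "Phone" label are illustrative assumptions.
alice = Node({"User"}, {"name": "Alice"})
bob = Node({"User"}, {"name": "Bob"})
phone = Node({"Phone"})
unlabeled = Node()  # a node may carry no labels at all
nodes = [alice, bob, phone, unlabeled]

# Operate only on user nodes: find all users with a given name.
users_named_bob = [n for n in nodes_with_label(nodes, "User")
                   if n.properties.get("name") == "Bob"]

# Labels can be added and removed at runtime to mark temporary state.
phone.labels.add("Offline")
offline_now = nodes_with_label(nodes, "Offline")
phone.labels.discard("Offline")
offline_later = nodes_with_label(nodes, "Offline")
```

The lookup by name only ever inspects the nodes in the User set, which is the point of grouping nodes by label.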

In our example, we’ll add Person and Movie labels to our graph:

[Figure: the example graph with Person and Movie labels]

A node can have multiple labels. Let’s add an Actor label to the Tom Hanks node.

[Figure: the Tom Hanks node with an added Actor label]

1.2.1.4.1. Label names

Any non-empty Unicode string can be used as a label name. In Cypher, you may need to use the backtick (`) syntax to avoid clashes with Cypher identifier rules or to allow non-alphanumeric characters in a label. By convention, labels are written with CamelCase notation, with the first letter in upper case. For instance, User or CarOwner.

Labels have an id space of an int, meaning the maximum number of labels the database can contain is roughly 2 billion.

1.2.1.5. Traversal

A traversal navigates through a graph to find paths.

A traversal is how you query a graph, navigating from starting nodes to related nodes, finding answers to questions like "what music do my friends like that I don’t yet own," or "if this power supply goes down, what web services are affected?"

Traversing a graph means visiting its nodes, following relationships according to some rules. In most cases only a subgraph is visited, as you already know where in the graph the interesting nodes and relationships are found.

Cypher provides a declarative way to query the graph, powered by traversals and other techniques. See Chapter 3, Cypher query language, for more information.

If we want to find out which movies Tom Hanks acted in according to our tiny example database, the traversal would start from the Tom Hanks node, follow any ACTED_IN relationships connected to the node, and end up with Forrest Gump as the result (see the dashed lines):

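[Figure: the traversal from the Tom Hanks node along ACTED_IN to Forrest Gump, shown with dashed lines]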
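
A traversal of this kind can be sketched as a breadth-first walk that follows only relationships of one type, up to a depth limit. The code below is a simplified illustration, not how Neo4j implements traversals: the traverse function and its max_depth parameter are hypothetical, relationships are (start, type, end) tuples, and the director's name is an assumed value.

```python
from collections import deque

# Relationships as (start, type, end) triples; the director's name
# is an illustrative assumption.
RELS = [
    ("Tom Hanks", "ACTED_IN", "Forrest Gump"),
    ("Robert Zemeckis", "DIRECTED", "Forrest Gump"),
]


def traverse(start, rel_type, max_depth=1):
    """Breadth-first traversal over outgoing `rel_type` relationships.

    Returns the nodes reached within `max_depth` hops, excluding `start`.
    """
    visited = {start}
    reached = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # rule: do not expand beyond the depth limit
        for s, t, e in RELS:
            # rule: follow only relationships of the requested type
            if s == node and t == rel_type and e not in visited:
                visited.add(e)
                reached.append(e)
                queue.append((e, depth + 1))
    return reached


movies = traverse("Tom Hanks", "ACTED_IN")
```

Only the part of the graph reachable under the given rules is visited; the DIRECTED relationship is never followed here. For brevity the sketch scans the whole relationship list at each step, which a real graph store would avoid.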

1.2.1.6. Paths

A path is one or more nodes with connecting relationships, typically retrieved as a query or traversal result.

In the previous example, the traversal result could be returned as a path:

[Figure: a path from the Tom Hanks node to the Forrest Gump node]

The path above has length one.

The shortest possible path has length zero — that is, it contains only a single node and no relationships — and can look like this:

[Figure: a path consisting of a single node]

This path has length one:

[Figure: a path of length one]
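
The notion of path length used above (the number of relationships in the path) can be captured in a short Python sketch. The Path class is hypothetical and exists only to illustrate the definition; nodes are represented by name and relationships by type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    nodes: tuple          # visited nodes, in order
    relationships: tuple  # types of the relationships between them

    def __post_init__(self):
        # A path with n relationships visits n + 1 node positions.
        if len(self.nodes) != len(self.relationships) + 1:
            raise ValueError(
                "a path needs exactly one more node than relationships")

    def __len__(self):
        # The length of a path is its number of relationships.
        return len(self.relationships)


single_node = Path(("Tom Hanks",), ())
acted_in = Path(("Tom Hanks", "Forrest Gump"), ("ACTED_IN",))
knows_self = Path(("Tom Hanks", "Tom Hanks"), ("KNOWS",))
```

Under this definition a single node is a path of length zero, and a relationship from a node to itself still counts as a path of length one.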

1.2.1.7. Schema

Neo4j is a schema-optional graph database.

You can use Neo4j without any schema. Optionally, you can introduce it in order to gain performance or modeling benefits. This allows a way of working where the schema does not get in your way until you are at a stage where you want to reap the benefits of having one.

Schema commands can only be applied on the master machine in a Neo4j cluster. If you apply them on a slave you will receive a Neo.ClientError.Transaction.InvalidType error code (see Section A.1, “Neo4j Status Codes”).

1.2.1.7.1. Indexes

Performance is gained by creating indexes, which improve the speed of looking up nodes in the database.

Once you have specified which properties to index, Neo4j will make sure your indexes are kept up to date as your graph evolves. Any operation that looks up nodes by the newly indexed properties will see a significant performance boost.

Indexes in Neo4j are eventually available. When you first create an index, the operation returns immediately, but the index is populated in the background and so is not immediately available for querying. Once the index has been fully populated, it comes online and is ready to be used in queries.

If something goes wrong with the index, it can end up in a failed state. A failed index is not used to speed up queries. To rebuild it, drop and recreate the index. Check the logs for clues about the failure.
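
The index lifecycle described above can be approximated with a small Python sketch. It is a toy model under several assumptions: the PropertyIndex class, the find function, and the state names POPULATING and ONLINE are hypothetical, nodes are plain dictionaries, population runs synchronously rather than in the background, and the second person's name is a made-up value.

```python
from collections import defaultdict


class PropertyIndex:
    """Maps values of one property to the nodes that hold them."""

    def __init__(self, key):
        self.key = key
        self.state = "POPULATING"  # hypothetical state name
        self._entries = defaultdict(list)

    def populate(self, nodes):
        """Index the existing nodes, then bring the index online."""
        for node in nodes:
            self.on_node_added(node)
        self.state = "ONLINE"

    def on_node_added(self, node):
        """Keep the index up to date as the graph evolves."""
        if self.key in node:
            self._entries[node[self.key]].append(node)

    def lookup(self, value):
        return list(self._entries.get(value, []))


def find(nodes, index, value):
    """Use the index only when it is online; otherwise scan every node."""
    if index.state == "ONLINE":
        return index.lookup(value)
    return [n for n in nodes if n.get(index.key) == value]


nodes = [{"name": "Tom Hanks"}, {"title": "Forrest Gump"}]
index = PropertyIndex("name")
state_before = index.state
index.populate(nodes)

# A node added later is picked up by the index as well.
new_node = {"name": "Robert Zemeckis"}
nodes.append(new_node)
index.on_node_added(new_node)
```

Queries return the same results whether or not the index is online; the index only changes how much data has to be inspected to answer them.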

For working with indexes in Cypher, see Section 3.7.1, “Indexes”.

1.2.1.7.2. Constraints

Neo4j can help keep your data clean. It does so using constraints. Constraints allow you to specify the rules for what your data should look like. Any changes that break these rules will be denied.
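
As an illustration of a rule that denies offending changes, the Python sketch below enforces one kind of constraint: a property value that must be unique among nodes with a given label. The Graph class, the ConstraintViolation exception, and the choice of a uniqueness rule on Person names are assumptions made for this example, not a description of how Neo4j implements constraints.

```python
class ConstraintViolation(Exception):
    """Raised when a change would break a constraint."""


class Graph:
    def __init__(self):
        self.nodes = []               # (labels, properties) pairs
        self.unique_constraints = []  # (label, property key) pairs

    def create_node(self, labels, properties):
        # Check every rule first, so a denied change leaves the data as is.
        for label, key in self.unique_constraints:
            if label not in labels or key not in properties:
                continue  # the rule does not apply to this node
            for other_labels, other_props in self.nodes:
                if (label in other_labels
                        and other_props.get(key) == properties[key]):
                    raise ConstraintViolation(
                        f"{label}.{key} = {properties[key]!r} already exists")
        self.nodes.append((set(labels), dict(properties)))


graph = Graph()
graph.unique_constraints.append(("Person", "name"))
graph.create_node({"Person"}, {"name": "Tom Hanks"})

try:
    graph.create_node({"Person"}, {"name": "Tom Hanks"})
    denied = False
except ConstraintViolation:
    denied = True
```

The second create_node call is rejected and the graph still contains a single Person node.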

For working with constraints in Cypher, see Section 3.7.2, “Constraints”.