Graph Data Modeling Core Principles

About this module

At the end of this module, you will be able to:

  • Describe graph data modeling best practices for modeling:

    • Nodes (entities)

    • Relationships

    • Properties

  • Describe data accessibility

Best practices for modeling nodes

Your model must address:

  • Uniqueness of nodes

  • Complex data

Uniqueness of nodes: before

The Country nodes are considered super nodes (nodes with a lot of fan-in or fan-out). Be careful about using them in your design; developers in particular need to be mindful of queries that might select all paths in or out of a super node.

In models like this one, an address must be uniquely identified because it represents a single point of data in the graph. This is important in fraud detection applications where variations of the same address may be used to make a fraudulent transaction against the application.

In this particular example, it is clear from the wider context that the various 500 Main St nodes are all distinct places. We also see that a postal code on its own is not unique, so we want to make it unique by combining it with the country.

Uniqueness of nodes: after


Here is a solution where we have ensured that the Address and PostalCode nodes are unique.

We consider it a best practice to always have a property (or set of properties) that uniquely identifies a node. Here, we have added a geolocation property to do so. The geolocation property will likely never be used in a read query, but it can be used to differentiate nodes, especially when loading the data.

There is a trade-off, however. Making a node such as Address unique makes it harder to anchor on the Line1 value alone, and it may make the graph data more difficult to modify later.
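The idea of identifying a node by a set of properties can be sketched in a few lines. This is only a toy in-memory stand-in, not the Neo4j API: the `merge_address` helper, the dictionary store, and the coordinate values are all hypothetical, and only Line1 and geolocation come from the example above.

```python
# Toy in-memory stand-in for a graph store; this is NOT the Neo4j API.
# merge_address and the coordinates below are hypothetical illustrations.

def merge_address(store, line1, geolocation):
    """Return the Address node for (line1, geolocation), creating it if missing."""
    key = (line1, geolocation)  # the set of properties that identifies the node
    if key not in store:
        store[key] = {"label": "Address", "line1": line1, "geolocation": geolocation}
    return store[key]

addresses = {}
a = merge_address(addresses, "500 Main St", (40.0, -75.0))
b = merge_address(addresses, "500 Main St", (40.0, -75.0))  # same place loaded twice
c = merge_address(addresses, "500 Main St", (51.5, -0.1))   # different place, same Line1

# The trade-off: anchoring on Line1 alone no longer selects a single node.
same_line1 = [n for n in addresses.values() if n["line1"] == "500 Main St"]
```

Loading the same address twice yields one node, while the two distinct places that share a Line1 value stay separate. The last line shows the cost mentioned above: a lookup on Line1 alone now matches more than one node.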

Complex data in one node


You need to strike a balance between the number of properties that represent complex data and the use of multiple nodes and relationships.

Here we have a single node whose properties hold complex data.

Use fanout judiciously for complex data now spread across multiple nodes

  • Reduce property duplication.

  • Reduce gather-and-inspect.

Here we show an extreme implementation of fanout. For modeling complex data, the previous example with all properties in a node and this example where each property is in its own node are usually suboptimal.

In general, you use fanout to do one of two things.

  1. Reduce duplication of properties. Instead of having a repeated property on every node, you can instead have all of those nodes connected to a shared node with that property. This can make data updates massively easier.

  2. Reduce gather-and-inspect behavior during a traversal. In the one node example, if we want to find every address in the city of Axebridge, we would need to check the properties on every Person node, then discard the irrelevant nodes. This is grossly inefficient. In this multiple node case, this is a simple matter of locating the singular Axebridge node, and traversing to every Address node connected to it. This model has no “wasted” hops.

As a result, you generally use fanout for anchors and traversing the graph. You almost never see fanout used for output, unique identifier, or decorator properties, because it makes traversal a few hops longer for no real benefit. This is why this maximum-fanout case is usually undesirable. It is almost never the case that every property is an anchor or used for the traversal!
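The gather-and-inspect difference can be made concrete with a small sketch. It assumes a toy in-memory model rather than a real graph database; the people, street names, and the cities other than Axebridge are invented for illustration, and the helper names are hypothetical.

```python
# Single-node model: the city is buried in a property on every Person node.
# All names and addresses here are made up for illustration.
people = [
    {"name": "Ann", "address": {"line1": "1 Mill Lane", "city": "Axebridge"}},
    {"name": "Bob", "address": {"line1": "9 High St", "city": "Wells"}},
    {"name": "Cy", "address": {"line1": "4 Bridge Rd", "city": "Axebridge"}},
    {"name": "Di", "address": {"line1": "2 Quay St", "city": "Bristol"}},
]

def addresses_by_scan(people, city):
    """Gather-and-inspect: look at every Person node, discard the non-matches."""
    inspected, found = 0, []
    for person in people:
        inspected += 1
        if person["address"]["city"] == city:
            found.append(person["address"]["line1"])
    return found, inspected

# Fanout model: one shared City node per city, linked to its Address nodes.
city_to_addresses = {}
for person in people:
    addr = person["address"]
    city_to_addresses.setdefault(addr["city"], []).append(addr["line1"])

def addresses_by_traversal(city_index, city):
    """Anchor on the single City node and follow only its own relationships."""
    found = list(city_index.get(city, []))
    return found, len(found)

scan_result, scan_cost = addresses_by_scan(people, "Axebridge")
walk_result, walk_cost = addresses_by_traversal(city_to_addresses, "Axebridge")
```

Both functions return the same two addresses, but the scan inspects all four Person nodes while the traversal touches only the two Address nodes connected to Axebridge.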

Best practices for modeling relationships

Your model must address:

  • Using specific relationship types.

  • Reducing symmetric relationships.

  • Using types vs. properties.

Using specific relationship types


Being specific with relationship types allows you to reduce gather-and-inspect behavior. In this case, if you are only interested in which libraries will be INSTALLED by an app, the specific types on the right save you some wasted traversal.
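As a rough illustration, the sketch below compares a generic relationship type that carries its detail in a property with specific types that carry it in the type name. The original diagram is not reproduced here, so the type names `DEPENDS_ON`, `INSTALLS`, `TESTS_WITH`, and `BUILDS_WITH`, the `phase` property, and the library names are all assumptions.

```python
from collections import defaultdict

# Generic model: one hypothetical DEPENDS_ON type; the detail is a property.
generic = [
    ("app", "DEPENDS_ON", "libssl", {"phase": "install"}),
    ("app", "DEPENDS_ON", "pytest", {"phase": "test"}),
    ("app", "DEPENDS_ON", "gcc", {"phase": "build"}),
    ("app", "DEPENDS_ON", "libxml", {"phase": "install"}),
]

def installed_generic(rels, start):
    """Every relationship leaving `start` is inspected; non-matches are discarded."""
    inspected, found = 0, []
    for src, _rel_type, dst, props in rels:
        if src != start:
            continue
        inspected += 1
        if props["phase"] == "install":
            found.append(dst)
    return found, inspected

# Specific model: the same edges, with the detail moved into the type name
# and relationships grouped by (start node, relationship type).
SPECIFIC_TYPE = {"install": "INSTALLS", "test": "TESTS_WITH", "build": "BUILDS_WITH"}
specific = defaultdict(list)
for src, _rel_type, dst, props in generic:
    specific[(src, SPECIFIC_TYPE[props["phase"]])].append(dst)

def installed_specific(adjacency, start):
    """Only relationships of the wanted type are followed."""
    found = list(adjacency.get((start, "INSTALLS"), []))
    return found, len(found)

generic_result, generic_cost = installed_generic(generic, "app")
specific_result, specific_cost = installed_specific(specific, "app")
```

Both models return the same libraries, but the generic model has to inspect all four relationships, while the specific model follows only the two of the requested type.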

But… not too specific


But it is possible to be too specific! The model on the right makes it impossible to write generalized queries. If you want to find every person who works at a given address, you would need to write a massive WHERE clause (using OR) that lists every relationship type, because each person's name is embedded in the type.

No symmetric relationships


Semantically symmetric relationships present two problems.

First, they are a form of needless data duplication. PARENT_OF and CHILD_OF mean exactly the same thing. You cannot have one be true and the other not.

Second, they allow you to violate the Cypher expectation of relationship uniqueness. Semantically, you have two identical relationships; they only look different technically. This allows you to traverse what is effectively the same relationship twice.

Relationships require space in the graph so minimizing their numbers is always a good thing.
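A minimal sketch of the alternative: store the relationship once and read it in either direction. The data and helper names are hypothetical, and a plain list of pairs stands in for the graph.

```python
# PARENT_OF stored once per pair as (parent, child); the names are made up.
parent_of = [("Ann", "Bob"), ("Ann", "Cy"), ("Bob", "Di")]

def children_of(person):
    """Follow PARENT_OF in its stored direction."""
    return [child for parent, child in parent_of if parent == person]

def parents_of(person):
    """Follow the same PARENT_OF relationships against their direction."""
    return [parent for parent, child in parent_of if child == person]

# Adding a mirrored CHILD_OF for every pair doubles the stored relationships
# without adding any information.
child_of = [(child, parent) for parent, child in parent_of]
total_with_mirror = len(parent_of) + len(child_of)
```

Because a relationship can be traversed in both directions, `parents_of` answers the CHILD_OF question from the PARENT_OF data alone, and the mirrored list only increases the relationship count from three to six.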

Semantics of symmetry are important


Not all mutual relationships are semantically symmetric.

Here is an example where the direction of the FOLLOWS relationship on Twitter is significant. It matters who follows whom.

Using types vs. properties


Both of these models represent the same idea in different ways. Neither is strictly superior; both are optimized for a certain kind of question.

For example, if you want to find all LOVES relationships, but ignore the weaker LIKES ones, the top model is better. Traversal will not involve any gather-and-inspect. On the other hand, if you want to rank the strength of all relationships, the bottom model is better. The Cypher query required is much simpler, and there will not be any gather-and-inspect discards because we want everything anyway.
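The two questions can be sketched against both models. The diagram is not reproduced here, so the `FEELS` type, the `strength` property, its numeric values, and the people are assumptions; only LOVES and LIKES come from the text.

```python
# Top model (types): relationships grouped by type, as (start, end) pairs.
by_type = {
    "LOVES": [("Ann", "Bob")],
    "LIKES": [("Ann", "Cy"), ("Bob", "Ann")],
}

# Bottom model (property): one hypothetical FEELS type with a numeric strength.
with_property = [
    ("Ann", "FEELS", "Bob", {"strength": 2}),
    ("Ann", "FEELS", "Cy", {"strength": 1}),
    ("Bob", "FEELS", "Ann", {"strength": 1}),
]

# Question 1: only the LOVES relationships.
# Top model: follow one type and nothing is discarded.
loves_top = list(by_type["LOVES"])
# Bottom model: every relationship is inspected and the weaker ones discarded.
loves_bottom = [(s, e) for s, _t, e, p in with_property if p["strength"] == 2]

# Question 2: rank every relationship by strength.
# Bottom model: a single sort on the property.
ranked_bottom = sorted(with_property, key=lambda r: r[3]["strength"], reverse=True)
# Top model: the ranking of the types has to be spelled out in the query itself.
TYPE_RANK = {"LOVES": 2, "LIKES": 1}
ranked_top = sorted(
    ((s, t, e) for t, pairs in by_type.items() for s, e in pairs),
    key=lambda r: TYPE_RANK[r[1]],
    reverse=True,
)
```

Each model answers both questions, but each makes one of them simpler: the typed model needs no filtering for the LOVES question, and the property model needs no hand-written ranking table.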

Property best practices

  • Property lookups have a cost.

  • Parsing a complex property adds more cost.

  • Anchors and properties used for traversal should be as simple as possible.

  • Identifiers, outputs, and decoration are OK as complex values.

When talking about properties, every best practice has exceptions. In the case of property value complexity, it depends on how the property is used.

Anchors and traversal paths that use property values need to be parsed at query time. If the cost of parsing them is high, querying slows down. Other properties, though, just need to be returned as-is because they do not require parsing. Complexity there does not matter much.

In other words, this "bad model" from before might actually be OK. If those addresses are never used as an anchor or for determining how to traverse the graph, and only as an output, then their complexity will not be an issue.
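The distinction between using a complex property as an anchor and using it as output can be sketched as follows. The pipe-delimited address format, the names, and the helper functions are hypothetical; they only illustrate where the parsing cost arises.

```python
# Toy data: each person carries one complex, delimiter-packed address string.
# The "line1|city" format is an assumption made for this sketch.
people = [
    {"name": "Ann", "address": "1 Mill Lane|Axebridge"},
    {"name": "Bob", "address": "9 High St|Wells"},
    {"name": "Cy", "address": "4 Bridge Rd|Axebridge"},
]

def people_in_city(people, city):
    """Address used as an anchor: every value must be parsed to be compared."""
    parses, names = 0, []
    for person in people:
        _line1, person_city = person["address"].split("|")
        parses += 1
        if person_city == city:
            names.append(person["name"])
    return names, parses

def address_of(people, name):
    """Address used only as output: the stored value is returned unparsed."""
    for person in people:
        if person["name"] == name:
            return person["address"]
    return None

names, parse_count = people_in_city(people, "Axebridge")
bob_address = address_of(people, "Bob")
```

Finding people by city parses all three address strings, whereas returning Bob's address involves no parsing at all, which is why the same complex value can be acceptable as output and costly as an anchor.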

Data accessibility

For each query, how much work must Neo4j do to evaluate if the traversal represents a “good” path or a “bad” one?


Neo4j allows you to put data in labels, properties, and relationship types. In most cases, there is a way you can model your graph so that data being queried is in any one of these three locations.

When designing or refactoring a model, you can estimate performance by checking how “accessible” the data is for the important queries of your application.

Check your understanding

Question 1

What are some benefits of using fanout for your nodes?

Select the correct answers.

  • Reduces the number of nodes in the graph.

  • Reduces duplication of property values.

  • Reduces the number of relationships defined in the graph.

  • Reduces gather-and-inspect traversals during a query.

Question 2

Why is it a benefit to make relationship type names as specific as possible?

Select the correct answers.

  • Reduces the number of relationships in the graph.

  • Reduces traversals through nodes that are not necessary for the query.

  • Reduces gather-and-inspect traversals during a query.

  • Reduces the number of nodes in the graph.

Question 3

Which data is most accessible for your queries?

Select the correct answers.

  • Anchor node label

  • Anchor node property that has an index

  • Node property downstream that has an index

  • Relationship properties


You can now:

  • Describe graph data modeling best practices for modeling:

    • Nodes (entities)

    • Relationships

    • Properties

  • Describe data accessibility