Refactoring Graphs

About this module

At the end of this module, you will be able to:

  • Create constraints that help with performance of node creation and queries.

  • Determine if a query can be improved.

  • Write Cypher code to refactor the data model in the graph.

  • Create indexes that will help with query performance.

  • Refactor a graph by creating intermediate nodes.

  • Refactor a graph by specializing relationships.

  • Perform batch processing to refactor a large graph.

Because the code examples in this lesson modify the database, it is recommended that you do not execute them against your database as you will be doing so in the hands-on exercises.

Evolving the model

After profiling queries and determining that they are not performing optimally, you must rethink your graph data model. You do so by examining how queries perform, especially with data at scale.

In the previous lesson, we saw that the query to retrieve the airports and flight information for flight number 1016 for airline WN ended up making two passes through the graph to perform the query. Our small database has only 100 flights in it, but it required over 4,000 db hits to perform the query.

To solve this problem, we introduce the intermediate node, Flight that is based upon the properties of the CONNECTED_TO relationship.

Here is how the model will evolve:

Refactor1

Refactoring steps

In most cases, you refactor the the implementation of the graph data model with these steps:

  1. Create constraints as needed.

  2. Execute the refactor:

    1. MATCH the data you wish to move.

    2. Create new nodes.

    3. Create new relationships.

  3. Create indexes as needed.

  4. PROFILE all queries against the new model.

If the new model performs well for all queries, delete the old model. Otherwise, leave both in place.

Refactor details

Here are the details that show the CONNECTED_TO relationship properties and how they will be refactored as the Flight properties.

Refactor2

Notice that we will be adding a unique identifier for a Flight node, flightId.

Unique FlightId

Using a unique identifier for a flight enables us to quickly find a flight by its ID during a query. We create a uniqueness constraint for that property.

The flight number for a particular airline may not be unique, especially if we take into account the origin and destination airports. That is, the same flight number may be used for different from and to airports. In addition, we want to uniquely identify the date for the flight.

The Flight.flightId property will contain the following data which will be unique for all Flight nodes:

  • airline

  • flightNumber

  • code for the origin Airport

  • code for the destination Airport

  • date

Creating the constraint

Before we perform the refactoring of the existing graph, we add the uniqueness constraint:

CREATE CONSTRAINT Flight_flightId_constraint ON (f:Flight)
       ASSERT f.flightId IS UNIQUE

Refactoring the graph

Here is the code to refactor the graph:

MATCH (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
MERGE (newFlight:Flight {flightId: connection.airline + connection.flightNumber +
       '_' + connection.date +  '_' + origin.code + '_' + destination.code })
ON CREATE SET newFlight.date = connection.date,
              newFlight.airline = connection.airline,
              newFlight.number = connection.flightNumber,
              newFlight.departure = connection.departure,
              newFlight.arrival = connection.arrival
MERGE (origin)<-[:ORIGINATES_FROM]-(newFlight)
MERGE (newFlight)-[:LANDS_IN]->(destination)

This code looks scary at first, but it relies entirely on Cypher code you already know well.

It follows our refactoring steps:

  1. Use MATCH to fetch the data we wish to move. We are looking to move the CONNECTED_TO relationship data, so we create a query that will fetch every such relationship.

  2. Use MERGE to create the new Flight nodes. As is always the best practice with MERGE, we only include the unique identifier, flightId in the MERGE statement, where that property is a concatenation of many individually-non-unique things. All other Flight properties are set in a subsequent ON CREATE SET clause, which maps 1:1 from the CONNECTED_TO properties.

  3. Use MERGE to create the relationships connecting the new nodes. We need to do this twice, since Flights have both an ORIGINATES_FROM and LANDS_IN relationship.

Index for query performance

A very common qualifier for a query is the flight number. We want lookups of a Flight by its number to be fast. A flight number is not unique, but we want to be able to look up all flights with that number quickly.

Refactor2

You create the indexes after your have executed the refactoring code. This is because index maintenance is expensive and you do not want a large refactoring to take a long time to execute.

Creating the index

Here is the code to create the index on the flight number property:

CREATE INDEX Flight_number_index FOR (f:Flight) ON (f.number)

Profiling the query

After you have refactored the graph, you then profile the query that you are hoping to improve.

Recall that the query is: What are the airports and flight information for flight number 1016 for airline WN?

Here is the original query:

PROFILE
MATCH  (origin:Airport)-[connection:CONNECTED_TO]->(destination:Airport)
WHERE connection.airline = 'WN' AND connection.flightNumber = '1016'
RETURN origin.code, destination.code, connection.date, connection.departure, connection.arrival

We must change the query to work with the new model as follows:

PROFILE
MATCH (origin)<-[:ORIGINATES_FROM]-(flight:Flight)-
      [:LANDS_IN]->(destination)
WHERE flight.airline = 'WN' AND
      flight.number = '1016' RETURN origin, destination, flight

Result of the profile

ProfileAfterRefactor

Here we see that we improved the query to require only 34 db hits. Notice also that the index is being used at the beginning of the query execution plan to anchor the query. It is much better than doing a NodeByLabelScan.

Cleaning up the graph

If you are satisfied that the new model performs best for your queries, you can then clean up the graph to remove the CONNECTED_TO relationships. Relationships, especially those with properties take up unnecessary space in the graph.

DeleteConnected_to
MATCH ()-[connection:CONNECTED_TO]->()
DELETE connection

Exercise 4: Creating Flight nodes from CONNECTED_TO relationships

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-neo4j-modeling-exercises

and follow the instructions for Exercise 4.

This exercise has 7 steps. Estimated time to complete: 30 minutes.

Adding another domain question

We need to add another question for our application:

As a frequent traveller I want to find flights from <origin> to <destination> on <date> so that I can book my business flight.

For example:

Find all the flights going from Los Angeles (LAS) to Chicago Midway International (MDW) on the 3rd January, 2019.

Implementing the query

Here is the query:

MATCH (origin:Airport {code: 'LAS'})
    <-[:ORIGINATES_FROM]-(flight:Flight)-[:LANDS_IN]->
    (destination:Airport {code: 'MDW'})
WHERE flight.date = '2019-1-3'
RETURN origin, destination, flight
You will work with this query in the next exercise where you have loaded 10k nodes into the graph.

Performing another refactor

We want to introduce AirportDay nodes so that we do not have to scan through all the flights going from an airport when we are only interested in a subset of them.

This is an instance where we do not want to remove the relationships between airports and flights because we need them for our first query "What are the airports and flight information for flight number 1016 for airline WN?".

Refactor4
You only pull out a node if you are going to query through it, otherwise a property will suffice.

You must not be too aggressive with creating nodes from other nodes. If you pull out every single property and create nodes then you end up with an RDF model and lose the benefit of the property graph.

Refactor details

Refactor4Details

In this case we are adding the AirportDay node that will have date information. That way we do not have to go through the Flight nodes to find a date. Just like the Flight nodes, it will have a unique ID, AirportDay.airportDayId so that it can use it in the query.

CREATE CONSTRAINT AirportDay_airportDayId_constraint ON (a:AirportDay)
       ASSERT a.airportDayId IS UNIQUE

Then, you will want to:

  • MATCH the data you want to move.

  • Create the new AirportDay nodes.

  • Connect the new nodes to the existing graph.

Refactor implementation

MATCH (origin:Airport)<-[:ORIGINATES_FROM]-(flight:Flight)-
      [:LANDS_IN]->(destination:Airport)
MERGE (originAirportDay:AirportDay {airportDayId: origin.code + '_' + flight.date})
SET originAirportDay.date = flight.date
MERGE (destinationAirportDay:AirportDay
      {airportDayId: destination.code + '_' + flight.date})
SET destinationAirportDay.date = flight.date
MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (flight)-[:ORIGINATES_FROM]->(originAirportDay)
MERGE (flight)-[:LANDS_IN]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

MERGE enables us to add a single AirportDay node per the airportDayId value and also ensure that only one relationship is created between a Flight and an AirportDay node.

Profile our first query

After a refactor, you must check that all queries perform OK. Here is our first query: What are the airports and flight information for flight number 1016 for airline WN?

PROFILE
MATCH (origin)<-[:ORIGINATES_FROM]-(flight:Flight)-
      [:LANDS_IN]->(destination)
WHERE flight.airline = 'WN' AND
      flight.number = '1016' RETURN origin, destination, flight
FirstQueryAfterRefactor4

This query previously had 34 db hits, but now has 181. But, we added another 10K nodes to the graph so this is a reasonable outcome.

Profile our original second query

Then we want to profile our second query after the refactor: Find all the flights going from Los Angeles (LAS) to Chicago Midway International (MDW) on the 3rd January, 2019.

PROFILE MATCH (origin:Airport {code: 'LAS'})
    <-[:ORIGINATES_FROM]-(flight:Flight)-
    [:LANDS_IN]->
    (destination:Airport {code: 'MDW'})
WHERE flight.date = '2019-1-3'
RETURN origin, destination, flight
OriginalSecondQueryAfterRefactor4

Here we see that the query has 5982 db hits, which is worse than the 5532 we had earlier before the refactor. This is because we need to incorporate the new model into the query.

Profile our revised second query

Here is a rewrite of the second query: Find all the flights going from Los Angeles (LAS) to Chicago Midway International (MDW) on the 3rd January, 2019. Due to the change in the model, we must rewrite the query as:

PROFILE MATCH (origin:Airport {code: 'LAS'})-
    [:HAS_DAY]->(:AirportDay {date: '2019-1-3'})<-
    [:ORIGINATES_FROM]-(flight:Flight),
    (flight)-[:LANDS_IN]->
    (:AirportDay {date: '2019-1-3'})<-
    [:HAS_DAY]-(destination:Airport {code: 'MDW'})
RETURN origin, destination, flight
SecondQueryAfterRefactor4

Here we see that the query has 4271 db hits, which is better than the 5532 we had earlier before the refactor. As the number of nodes and relationships grows in the graph, these performance differences will be significant.

What we have learned is that we have to change the model and the query.

Exercise 5: Creating the AirportDay node from the Airport and Flight nodes

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-neo4j-modeling-exercises

and follow the instructions for Exercise 5.

This exercise has 7 steps. Estimated time to complete: 30 minutes.

More questions for the model?

We now have a model that performs well for these questions:

  1. What are the airports and flight information for flight number xx for airline yy?

  2. Find all the flights going from xx to yy on the date zz.

What if we added this question: Which airport has the most incoming flights?

FinalModel

Our latest model serves our two questions very well. However, it does not do a good job with the new question. That is, we must leave the ORIGINATES_FROM and LANDS_IN relationships between Airport and Flight nodes in the graph. In this case, we leave both models in place, and use each one for the questions it is suited to.

Another question for the model

Suppose we added this question: What are the flights from LAS that arrive at MDW on 2019-1-3?

To answer this question with the current model our query would be:

PROFILE
MATCH (origin:Airport {code: 'LAS'})-[:HAS_DAY]->
      (originDay:AirportDay),
      (originDay)<-[:ORIGINATES_FROM]-(flight:Flight),
      (flight)-[:LANDS_IN]->(destinationDay),
      (destinationDay:AirportDay)<-[:HAS_DAY]-
      (destination:Airport {code: 'MDW'})
WHERE originDay.date = '2019-1-3' AND
destinationDay.date = '2019-1-3'
RETURN flight.date, flight.number, flight.airline,
flight.departure, flight.arrival
ORDER BY flight.date, flight.departure
LASToMDWBad

This query requires 6137 db hits with our current graph. It needs to traverse all the HAS_DAY relationships between the Airport and AirportDay nodes found.

Neo4j is optimized for searching by relationship types. As we add more data, the number of HAS_DAY relationships that we have to traverse increases.

If we have 10 years worth of data we have to traverse 3,650 relationships from the Airport to find the AirportDay that we are interested in.

Refactoring for specific relationships

A best practice for graph data modeling is to make relationships more specific if that will help with query performance. Here we can modify the HAS_DAY relationship to be ON_2019-1-3, ON_2019_104, etc.

SpecificRelationshipsModel

APOC to the rescue!

With APOC, you can create relationships based upon data in the graph.

Here is the syntax:

apoc.create.relationship(startNode(<relationship-variable>),
                         '<new-relationship-value>',
                         {<relationship-property list},
                         endNode(<relationship-variable>)
                         )
                         YIELD rel

Given a relationship variable between two existing nodes, this procedure enables you to create a new, custom relationship that could be based upon property values. Calling this procedure where rel is returned enables you to either return the new relationship created or return a count of the number of relationships created in the graph.

Creating specialized relationships with APOC

Here is the code to transform the HAS_DAY relationships to specific relationships:

MATCH (origin:Airport)-[hasDay:HAS_DAY]->(ad:AirportDay)
CALL apoc.create.relationship(startNode(hasDay),
                              'ON_' + ad.date,
                              {},
                              endNode(hasDay) ) YIELD rel
RETURN COUNT(*)

Refactoring result

Here is the result of the refactoring to create specific relationships:

APOCToRefactor

Does the query improve?

Since the model has changed, we need to rewrite the query:

PROFILE
MATCH (origin:Airport {code: 'LAS'})-[:`ON_2019-1-3`]->
      (originDay:AirportDay),
      (originDay)<-[:ORIGINATES_FROM]-(flight:Flight),
      (flight)-[:LANDS_IN]->(destinationDay),
      (destinationDay:AirportDay)<-[:`ON_2019-1-3`]
      -(destination:Airport {code: 'MDW'})
RETURN flight.date, flight.number, flight.airline,
flight.departure, flight.arrival
ORDER BY flight.date, flight.departure
AfterAPOCRefactor

This query required 4108 db hits, where previously, it required 6137 hits. You can imagine that a fully-loaded graph with years of data could be vastly improved with this type of refactoring.

Of course, you must do due diligence and ensure that all of the previous queries still perform well.

Exercise 6: Creating specific relationships

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-neo4j-modeling-exercises

and follow the instructions for Exercise 6.

This exercise has 2 steps. Estimated time to complete: 15 minutes.

Refactoring large graphs

Suppose you scale the test data in your graph, or you have a large production graph and a new question must be added that requires a change to the model. Refactoring a large graph has its challenges due to the amount of memory required to perform the refactor.

Cypher keeps all transaction state in memory while running a query, which is fine most of the time.

When refactoring the graph, however, this state can get very large and may result in an OutOfMemory exception.

You must adapt your heap size to match, or operate in batches. For example increase these values for the server in the neo4j.conf file:

  • dbms.memory.heap.initial_size=2G (default is 512m)

  • dbms.memory.heap.max_size=2G (default is 1G)

Batching the refactoring process

Here is one way that you can control how much work is done for a refactoring:

  1. Tag all the nodes we need to process with a temporary label (for example Process).

MATCH (f:Flight)
SET f:Process
  1. Iterate over a subset of nodes flagged with the temporary label (using LIMIT):

    1. Execute the refactoring code.

    2. Remove the temporary label from the nodes.

    3. Return a count of how many rows were processed in the iteration.

  2. Once the count reaches 0, then the refactoring is finished.

Example code for a batch

Here is code that will process all nodes with the label Process:

MATCH (flight:Process)
WITH flight LIMIT 500

MATCH (origin:Airport)<-[:ORIGINATES_FROM]-(flight)-[:LANDS_IN]->(destination:Airport)

MERGE (originAirportDay:AirportDay {airportDayId: origin.code + "_" + flight.date})
ON CREATE SET originAirportDay.date = flight.date

MERGE (destinationAirportDay:AirportDay {airportDayId: destination.code + "_" + flight.date})
ON CREATE SET destinationAirportDay.date = flight.date

MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (originAirportDay)<-[:ORIGINATES_FROM]-(flight)
MERGE (flight)-[:LANDS_IN]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

REMOVE flight:Process
RETURN count(*)

You have previously used this refactoring code to create the AirportDay nodes and their relationships to Airport and Flight nodes.

This is a variation of the code we executed previously to create the AirportDay nodes from Flight nodes. The highlighted areas will be part of the iteration where we do batches of 500 flights at a time and once the creation of the AirportDay node is completed, we remove the Process label from the Flight node.

Using apoc.periodoc.commit

APOCPeriodicCommit

APOC currently has 9 procedures that can help you with batching. We will focus on using apoc.periodic.commit().

Batching with APOC

CALL apoc.periodic.commit('
MATCH (flight:Process)
WITH flight LIMIT $limit

MATCH (origin:Airport)<-[:ORIGINATES_FROM]-(flight)-[:LANDS_IN]->(destination:Airport)

MERGE (originAirportDay:AirportDay {airportDayId: origin.code + "_" + flight.date})
ON CREATE SET originAirportDay.date = flight.date

MERGE (destinationAirportDay:AirportDay {airportDayId: destination.code + "_" + flight.date})
ON CREATE SET destinationAirportDay.date = flight.date

MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (originAirportDay)<-[:ORIGINATES_FROM]-(flight)
MERGE (flight)-[:LANDS_IN]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

REMOVE flight:Process
RETURN count(*)

',{limit:500}
)

Here we include the refactoring code for creating the AirportDay nodes/relationships in our call to apoc.periodic.commit() where we specify that the refactoring code will create 500 AirportDay nodes in a single transaction.

Result of the batch processing

APOCBatchExecution

Exercise 7: Refactoring large graphs

In the query edit pane of Neo4j Browser, execute the browser command:

:play 4.0-neo4j-modeling-exercises

and follow the instructions for Exercise 7.

This exercise has 8 steps. Estimated time to complete: 30 minutes.

Check your understanding

Question 1

Suppose you want to create Person nodes in the graph, each with a unique value for the personID property . What must you do to ensure that nodes are unique?

Select the correct answers.

  • Test the existence of the Person node with the personID property value before you use CREATE to create it.

  • Create an existence constraint for the personID property of the Person node.

  • Use MERGE to create the Person node with a unique property value specified for personID.

  • Create a uniqueness constraint for the personID property of the Person node.

Question 2

In many cases, how will refactoring change the graph?

Select the correct answers.

  • Add more nodes to the graph to represent the same data.

  • Reduce nodes by consolidating data.

  • Add more relationships to the graph.

  • Reduce the number of relationships in the graph.

Question 3

What is the recommended method for batching refactorings for a large graph?

Select the correct answer.

  • Use the FOREACH clause in Cypher.

  • Use a DO WHILE clause in Cypher.

  • Use db.iterate().

  • Use apoc.periodic.commit().

Summary

You can now:

  • Create constraints that help with performance of node creation and queries.

  • Determine if a query can be improved.

  • Write Cypher code to refactor the data model in the graph.

  • Create indexes that will help with query performance.

  • Refactor a graph by creating intermediate nodes.

  • Refactor a graph by specializing relationships.

  • Perform batch processing to refactor a large graph.