Managing Risk in a Manufacturing Plant With Neo4j Aura Graph Analytics

Corydon Baylor

Sr. Manager, Technical Product Marketing, Neo4j

In this blog, we’ll show you how to run graph algorithms on a digital twin of your manufacturing processes using Neo4j Aura Graph Analytics. By representing different types of machines and their workflows as connected nodes and relationships, you gain a holistic view of how your entire ecosystem operates.

This connected perspective makes it easier to spot bottlenecks, anticipate equipment failures, and simulate disruptions before they happen. Graph algorithms can uncover hidden patterns and relationships, such as clusters of components that fail together or critical suppliers whose downtime would ripple through production. With this insight, manufacturers can move from reactive problem-solving to predictive maintenance, smarter supply chain planning, and more resilient operations.

Setting Up the Environment

First, we need to load our data into an Aura instance that supports Aura Graph Analytics. To do this, we need to do the following:

  1. Create a new database
  2. Create the Machine nodes using Machines.cypher
  3. Create the FEEDS_INTO relationships using Feed_Relationships.cypher

The analysis uses simulated manufacturing data loaded into Neo4j. There are two node types (Machine, Sensor) and two relationships (FEEDS_INTO for production flow, LOGS for sensor readings).

I’ve written this guide using Google Colab, so this tutorial contains references to Colab features. However, you may run it in any Python environment that has access to the compute resources needed to execute our algorithms.

We need to install the graphdatascience package and load all of our secrets:

!pip install graphdatascience
from graphdatascience.session import GdsSessions, AuraAPICredentials, DbmsConnectionInfo, AlgorithmCategory
from neo4j import GraphDatabase
import pandas as pd
from datetime import timedelta
from google.colab import userdata

First generate your credentials in Neo4j Aura. You can then store them securely using Colab secrets:

# This credential is the Organization ID
TENANT_ID=userdata.get('TENANT_ID')

# These credentials were generated after the creation of the Aura Instance
NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_USERNAME = userdata.get('NEO4J_USERNAME')
NEO4J_PASSWORD = userdata.get('NEO4J_PASSWORD')

# These credentials were generated after the creation of the API Endpoint
CLIENT_SECRET=userdata.get('CLIENT_SECRET')
CLIENT_ID=userdata.get('CLIENT_ID')
CLIENT_NAME=userdata.get('CLIENT_NAME')

Estimate resources based on graph size and create a session with a two‑hour TTL:

sessions = GdsSessions(api_credentials=AuraAPICredentials(CLIENT_ID, CLIENT_SECRET, TENANT_ID))

session_name = "demo-session"
memory = sessions.estimate(
    node_count=1000, relationship_count=5000,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)

db_connection_info = DbmsConnectionInfo(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

Then initialize the session itself:

# Initialize a GDS session scoped for 2 hours, sized to estimated graph
gds = sessions.get_or_create(
    session_name,
    memory=memory,
    db_connection=db_connection_info, # this is checking for a bolt server currently
    ttl=timedelta(hours=2),
)

print("GDS session initialized.")

We’re also going to include a helper function here:

# Helper: execute Cypher and return a pandas DataFrame
def query_to_df(cypher: str, params=None):
    # The context managers close the driver and session even if the query fails
    with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD)) as driver:
        with driver.session() as session:
            result = session.run(cypher, params or {})
            return pd.DataFrame([r.data() for r in result])

Creating Our Graph Projection

We project all Machine and Sensor nodes and their FEEDS_INTO and LOGS relationships into a single in-memory graph called full.

Building one consistent projection ensures that every algorithm runs against the same structure. It helps to avoid analysis discrepancies and makes the results easier to interpret and compare. However, you can use multiple projections if you need to have your data structured differently for different sections of your analysis. 

# Define the custom Cypher query for projecting the graph

if gds.graph.exists("full")["exists"]:
    gds.graph.drop("full")

query = """
CALL {
    MATCH (source)-[rel]->(target)
    RETURN
        source,
        rel,
        target
}
RETURN gds.graph.project.remote(source, target, {
    sourceNodeLabels: labels(source),
    targetNodeLabels: labels(target),
    relationshipType: type(rel)
});

"""

# Project the graph into GDS
full_graph, result = gds.graph.project(
    graph_name="full",
    query=query
)

Connectivity Analysis

Structural connectivity examines how the nodes in your plant are linked together and identifies hidden structural risks. We use two complementary methods:

  • Weakly Connected Components (WCC) – WCC treats the graph as undirected, ignoring the direction of edges. It groups nodes that can reach each other regardless of direction. If your graph breaks into multiple components, it suggests segmented workflows or isolated equipment groups that may represent operational blind spots or disconnected production lines.
  • Strongly Connected Components (SCC) – SCC respects edge direction and identifies true directed loops (A→B→…→A). Cycles in production graphs often correspond to scrap-and-rework loops or inefficient recycling, which can cause hidden costs or production bottlenecks. Finding non-trivial SCCs (those with more than one node) helps target areas for workflow correction.

WCC gives a high-level view of connectivity, highlighting whether your plant functions as one unified system. SCC drills down into specific cycle structures that could be costing efficiency.
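
To make the WCC idea concrete, here is a minimal plain-Python sketch that groups machines while ignoring edge direction. It is not how Aura Graph Analytics computes components; the union-find approach, the function name, and the toy machine IDs are all assumptions made for illustration.

```python
# Toy production graph: each pair (a, b) means "machine a feeds into machine b".
# The machine IDs are invented for illustration.
edges = [(1, 2), (2, 3), (4, 5)]

def weakly_connected_components(edges):
    """Group nodes that are linked when edge direction is ignored (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())

components = weakly_connected_components(edges)
print(components)  # [[1, 2, 3], [4, 5]]
```

Two groups come back here because nothing links machines 4 and 5 to the rest of the line, which is the kind of isolated cluster WCC is meant to surface.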

Weakly Connected Components

We’re hoping to see a single connected component because it suggests an integrated production network. Multiple smaller components imply isolated lines or equipment clusters that may need integration or review.

Notice how we’re using the gds.wcc.write method? This means we’ll write the component ID back to our database:

gds.wcc.write(full_graph, writeProperty='wcc')

query = """
    MATCH (m:Machine)
    RETURN m.id AS machine, m.wcc AS component
    ORDER BY component, machine
    """
df = gds.run_cypher(query)

counts = df.groupby("component")["machine"].count().reset_index(name="machine_count")

counts

component  machine_count
0          20

We did find that all of our machines fit into one WCC! Each machine node now has a component of 0.

Strongly Connected Components

Each SCC represents a set of machines with a directed path from any machine to every other in the group. Components with multiple machines often signal rework loops, material recycling paths, or cyclic flows that can slow production and waste capacity.

In this case, we’ll use the gds.scc.stream method to stream our results directly into a Python dataframe:

scc = gds.scc.stream(full_graph)

scc.groupby("componentId").filter(lambda g: len(g) > 1)

nodeId  componentId
4       6
5       6
7       9
8       9

Machines 4 and 5 form a closed loop (SCC 6), and machines 7 and 8 form another (SCC 9), meaning each pair feeds back into the other. These directed cycles often signal scrap‐and‐rework loops or inefficiencies that you’ll want to investigate and break.
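
To see why two machines that feed each other land in the same component, here is a small plain-Python sketch using Kosaraju's algorithm. This is an illustrative stand-in, not the implementation behind gds.scc.stream, and apart from the 4 ↔ 5 loop the edges are invented.

```python
from collections import defaultdict

# Toy edges; only the 4 <-> 5 loop mirrors the result above, the rest is invented.
edges = [(3, 4), (4, 5), (5, 4), (5, 6)]

def strongly_connected_components(edges):
    """Kosaraju's algorithm: DFS finish order, then DFS on the reversed graph."""
    forward, backward = defaultdict(list), defaultdict(list)
    nodes = set()
    for a, b in edges:
        forward[a].append(b)
        backward[b].append(a)
        nodes.update((a, b))

    def dfs(start, graph, seen, out):
        seen.add(start)
        for nxt in graph[start]:
            if nxt not in seen:
                dfs(nxt, graph, seen, out)
        out.append(start)  # appended once all descendants are finished

    finish_order, seen = [], set()
    for node in sorted(nodes):
        if node not in seen:
            dfs(node, forward, seen, finish_order)

    components, seen = [], set()
    for node in reversed(finish_order):
        if node not in seen:
            members = []
            dfs(node, backward, seen, members)
            components.append(sorted(members))
    return sorted(components)

loops = [c for c in strongly_connected_components(edges) if len(c) > 1]
print(loops)  # [[4, 5]]
```

Machines 3 and 6 each end up in a component of their own, so filtering for components with more than one member leaves only the loop.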

Criticality Analysis

Identifying the most critical machines in your workflow can help avoid shutdowns. If these machines slow down or fail, downstream operations halt. 
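
As a rough illustration of what "downstream operations halt" means, the sketch below walks FEEDS_INTO edges forward from one machine and lists everything it ultimately supplies. The edge list and the function name are hypothetical, and this simple traversal is separate from the PageRank analysis that follows.

```python
from collections import defaultdict, deque

# Hypothetical line: machines 1 and 2 feed 3, which feeds 4, which feeds 5.
edges = [(1, 3), (2, 3), (3, 4), (4, 5)]

def downstream_of(machine, edges):
    """Return every machine reachable by following feed edges forward."""
    feeds = defaultdict(list)
    for a, b in edges:
        feeds[a].append(b)
    reached, queue = set(), deque([machine])
    while queue:
        current = queue.popleft()
        for nxt in feeds[current]:
            if nxt not in reached and nxt != machine:
                reached.add(nxt)
                queue.append(nxt)
    return sorted(reached)

print(downstream_of(3, edges))  # [4, 5]
print(downstream_of(1, edges))  # [3, 4, 5]
```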

We’ll use PageRank to identify and flag bottlenecks in our processes. Designed initially to rank web pages, PageRank measures a node’s importance by the quality and quantity of incoming edges. 

In our graph, an edge A→B means “Machine A feeds into Machine B.” A high PageRank score indicates a machine that receives material from many other well-connected machines.
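
For intuition, here is a simplified power-iteration sketch of PageRank in plain Python. It assumes the unnormalized update score = (1 - d) + d × (sum of incoming shares) and does nothing special for machines with no outgoing edges, so treat it as a teaching aid rather than the library's implementation. The edge list is hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Plain power iteration; a simplified stand-in for illustration only."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for a, _ in edges:
        out_degree[a] += 1
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for a, b in edges:
            # Each machine splits its current score evenly across its outputs
            incoming[b] += score[a] / out_degree[a]
        score = {n: (1 - damping) + damping * incoming[n] for n in nodes}
    return score

# Hypothetical flow: machines 1, 2, and 3 all feed machine 4, which feeds 5.
edges = [(1, 4), (2, 4), (3, 4), (4, 5)]
scores = pagerank(edges)
for machine, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(machine, round(value, 3))
```

In this toy line, machine 5 ends up with the highest score (about 0.60) even though it has a single input, because that input is machine 4 (about 0.53), which itself collects from three machines. That is the "quality of incoming edges" effect at work.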

res = gds.pageRank.stream(
    full_graph,
    maxIterations=20,
    dampingFactor=0.85
)

res.sort_values("score", ascending=False).head(5)

nodeId  score
19      1.510188
18      1.229186
8       0.871840
7       0.849414
17      0.749385

Machine 19 sits at the heart of our production flow. It has the highest PageRank score, meaning it receives the most inbound flow from well-connected machines, and so it carries the greatest operational weight. Machines 18 and 8 also serve as major hubs, channeling significant material or information.

Structural Embeddings and Similarity

Getting an even deeper understanding of each machine’s workflow requires more than looking at direct connections, as we’ve done so far. Structural embeddings capture broader patterns by summarizing each machine’s position in the overall operation into a numeric vector. This allows you to:

  • Group machines with similar roles or dependencies
  • Identify candidates for backup or load balancing
  • Spot unusual machines that behave differently from the rest of the plant

We use embeddings to make these comparisons based on immediate neighbors and overall graph structure.

We’ll use two Graph Analytics algorithms:

  • Fast Random Projection (FastRP) – FastRP generates a compact 16-dimensional vector for each machine. These vectors are built by sampling the graph around each node, so two machines with similar surroundings will end up with similar embeddings.
  • k-Nearest Neighbors (kNN) – Finds, for each machine, the top k most similar peers based on cosine similarity of their embeddings.

Together, embeddings and kNN surface structural affinities beyond simple degree or centrality measures.
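
The sketch below shows the core intuition behind embedding a node from its surroundings: give every node a sparse random vector, then describe each node by the normalized sum of its neighbors' vectors. It is heavily simplified (one round of averaging, edges treated as undirected, no iteration weights), so it should not be read as the actual FastRP algorithm. The function name and toy graph are assumptions.

```python
import math
import random

def toy_embeddings(edges, dimension=16, seed=42):
    """One round of neighbor aggregation over sparse random vectors (simplified)."""
    rng = random.Random(seed)
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)  # treat edges as undirected
        neighbors.setdefault(b, set()).add(a)
    # Sparse random start vector per node: entries drawn from {-1, 0, 0, +1}
    base = {n: [rng.choice((-1.0, 0.0, 0.0, 1.0)) for _ in range(dimension)]
            for n in sorted(neighbors)}
    embeddings = {}
    for node, nbrs in neighbors.items():
        summed = [sum(base[m][i] for m in nbrs) for i in range(dimension)]
        norm = math.sqrt(sum(x * x for x in summed)) or 1.0
        embeddings[node] = [x / norm for x in summed]  # L2-normalize
    return embeddings

# Hypothetical graph: machines 1 and 2 both feed only machine 3.
edges = [(1, 3), (2, 3), (3, 4)]
embeddings = toy_embeddings(edges)
print(embeddings[1] == embeddings[2])  # True
```

Machines 1 and 2 have exactly the same surroundings, so they receive exactly the same vector, which is the property the similarity search later relies on.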

Fast Random Projection (FastRP) Embeddings

The results for FastRP are not immediately interpretable. However, machines with nearly identical embeddings have similar upstream and downstream relationships and likely play the same role in the plant. These embeddings are numerical representations that enable downstream clustering, similarity search, or anomaly detection:

# Run FastRP and write embeddings to each Machine node property 'embedding'
print("Running FastRP embeddings…")
res = gds.fastRP.write(
    full_graph,
    writeProperty='embedding',
    embeddingDimension=16,
    randomSeed=42
)



query="""
    MATCH (m:Machine)
    RETURN m.id AS machine, m.embedding AS embedding
    ORDER BY machine
    LIMIT 5
    """

gds.run_cypher(query)

machine  embedding
1        [-0.37796443700790405, 0.0, -0.277350097894668…
2        [0.0, 0.0, 0.05598324537277222, 0.610683441162…
3        [0.2182178944349289, 0.0, 0.5515512228012085, …
4        [0.3333333432674408, 0.0, 0.7415816783905029, …
5        [0.24253563582897186, 0.0, 0.6507839560508728,…

Our initial graph projection doesn’t include any property information, so we have to create a new graph projection that includes the new embedding property we created for any future downstream algorithms:

query = """
    CALL {
        MATCH (m1)
        WHERE m1.embedding IS NOT NULL
        OPTIONAL MATCH (m1)-[r]->(m2)
        WHERE m2.embedding IS NOT NULL
        RETURN m1 AS source, r AS rel, m2 AS target,
               {embedding: m1.embedding} AS sourceNodeProperties,
               {embedding: m2.embedding} AS targetNodeProperties
    }
    RETURN gds.graph.project.remote(source, target, {
      sourceNodeProperties: sourceNodeProperties,
      targetNodeProperties: targetNodeProperties,
      sourceNodeLabels: labels(source),
      targetNodeLabels: labels(target)
    })
    """

# Project the graph into GDS
embeddings_graph, result = gds.graph.project(
    graph_name="embeddings",
    query=query
)

k-Nearest Neighbors

Once we have embeddings for every machine, we can use kNN to find the most structurally similar machines based on their vector representations. This compares the cosine similarity between embeddings to pull out the top matches for each machine.

Machines with a similarity score close to 1.0 are operating in nearly identical parts of the workflow. These machines may be interchangeable, ideal backups for each other, or grouped for shared maintenance plans.
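
Here is a minimal sketch of what the comparison boils down to: cosine similarity between two vectors, then keeping the best k matches per node. It compares every pair by brute force, which is fine for a toy example but is an assumption of this sketch and not a description of how the library's kNN searches. The embeddings are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(embeddings, k):
    """Brute-force nearest neighbors: compare every pair, keep the k best per node."""
    rows = []
    for node, vector in embeddings.items():
        scored = [(other, cosine_similarity(vector, other_vec))
                  for other, other_vec in embeddings.items() if other != node]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rows.extend((node, other, sim) for other, sim in scored[:k])
    return rows

# Made-up 3-dimensional embeddings: A and B point in the same direction.
embeddings = {"A": [1.0, 0.0, 1.0], "B": [2.0, 0.0, 2.0], "C": [0.0, 1.0, 0.0]}
rows = top_k_similar(embeddings, k=1)
for row in rows:
    print(row)
```

A and B come out as each other's closest match with a similarity of essentially 1, because cosine similarity only looks at direction and ignores vector length.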

# Stream top-5 similar per machine
knn_stream = gds.knn.stream(
    embeddings_graph,                 # the projection that carries the embedding property
    nodeProperties=["embedding"],
    topK=5
)

knn_df = pd.DataFrame(knn_stream)

knn_df.sort_values("similarity", ascending=False).head(10)

node1  node2  similarity
17     9      1.000000
9      17     1.000000
3      4      0.964767
4      3      0.964767
6      7      0.923316

Each row with a similarity of 1.0 means those two machines occupy essentially the exact same structural “neighborhood” in your workflow graph. For example, Machines 9 and 17 form a tight clique of interchangeable roles. You can treat each of these clusters as functionally equivalent units – ideal candidates for load-balancing, redundancy checks, or targeted process tuning.
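
One way to turn these rows into candidate groups is to merge machines whose similarity clears a cutoff. The sketch below does this in plain Python using the pairs from the output above; the 0.99 threshold and the function name are assumptions for illustration, and a real cutoff would need to be tuned for your plant.

```python
# Pairs copied from the kNN output above: (node1, node2, similarity).
pairs = [(17, 9, 1.000000), (9, 17, 1.000000), (3, 4, 0.964767),
         (4, 3, 0.964767), (6, 7, 0.923316)]

def interchangeable_groups(pairs, threshold=0.99):
    """Merge machines linked by a similarity at or above the threshold."""
    groups = []
    for a, b, similarity in pairs:
        if similarity < threshold:
            continue
        # Fold any existing groups that touch this pair into one merged group
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return sorted(sorted(g) for g in groups)

print(interchangeable_groups(pairs))        # [[9, 17]]
print(interchangeable_groups(pairs, 0.95))  # [[3, 4], [9, 17]]
```

Lowering the cutoff pulls in looser matches such as machines 3 and 4, so the threshold effectively controls how strict "interchangeable" is.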

Finally, we must close the session and end our billing:

sessions.delete(session_name=session_name)

What’s Next

Now that you’ve got a solid grasp on creating a digital twin in manufacturing, head over to our GitHub repo for step-by-step instructions on how to do it yourself with Neo4j Aura Graph Analytics. You’ll find a Colab notebook, the full dataset, and everything you need to get started. 

Prefer working in Snowflake? You can run the same example there using Neo4j Graph Analytics for Snowflake.

Resources