Build Better Recommendations With Aura Graph Analytics

Corydon Baylor

Sr. Manager, Technical Product Marketing, Neo4j

Recommendations are big business. Amazon reports that 35 percent of its revenue comes from recommendations. Even more surprisingly, Netflix and YouTube report that 75 percent and 70 percent, respectively, of what people watch on their platforms comes from recommendations. That means the majority of what we buy, watch, or even listen to is shaped by algorithms working quietly in the background.

But building a great recommendation engine isn’t just about throwing data into a model. It involves understanding user behavior, spotting patterns across millions of interactions, and surfacing just the right item at just the right time. The best systems feel almost magical — they seem to know what you want before you do.

In this blog, we’ll use Neo4j Aura Graph Analytics to build our recommendations. Graph-powered recommendations go deeper than traditional methods because they model user behavior directly as relationships between customers, orders, and products.

In our example, we’ll examine co-purchasing behavior using data sampled from Instacart. We’ll discover that simply examining items that are most frequently purchased together isn’t enough to make a good recommendation, and, interestingly, it might cause us to recommend products that customers were already planning on buying without our intervention.

So how do we build a good recommendation engine? What techniques power these systems, and how can you start applying them yourself? Follow along to find out!

Setup

Using this dataset, load the data into an AuraDB instance that has Graph Analytics installed by running these Cypher statements.

Next, we’ll create a Google Colab notebook with everything we need to conduct our analysis; from here on, we’ll work in Python. You can follow along using the Colab notebook on GitHub.

Start by loading in the packages:

!pip install graphdatascience
from graphdatascience.session import DbmsConnectionInfo, AlgorithmCategory, CloudLocation, GdsSessions, AuraAPICredentials
from datetime import timedelta
import pandas as pd
import os
from google.colab import userdata

Authentication

You must first generate your credentials in Neo4j Aura (afterward, you can store your credentials securely using Colab Secrets):

# For use in Google Colab
# This credential is the Organization ID
TENANT_ID = userdata.get('TENANT_ID')

# These credentials were generated after the creation of the Aura instance
NEO4J_URI = userdata.get('RETAIL_URI')
NEO4J_USERNAME = userdata.get('NEO4J_USER')
NEO4J_PASSWORD = userdata.get('RETAIL_PASSWORD')

# These credentials were generated after the creation of the API endpoint
CLIENT_SECRET = userdata.get('CLIENT_SECRET')
CLIENT_ID = userdata.get('CLIENT_ID')

Establishing a Session

Estimate resources based on graph size and create a session with a two‑hour TTL:

sessions = GdsSessions(api_credentials=AuraAPICredentials(CLIENT_ID, CLIENT_SECRET, TENANT_ID))

session_name = "demo-session"
memory = sessions.estimate(
    node_count=1000, relationship_count=5000,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)

db_connection_info = DbmsConnectionInfo(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)

# Initialize a GDS session scoped for 2 hours, sized to estimated graph
gds = sessions.get_or_create(
    session_name,
    memory=memory,
    db_connection=db_connection_info, # this is checking for a bolt server currently
    ttl=timedelta(hours=2),
)

print("GDS session initialized.")Code language: Python (python)

Our Data

We’re going to work with an Instacart dataset of grocery store orders, which we’ll use to create product recommendations.

Our data is currently in this format:

(User)-[:PLACED]->(Order)-[:CONTAINS]->(Product)-[:IN_AISLE]->(Aisle)-[:IN_DEPARTMENT]->(Department)

For our purposes, we’ll work with a co-purchase graph that we create from this data: it captures which items are purchased in the same market basket. But before we build it, let’s take a look at the pairs of items most frequently purchased together:

query = """MATCH (o:Order)-[:CONTAINS]->(p1:Product),
      (o)-[:CONTAINS]->(p2:Product)
WHERE id(p1) < id(p2)
RETURN p1.name AS productA,
       p2.name AS productB,
       count(*) AS timesTogether
ORDER BY timesTogether DESC
LIMIT 5;"""

df = gds.run_cypher(query)
df
productA | productB | timesTogether
Organic Baby Spinach | Banana | 24
Bag of Organic Bananas | Organic Hass Avocado | 22
Bag of Organic Bananas | Organic Strawberries | 19
Banana | Organic Avocado | 16
Organic Strawberries | Organic Hass Avocado | 15

You’ll notice that four of the five top co-purchased pairs include bananas. Logically, this would suggest that grocery stores should nearly always recommend bananas to customers, but should they?

Let’s keep working through and see if that holds up.

Creating a Graph Projection

Now that we have a good idea of what’s in our data, let’s break down the query needed to generate a graph projection.

For the inner projection:

This will look similar to the query we used to generate our co-purchase pairs. First, we find orders that contain two given products, then we add a filter to ensure we don’t count duplicate pairs:

MATCH (o:Order)-[:CONTAINS]->(p1:Product),
      (o)-[:CONTAINS]->(p2:Product)
WHERE p1.product_id < p2.product_id

We use < to ensure that we don’t count each pair twice. So if we had Milk (id: 1) and Bread (id: 2), we get (1, 2) but not also (2, 1).

Next, we count the number of times that combination occurs and save it as a float:

WITH p1, p2, toFloat(COUNT(*)) AS weight

Finally, we return our values:

RETURN p1 AS source, p2 AS target, weight

For the outer projection: 

We grab what the inner projection returned, beginning with the source and target nodes for our new relationship:

RETURN gds.graph.project.remote(
    source,
    target,
    {
    sourceNodeLabels: labels(source),
    targetNodeLabels: labels(target),

Then we give a name to the relationship:

relationshipType: 'CO_PURCHASED_WITH',

Finally, we pull in our weight property. Because we named our count as weight in the triplet from the inner query (source, target, weight), we have to access it like so:

relationshipProperties: {weight: weight}

So altogether, it looks like this:

if gds.graph.exists("copurchase")["exists"]:
    gds.graph.drop("copurchase")

query = """
CALL {
    MATCH (o:Order)-[:CONTAINS]->(p1:Product),
          (o)-[:CONTAINS]->(p2:Product)
    WHERE p1.product_id < p2.product_id
    WITH p1, p2, toFloat(COUNT(*)) AS weight
    RETURN p1 AS source, p2 AS target, weight
}
RETURN gds.graph.project.remote(
    source,
    target,
    {
        sourceNodeLabels: labels(source),
        targetNodeLabels: labels(target),
        relationshipType: 'CO_PURCHASED_WITH',
        relationshipProperties: {weight: weight}
    }
);
"""

copurchase_graph, result = gds.graph.project(
    graph_name="copurchase",
    query=query
)

Node Similarity

We’ll run nodeSimilarity to see which products tend to be purchased together. But before we do that, let’s break down how this algorithm works. Imagine we had only three baskets:

Basket | Products
1 | Eggs, Bread, Milk
2 | Bread, Milk, Butter
3 | Eggs, Bread, Butter

We can see that eggs tend to be purchased with bread, and butter tends to be purchased with bread. Therefore, we can say eggs and butter are likely to be purchased together since they commonly share neighbors in their baskets (in this case, bread). Items that share many neighbors are considered to be similar to each other.

That’s half of how the algorithm works. Bread in the above example appears in every basket, so it doesn’t actually provide much information (since it’s a shared neighbor with every item). Luckily, nodeSimilarity discounts items that have too many neighbors.

Now imagine this case:

Basket | Products
1 | Eggs, Bread, Milk
2 | Bread, Milk, Butter
3 | Eggs, Bread, Butter
4 | Wine, Crackers, Bread
5 | Cheese, Crackers, Bread

Here, the similarity between cheese and wine will largely be driven by the presence of crackers rather than bread because bread is mostly noise (and empty calories). Again, this is because bread is in every basket and, therefore, its presence isn’t a good predictor of what other items will be in the basket.
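
To make that intuition concrete, here's a minimal, self-contained Python sketch that computes plain Jaccard similarity over the toy baskets above. It's the unweighted version of the metric nodeSimilarity uses by default; the real run below also takes the co-purchase weights into account:

from itertools import combinations

# Toy market baskets from the example above
baskets = [
    {"Eggs", "Bread", "Milk"},
    {"Bread", "Milk", "Butter"},
    {"Eggs", "Bread", "Butter"},
    {"Wine", "Crackers", "Bread"},
    {"Cheese", "Crackers", "Bread"},
]

# Each product's co-purchase neighborhood: the products it shares a basket with
neighbors = {}
for basket in baskets:
    for a, b in combinations(basket, 2):
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

def jaccard(p1, p2):
    shared = neighbors[p1] & neighbors[p2]
    combined = neighbors[p1] | neighbors[p2]
    return len(shared) / len(combined)

print(jaccard("Eggs", "Butter"))   # 0.5: driven by the shared neighbors Bread and Milk
print(jaccard("Wine", "Cheese"))   # 1.0: identical neighborhoods (Crackers and Bread)
print(jaccard("Bread", "Wine"))    # ~0.14: Bread neighbors nearly everything, so the union is large

Notice that bread scores poorly against any single item: because it appears in every basket, its neighborhood is huge, and the denominator swamps whatever overlap exists. That's exactly the effect we'll see with bananas in the real data.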

Let’s run the algorithm:

result = gds.nodeSimilarity.write(
    copurchase_graph,
    topK=10,
    relationshipWeightProperty="weight",
    writeRelationshipType="SIMILAR_TO",
    writeProperty="score"
)

Then we look at our results, starting with the least similar pairs:

query = """
MATCH (p1:Product)-[r:SIMILAR_TO]->(p2:Product)
RETURN p1.product_id AS product1_id, p1.name AS product1_name,
       p2.product_id AS product2_id, p2.name AS product2_name,
       r.score AS similarity_score
ORDER BY similarity_score 
LIMIT 10
"""

gds.run_cypher(query)
product1_id | product1_name | product2_id | product2_name | similarity_score
41162 | Grapes Certified Organic California Black Seed… | 13176 | Bag of Organic Bananas | 0.000856
41605 | Chocolate Bar Milk Stevia Sweetened Salted Almond | 13176 | Bag of Organic Bananas | 0.000856
31478 | DairyFree Cheddar Style Wedges | 13176 | Bag of Organic Bananas | 0.000856
46900 | Organic Chicken Noodle Soup | 24852 | Banana | 0.000887
45265 | Raspberry on the Bottom NonFat Greek Yogurt | 24852 | Banana | 0.000887

We calculated some similarity scores, but what exactly do they mean? Under the hood, nodeSimilarity computes a Jaccard-style ratio: the numerator captures how much two products’ co-purchase neighborhoods overlap, and the denominator measures how large those neighborhoods are overall. The result shows how strongly two items are connected once popularity is factored out: higher scores mean they’re often bought together, while lower scores mean the connection is weak or incidental.
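
If you want to see the arithmetic, here's a rough sketch of the weighted Jaccard variant that nodeSimilarity applies when relationshipWeightProperty is set: add up the smaller of the two weights for every neighbor the products share, then divide by the sum of the larger weights across the union of their neighbors. The product names and counts below are made up purely for illustration:

def weighted_jaccard(weights_a, weights_b):
    # weights_a and weights_b map a neighboring product to its co-purchase weight
    union = set(weights_a) | set(weights_b)
    numerator = sum(min(weights_a.get(n, 0.0), weights_b.get(n, 0.0)) for n in union)
    denominator = sum(max(weights_a.get(n, 0.0), weights_b.get(n, 0.0)) for n in union)
    return numerator / denominator if denominator else 0.0

# Hypothetical co-purchase neighborhoods for two products
granola = {"Banana": 20.0, "Strawberries": 3.0, "Avocado": 2.0}
yogurt = {"Banana": 18.0, "Strawberries": 4.0, "Spinach": 1.0}

print(weighted_jaccard(granola, yogurt))  # (18 + 3) / (20 + 4 + 2 + 1) ≈ 0.78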

In our actual dataset, you’ll notice that bananas are co-purchased with nearly everything — and for that reason, they mostly represent noise.

In fact, if you look at the least similar items, bananas often top the list. Why? Because their presence in a basket doesn’t really signal that the customer wants any other specific item — they’re just a frequent, general-purpose purchase.

So while bananas are everywhere, they tell us almost nothing about co-purchase patterns, and nodeSimilarity correctly learns to downweight them. Plus, since bananas are in nearly every shopping cart, our theoretical customer was likely to buy them regardless of whether we recommended them. That's why nodeSimilarity provides real value over simply looking at which items are most frequently co-purchased.

Summary

At this point, we’ve seen what makes a bad recommendation, but what makes a good one? Let’s look at which coupons our system would suggest if a customer bought Peanut Butter Cereal (34), a Bag of Organic Bananas (13176), and Cauliflower (5618).

A weaker model might recommend something like Organic Strawberries, simply because they frequently appear alongside bananas. But a graph-based approach looks deeper. It recognizes that the similarity score for strawberries is driven by a universally popular item — bananas — which doesn’t tell us much about this specific shopper.

Instead, the algorithm surfaces Organic Pepper Jack Cheese, a connection rooted in Cauliflower, an item that’s more distinctive to our customer’s preferences. In other words, node similarity filters out noisy, generic associations (like “bananas go with everything”) and highlights patterns that are more meaningful and personalized.

query = """
MATCH (p1:Product)-[r:SIMILAR_TO]->(p2:Product)
WHERE p1.product_id IN [34, 13176, 5618]
WITH p1, p2, r
ORDER BY p1.product_id, r.score DESC
WITH p1, collect({
    product2_id: p2.product_id,
    product2_name: p2.name,
    similarity_score: r.score
})[0..3] AS top_similar
UNWIND top_similar AS similar
RETURN p1.product_id AS product1_id,
       p1.name AS product1_name,
       similar.product2_id AS product2_id,
       similar.product2_name AS product2_name,
       similar.similarity_score AS similarity_score
ORDER BY p1.product_id, similarity_score DESC
"""
result = gds.run_cypher(query)
result
product1_id | product1_name | product2_id | product2_name | similarity_score
34 | Peanut Butter Cereal | 148 | Nectarines | 0.416667
34 | Peanut Butter Cereal | 148 | Nectarines | 0.416667
34 | Peanut Butter Cereal | 148 | Nectarines | 0.416667
5618 | Cauliflower | 4872 | Organic Pepper Jack Cheese | 0.973684
5618 | Cauliflower | 4872 | Organic Pepper Jack Cheese | 0.973684
5618 | Cauliflower | 4872 | Organic Pepper Jack Cheese | 0.973684
13176 | Bag of Organic Bananas | 21137 | Organic Strawberries | 0.256920
13176 | Bag of Organic Bananas | 21137 | Organic Strawberries | 0.256920
13176 | Bag of Organic Bananas | 21137 | Organic Strawberries | 0.256920

Finally, we end our session:

sessions.delete(session_name=session_name)
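
Because nodeSimilarity wrote the SIMILAR_TO relationships back into AuraDB, the recommendations outlive the Graph Analytics session. As a rough sketch (assuming the neo4j driver is installed via pip install neo4j, and reusing the example basket IDs from above), here's one way you might pull coupon candidates for a basket while skipping items the customer already has:

from neo4j import GraphDatabase

basket = [34, 13176, 5618]  # Peanut Butter Cereal, Bag of Organic Bananas, Cauliflower

recommendation_query = """
MATCH (p:Product)-[r:SIMILAR_TO]->(rec:Product)
WHERE p.product_id IN $basket
  AND NOT rec.product_id IN $basket
RETURN rec.name AS recommendation, max(r.score) AS score
ORDER BY score DESC
LIMIT 5
"""

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD)) as driver:
    records, _, _ = driver.execute_query(recommendation_query, basket=basket)
    for record in records:
        print(record["recommendation"], round(record["score"], 3))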

With that, you’ve learned how to build a better recommendation, one that goes deeper than simply looking at which items are frequently purchased together. From this, you can recommend interesting and unique products your customers really want to buy.
