End-to-end workflow

A realistic end-to-end workflow typically involves running multiple algorithms in sequence. This example shows how to build a simple product recommendation engine based on collaborative filtering, with the following steps:

  • Create a graph of products and customers.

  • Compute and add node embeddings to the graph with the FastRP algorithm.

  • Compute similarity scores between customers with the k-Nearest Neighbors (kNN) algorithm, based on the node embeddings.

  • Find similar customers and recommend products using Cypher queries.

Create the graph

The following Cypher query creates an example graph of products and customers in the Neo4j database. The amount relationship property represents the average weekly amount of money spent by a customer on a given product.

CREATE
  (dan:Person {name: 'Dan'}),
  (annie:Person {name: 'Annie'}),
  (matt:Person {name: 'Matt'}),
  (jeff:Person {name: 'Jeff'}),
  (brie:Person {name: 'Brie'}),
  (elsa:Person {name: 'Elsa'}),

  (cookies:Product {name: 'Cookies'}),
  (tomatoes:Product {name: 'Tomatoes'}),
  (cucumber:Product {name: 'Cucumber'}),
  (celery:Product {name: 'Celery'}),
  (kale:Product {name: 'Kale'}),
  (milk:Product {name: 'Milk'}),
  (chocolate:Product {name: 'Chocolate'}),

  (dan)-[:BUYS {amount: 1.2}]->(cookies),
  (dan)-[:BUYS {amount: 3.2}]->(milk),
  (dan)-[:BUYS {amount: 2.2}]->(chocolate),

  (annie)-[:BUYS {amount: 1.2}]->(cucumber),
  (annie)-[:BUYS {amount: 3.2}]->(milk),
  (annie)-[:BUYS {amount: 3.2}]->(tomatoes),

  (matt)-[:BUYS {amount: 3}]->(tomatoes),
  (matt)-[:BUYS {amount: 2}]->(kale),
  (matt)-[:BUYS {amount: 1}]->(cucumber),

  (jeff)-[:BUYS {amount: 3}]->(cookies),
  (jeff)-[:BUYS {amount: 2}]->(milk),

  (brie)-[:BUYS {amount: 1}]->(tomatoes),
  (brie)-[:BUYS {amount: 2}]->(milk),
  (brie)-[:BUYS {amount: 2}]->(kale),
  (brie)-[:BUYS {amount: 3}]->(cucumber),
  (brie)-[:BUYS {amount: 0.3}]->(celery),

  (elsa)-[:BUYS {amount: 3}]->(chocolate),
  (elsa)-[:BUYS {amount: 3}]->(milk)

The graph looks as follows:

[Figure: visualization of the example graph]

The next query creates an in-memory graph called purchases from the Neo4j graph. The only difference from the original data is that the orientation of the :BUYS relationships is discarded, because undirected relationships are the recommended choice for the FastRP algorithm.

MATCH (source:Person)-[r:BUYS]->(target:Product)
RETURN gds.graph.project(
  'purchases',
  source,
  target,
  {
    sourceNodeLabels: labels(source),
    targetNodeLabels: labels(target),
    relationshipType: 'BUYS',
    relationshipProperties: r { .amount }
  },
  { undirectedRelationshipTypes: ['BUYS'] }
)
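
As a quick sanity check (not part of the original workflow), you can list the newly projected graph and inspect its node and relationship counts with gds.graph.list:

CALL gds.graph.list('purchases')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount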

Add embeddings to the graph

Node embeddings are commonly used to capture topological information from the graph for further processing, for example as input to another algorithm.

GDS offers several algorithms to compute embeddings, and FastRP is a good default choice to start with. Since the embeddings must be available to the kNN algorithm later on, FastRP must run in mutate mode to add them to the purchases graph.

CALL gds.fastRP.mutate(  (1)
  'purchases',  (2)
  {  (3)
    embeddingDimension: 4,
    iterationWeights: [0.8, 1, 1, 1],
    relationshipWeightProperty: 'amount',
    randomSeed: 42,
    mutateProperty: 'embedding'
  }
)
YIELD nodePropertiesWritten
1 gds.fastRP algorithm running in mutate mode.
2 Name of the projected graph to run the algorithm on and add the new node property to.
3 Configuration parameters listed in the Syntax section of the algorithm (Mutate mode panel). Here, the embeddingDimension is set to 4 because the graph is small, the iterationWeights are chosen empirically to yield sensible results, and the relationshipWeightProperty is set to compute weighted averages of the neighboring embeddings. The randomSeed is added to obtain the same results for every run, but is not necessary for the actual computation. The mutateProperty is the new node property that will contain the node embedding.
Query result
nodePropertiesWritten
13

You can run the algorithm in stream mode as in the Basic workflow example by using the corresponding gds.fastRP.stream procedure and removing the mutateProperty from the configuration parameters.
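
Such a stream-mode call could look like the following sketch (the same configuration without mutateProperty; gds.util.asNode is used to display node names):

CALL gds.fastRP.stream(
  'purchases',
  {
    embeddingDimension: 4,
    iterationWeights: [0.8, 1, 1, 1],
    relationshipWeightProperty: 'amount',
    randomSeed: 42
  }
)
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS name, embedding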

Compute and write similarities

With the embeddings available as the new embedding node property, you can run the kNN algorithm to compute a similarity score between pairs of nodes. Run the kNN algorithm in write mode to write the new relationships and their score property to the Neo4j database, where they can be used in Cypher queries.

CALL gds.knn.write(  (1)
  'purchases',  (2)
  {  (3)
    nodeProperties: ['embedding'],
    nodeLabels: ['Person'],
    topK: 2,
    sampleRate: 1.0,
    deltaThreshold: 0.0,
    randomSeed: 42,
    concurrency: 1,
    writeProperty: 'score',
    writeRelationshipType: 'SIMILAR'
  }
)
YIELD similarityDistribution
RETURN similarityDistribution.mean AS meanSimilarity  (4)
1 gds.knn algorithm running in write mode.
2 Name of the projected graph to run the algorithm on. The write mode does not update the in-memory graph.
3 Configuration parameters listed in the Syntax section of the algorithm (Write mode panel). The nodeLabels option is set to ['Person'] because we are interested only in Person-Person similarities and do not want to mix them with, for example, Person-Product similarities. Here, topK is set to 2 to select only the two closest targets of a source node, while the sampleRate and the deltaThreshold are set to 1 and 0 respectively because the graph is small. The concurrency and the randomSeed are set to obtain the same results for every run, but are not necessary for the actual computation. The two write properties are used to write a new :SIMILAR relationship with a score property containing the similarity score between two nodes.
4 mean is one of the fields of the returned similarityDistribution map.
Query result
meanSimilarity
0.8800284068

The mean similarity between nodes is high. This is because the similarity distribution is computed after selecting the topK=2 most similar targets per source node. Moreover, the graph has two clusters of users with quite similar purchases: (Brie, Matt and Annie) and (Elsa, Dan and Jeff). Using a higher value for topK would yield a lower average similarity.
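
To check this without writing anything to the database, you can run kNN in stats mode with a larger topK, as in the following sketch (the topK value of 4 is an arbitrary choice for illustration):

CALL gds.knn.stats(
  'purchases',
  {
    nodeProperties: ['embedding'],
    nodeLabels: ['Person'],
    topK: 4,
    sampleRate: 1.0,
    deltaThreshold: 0.0,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD similarityDistribution
RETURN similarityDistribution.mean AS meanSimilarity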

If we wanted to examine Person-Product similarities instead, we could use the Filtered K-Nearest Neighbors algorithm, as sketched below.
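
The following is a sketch only; it assumes the gds.knn.filtered.stream procedure with its sourceNodeFilter and targetNodeFilter options, so check the Filtered kNN syntax section for the exact signature in your GDS version:

CALL gds.knn.filtered.stream(
  'purchases',
  {
    nodeProperties: ['embedding'],
    topK: 2,
    sourceNodeFilter: 'Person',
    targetNodeFilter: 'Product'
  }
)
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS person,
       gds.util.asNode(node2).name AS product,
       similarity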

Find the most similar nodes

After writing the similarity relationships to Neo4j, you can use Cypher to find pairs of customers and rank them by their similarity score.

MATCH (n:Person)-[r:SIMILAR]->(m:Person)
RETURN n.name AS person1, m.name AS person2, r.score AS similarity
ORDER BY similarity DESCENDING, person1, person2
Query result
person1   person2   similarity
"Dan"     "Elsa"    0.9866833091
"Elsa"    "Dan"     0.9866833091
"Brie"    "Matt"    0.9740184546
"Matt"    "Brie"    0.9740184546
"Annie"   "Matt"    0.9724045992
"Matt"    "Annie"   0.9724045992
"Annie"   "Brie"    0.9154552221
"Brie"    "Annie"   0.9154552221
"Jeff"    "Annie"   0.8667784333
"Jeff"    "Matt"    0.7591181397
"Dan"     "Jeff"    0.6660436392
"Elsa"    "Jeff"    0.5712890029

The query result shows that the nodes named "Dan" and "Elsa" are very similar. Dan is connected to three :Product nodes and Elsa to two, and two of those are shared (the nodes named "Milk" and "Chocolate") with similar amounts. The "Cookies" Product is purchased only by Dan, and with a lower amount; its embedding also retains a certain degree of similarity to those of the other products due to its proximity in the graph, and partly due to randomness.
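
You can verify the overlap directly with a plain Cypher query over the original data:

MATCH (:Person {name: 'Dan'})-[r1:BUYS]->(p:Product)<-[r2:BUYS]-(:Person {name: 'Elsa'})
RETURN p.name AS product, r1.amount AS danAmount, r2.amount AS elsaAmount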

Make recommendations

The basic assumption of collaborative filtering is that products purchased by a customer may be of interest to a similar customer who does not already buy them. Knowing that "Annie" and "Matt" are similar, you can make product recommendations for each of them with a Cypher query.

MATCH (:Person {name: "Annie"})-->(p1:Product)
WITH collect(p1) AS products
MATCH (:Person {name: "Matt"})-->(p2:Product)
WHERE NOT p2 IN products
RETURN p2.name AS recommendation
Query result
recommendation
"Kale"

The query collects the products that "Annie" buys, then selects the products that "Matt" buys but "Annie" does not buy yet. The resulting node "Kale" is therefore the recommended product for "Annie".
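
Because the :SIMILAR relationships are now stored in the database, the same idea can be generalized to every customer in a single query, as in the following sketch:

MATCH (p:Person)-[:SIMILAR]->(:Person)-[:BUYS]->(prod:Product)
WHERE NOT (p)-[:BUYS]->(prod)
RETURN p.name AS person, collect(DISTINCT prod.name) AS recommendations
ORDER BY person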

Next steps

Although this is a simplified example, the general workflow is similar for many GDS use cases.

Try a different embedding algorithm or a different configuration for the kNN algorithm. For example, you can try increasing the topK parameter to find more similar customers for each customer.
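
Finally, when you are done experimenting, you can release the in-memory graph:

CALL gds.graph.drop('purchases')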