Tutorial: Applied Graph Embeddings

Goals
This guide provides a hands-on walkthrough of the node2vec graph embedding algorithm in the Neo4j Graph Data Science Library.
Prerequisites
To use graph embeddings you will need Neo4j (version 4.0 or later) and the Graph Data Science Library (version 1.3 or later) downloaded and installed. You will also need Python installed to follow the second half of this guide.

Graph embeddings were introduced in version 1.3 of the Graph Data Science Library (GDSL). They can be used to create a fixed-size vector representation for each node in a graph. In this guide we’ll learn how to use these algorithms to generate embeddings and how to interpret them using visualization techniques.

The code examples used in this guide can be found in the neo4j-examples/graph-embeddings GitHub repository. For background reading about graph embeddings, see the Graph Embeddings Developer Guide.

European Roads dataset

We’re going to use a dataset of European Roads compiled by Lasse Westh-Nielsen and described in more detail in his blog post. The dataset contains 894 towns and 1,250 roads connecting them. We also have the country to which a town belongs. Instructions for importing the data are in the project repository.

We can see the schema in the diagram below:

Figure 1. European Roads Graph Schema

This diagram was generated by running CALL db.schema.visualization() in the Neo4j Browser after importing the data.

In the next section we’re going to run graph embeddings over the towns and roads to generate a vector representation for each town.

Running graph embeddings

We’re going to use the node2vec algorithm, which is one of 3 embedding algorithms available in the Graph Data Science Library. node2vec creates embeddings based on biased random walks of a node’s neighborhood.

We’ll first run the streaming version of this procedure, which returns a stream of node ids and embeddings. The algorithm has the following mandatory configuration parameters:

nodeProjection: the node labels to use for our projected graph

relationshipProjection: the relationship types to use for our projected graph

embeddingSize: the size of the vector/list of numbers to create for each node

iterations: the number of iterations to run

We’ll also specify walkLength, which defines the number of steps in each random walk. The default value is 80, which feels too big for such a small graph.

We can run the algorithm with the following query:

CALL gds.alpha.node2vec.stream({
  nodeProjection: "Place",
  relationshipProjection: {
    eroad: {
      type: "EROAD",
      orientation: "UNDIRECTED"
    }
  },
  embeddingSize: 10,
  iterations: 10,
  walkLength: 10
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS place, embedding
LIMIT 5;

In relationshipProjection, we specify orientation: "UNDIRECTED" so that the direction of the EROAD relationship type is ignored on the projected graph that the algorithm runs against.

If we run the query, it will return the following output:

Table 1. Results
place embedding

"Larne"

[2.1327764987945557, 1.0994384288787842, 0.49434158205986023, 2.040088176727295, 1.0031780004501343, 0.873113751411438, -2.413508176803589, -2.3154311180114746, 0.9832170009613037, -2.1525325775146484]

"Belfast"

[2.593827962875366, 0.7630942463874817, 0.7039147019386292, 2.3797214031219482, 0.7776350378990173, 0.5327662825584412, -2.3597097396850586, -1.9403077363967896, 1.2757759094238281, -1.681670904159546]

"Dublin"

[1.9023665189743042, 0.795767068862915, 0.722218930721283, 2.71242094039917, 0.21453168988227844, 0.3935716152191162, -2.1960527896881104, -2.6851491928100586, 1.0708848237991333, -0.6451507806777954]

"Wexford"

[2.0736780166625977, 1.1650514602661133, 0.4161956012248993, 2.8500888347625732, -0.12804043292999268, 0.355782151222229, -2.719728946685791, -2.6983509063720703, 0.5993242859840393, 0.265157550573349]

"Rosslare"

[1.8750609159469604, 1.2445515394210815, 0.630532443523407, 2.4588329792022705, 0.1135958656668663, 0.007978626526892185, -2.7186481952667236, -2.418004035949707, 0.3507269024848938, 0.9638977646827698]

This procedure is non-deterministic, so we’ll get different results each time we run it.

Everything looks fine so far: we’ve successfully returned an embedding for each node.

Further exploration will be easier if we store the embeddings in Neo4j, so we’re going to do that using the write version of the procedure. We can store the embeddings by running the following query:

CALL gds.alpha.node2vec.write({
  nodeProjection: "Place",
  relationshipProjection: {
    eroad: {
      type: "EROAD",
      orientation: "UNDIRECTED"
    }
  },
  embeddingSize: 10,
  iterations: 10,
  walkLength: 10,
  writeProperty: "embeddingNode2vec"
});
Table 2. Results

nodeCount: 894
nodePropertiesWritten: 894
createMillis: 54
computeMillis: 4388
writeMillis: 57
configuration:

{initialLearningRate: 0.025, writeConcurrency: 4, negativeSamplingRate: 5, walksPerNode: 10, centerSamplingFactor: 0.001, iterations: 10, returnFactor: 1.0, concurrency: 4, walkLength: 10, windowSize: 10, writeProperty: "embeddingNode2vec", inOutFactor: 1.0, contextSamplingExponent: 0.75, embeddingSize: 10, nodeLabels: [""], sudo: FALSE, minLearningRate: 1.0E-4, relationshipTypes: [""], walkBufferSize: 1000}

In the next section we’re going to explore these graph embeddings using visualization techniques.

Visualizing graph embeddings

We’re now going to explore the graph embeddings using the Python programming language, the Neo4j Python driver, and some popular data science libraries. We’ll create a scatterplot of the embeddings, and we want to see whether it’s possible to work out which country a town belongs to by looking at its embedding.

The code examples used in this section are available in Jupyter notebook form in the project repository.

The required libraries can be installed by running the following command:

pip install neo4j scikit-learn altair

Let’s create a file called roads.py and paste the following statements:

from neo4j import GraphDatabase
from sklearn.manifold import TSNE
import numpy as np
import altair as alt
import pandas as pd

driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", "neo"))

The first few lines import the required libraries, and the last line creates a connection to the Neo4j database. You’ll need to change the Bolt URL and credentials to match those of your own database.
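Before going any further, it’s worth confirming that the driver can actually reach the database. This quick check isn’t part of the original guide, but running a trivial query like the one below should print 1 if the URL and credentials are correct:

# Optional sanity check (not part of the original guide): run a trivial
# query to confirm that the connection and credentials work.
with driver.session(database="neo4j") as session:
    print(session.run("RETURN 1 AS ok").single()["ok"])  # expect: 1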

We’re going to use the driver to execute a Cypher query that returns the embeddings of towns in the most popular countries, which are Spain, Great Britain, France, Turkey, Italy, Germany, and Greece. Restricting the number of countries will make it easier to detect any patterns once we start visualizing the data. Once the query has run, we’ll convert the results into a pandas DataFrame:

with driver.session(database="neo4j") as session:
    result = session.run("""
    MATCH (p:Place)-[:IN_COUNTRY]->(country)
    WHERE country.code IN $countries
    RETURN p.name AS place, p.embeddingNode2vec AS embedding, country.code AS country
    """, {"countries": ["E", "GB", "F", "TR", "I", "D", "GR"]})
    X = pd.DataFrame([dict(record) for record in result])

Now we’re ready to start analyzing the data.
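As a quick check (this step isn’t in the original notebook), we can confirm how many towns came back and that each embedding has the 10 dimensions we requested via embeddingSize:

# Quick inspection of the query results (illustrative, not from the original guide)
print(len(X))                    # number of towns returned
print(len(X.embedding.iloc[0]))  # should be 10, matching embeddingSize
print(X.country.value_counts())  # towns per country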

At the moment our embeddings have 10 dimensions, but we need them to have only 2 so that we can visualize them on a 2-dimensional chart. The t-SNE algorithm is a dimensionality reduction technique that reduces high-dimensional data to 2 or 3 dimensions so that it can be more easily visualized. We’re going to use it to create x and y coordinates for each embedding.

The following code snippet applies t-SNE to the embeddings and then creates a data frame containing each place, its country, and its x and y coordinates.

X_embedded = TSNE(n_components=2, random_state=6).fit_transform(list(X.embedding))

places = X.place
df = pd.DataFrame(data = {
    "place": places,
    "country": X.country,
    "x": [value[0] for value in X_embedded],
    "y": [value[1] for value in X_embedded]
})

The content of the data frame is as follows:

Table 3. Results

place | country | x | y
Larne | GB | 23.597162 | -3.478853
Belfast | GB | 23.132071 | -4.331254
La Coruña | E | -6.959006 | 7.212301
Pontevedra | E | -6.563524 | 7.505499
Huelva | E | -11.583806 | 11.094340
We can run the following code to create a scatterplot of our embeddings:

alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400)
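
If you’re running roads.py as a script rather than in a Jupyter notebook, the chart won’t display on its own. One option, not covered in the original guide, is to save it to an HTML file and open that in a browser:

# Assign the chart to a variable and save it as a standalone HTML file
# (illustrative addition; the file name is arbitrary).
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400)

chart.save("embeddings.html")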

From a quick visual inspection of this chart we can see that the embeddings seem to have clustered by country.

Next Steps

Visualizing embeddings is often only an intermediate step in our analysis. If we’re satisfied with the quality of the embeddings, we can use them for other tasks as well. For example, we could:

  • Cluster nodes based on the similarity of their embeddings using a k-means clustering algorithm (see the sketch after this list)

  • Predict the country of a town by using a nearest neighbors algorithm that takes embeddings as input

  • Use the embeddings as features for a machine learning algorithm
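
As an example of the first of these tasks, below is a minimal sketch of k-means clustering applied to the X data frame we built earlier. The choice of 7 clusters (one per selected country) and the fixed random_state are illustrative assumptions rather than part of the original guide:

# Minimal k-means sketch (assumes the X data frame from the earlier query).
# n_clusters=7 mirrors the number of selected countries; random_state is arbitrary.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=7, random_state=42).fit(list(X.embedding))
X["cluster"] = kmeans.labels_

# Cross-tabulate clusters against countries to see how well they line up
print(pd.crosstab(X.cluster, X.country))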
