Tutorial: Applied Graph Embeddings
This guide provides a hands on walk through of the node2Vec graph embedding algorithm in the Neo4j Data Science Library.
Please have Neo4j (version 4.0 or later) and Graph Data Science Library (version 1.3 or later) downloaded and installed to use graph embeddings. You will also need to have Python installed to follow the second half of this guide.
Intermediate
Graph embeddings were introduced in version 1.3 of the Graph Data Science Library (GDSL). They can be used to create a fixed size vector representation for nodes in a graph. In this guide we’ll learn how to use these algorithms to generate embeddings and how to interpret them using visualization techniques.
The code examples used in this guide can be found in the neo4j-examples/graph-embeddings GitHub repository. For background reading about graph embeddings, see the Graph Embeddings Developer Guide. |
European Roads dataset
We’re going to use a dataset of European Roads compiled by Lasse Westh-Nielsen and described in more detail in his blog post. The dataset contains 894 towns and 1,250 roads connecting them. We also have the country to which a town belongs. Instructions for importing the data are in the project repository.
We can see the schema in the diagram below:
This diagram was generated by running CALL db.schema.visualization()
in the Neo4j Browser after importing the data.
In the next section we’re going to run graph embeddings over the towns and roads to generate a vector representation for each town.
Running graph embeddings
We’re going to use the node2vec algorithm, which is one of 3 embedding algorithms available in the Graph Data Science Library. node2vec creates embeddings based on biased random walks of a node’s neighborhood.
We’re going to first run the streaming version of this procedure, which returns a stream of node ids and embeddings. The algorithm has the following mandatory config:
nodeProjection
-
the node labels to use for our projected graph
relationshipProjection
-
the relationship types to use for our projected graph
embeddingSize
-
the size of the vector/list of numbers to create for each node
iterations
-
the number of iterations to run
We’ll also specify walkLength
, which defines the number of steps in each random walk.
The default value is 80, which feels too big for such a small graph.
We can run the algorithm with the following query:
CALL gds.alpha.node2vec.stream({
nodeProjection: "Place",
relationshipProjection: {
eroad: {
type: "EROAD",
orientation: "UNDIRECTED"
}
},
embeddingSize: 10,
iterations: 10,
walkLength: 10
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS place, embedding
LIMIT 5;
In relationshipProjection
, we specify orientation: "UNDIRECTED"
so that the direction of the EROAD
relationship type is ignored on the projected graph that the algorithm runs against.
If we run the query, it will return the following output:
place | embedding |
---|---|
"Larne" |
[2.1327764987945557, 1.0994384288787842, 0.49434158205986023, 2.040088176727295, 1.0031780004501343, 0.873113751411438, -2.413508176803589, -2.3154311180114746, 0.9832170009613037, -2.1525325775146484] |
"Belfast" |
[2.593827962875366, 0.7630942463874817, 0.7039147019386292, 2.3797214031219482, 0.7776350378990173, 0.5327662825584412, -2.3597097396850586, -1.9403077363967896, 1.2757759094238281, -1.681670904159546] |
"Dublin" |
[1.9023665189743042, 0.795767068862915, 0.722218930721283, 2.71242094039917, 0.21453168988227844, 0.3935716152191162, -2.1960527896881104, -2.6851491928100586, 1.0708848237991333, -0.6451507806777954] |
"Wexford" |
[2.0736780166625977, 1.1650514602661133, 0.4161956012248993, 2.8500888347625732, -0.12804043292999268, 0.355782151222229, -2.719728946685791, -2.6983509063720703, 0.5993242859840393, 0.265157550573349] |
"Rosslare" |
[1.8750609159469604, 1.2445515394210815, 0.630532443523407, 2.4588329792022705, 0.1135958656668663, 0.007978626526892185, -2.7186481952667236, -2.418004035949707, 0.3507269024848938, 0.9638977646827698] |
This procedure is non deterministic, so we’ll get different results each time that we run it. |
Everything looks fine so far, we’ve been successful in returning embeddings for each node.
Further exploration will be easier if we store the embeddings in Neo4j, so we’re going to do that using the write version of the procedure. We can store the embeddings by running the following query:
CALL gds.alpha.node2vec.write({
nodeProjection: "Place",
relationshipProjection: {
eroad: {
type: "EROAD",
orientation: "UNDIRECTED"
}
},
embeddingSize: 10,
iterations: 10,
walkLength: 10,
writeProperty: "embeddingNode2vec"
});
nodeCount | nodePropertiesWritten | createMillis | computeMillis | writeMillis | configuration |
---|---|---|---|---|---|
894 |
894 |
54 |
4388 |
57 |
{initialLearningRate: 0.025, writeConcurrency: 4, negativeSamplingRate: 5, walksPerNode: 10, centerSamplingFactor: 0.001, iterations: 10, returnFactor: 1.0, concurrency: 4, walkLength: 10, windowSize: 10, writeProperty: "embeddingNode2vec", inOutFactor: 1.0, contextSamplingExponent: 0.75, embeddingSize: 10, nodeLabels: [""], sudo: FALSE, minLearningRate: 1.0E-4, relationshipTypes: [""], walkBufferSize: 1000} |
In the next section we’re going to explore these graph embeddings using visualization techniques.
Visualizing graph embeddings
We’re now going to explore the graph embeddings using the Python programming language, the Neo4j Python driver, and some popular Data Science libraries. We’ll create a scatterplot of the embedding and we want to see whether it’s possible to work out which town a country belongs to by looking at its embedding.
The code examples used in this section are available in Jupyter notebook form in the project repository. |
The required libraries can be installed by running the following command:
pip install neo4j sklearn altair
Let’s create a file called roads.py
and paste the following statements:
from neo4j import GraphDatabase
from sklearn.manifold import TSNE
import numpy as np
import altair as alt
import pandas as pd
driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", "neo"))
The first few lines import the required library and the last line creates a connection to the Neo4j database. You’ll need to change the Bolt URL and credentials to match that of your own database.
We’re going to use the driver to execute a Cypher query that returns the embedding for towns in the most popular countries, which are Spain, Great Britain, France, Turkey, Italy, Germany, and Greece. Restricting the number of countries will make it easier to detect any patterns once we start visualizing the data. Once the query has run, we’ll convert the results into a Pandas data frame:
with driver.session(database="neo4j") as session:
result = session.run("""
MATCH (p:Place)-[:IN_COUNTRY]->(country)
WHERE country.code IN $countries
RETURN p.name AS place, p.embeddingNode2vec AS embedding, country.code AS country
""", {"countries": ["E", "GB", "F", "TR", "I", "D", "GR"]})
X = pd.DataFrame([dict(record) for record in result])
Now we’re ready to start analyzing the data.
At the moment our embeddings are of size 10, but we need them to be of size 2 so that we can visualize them in 2 dimensions. The t-SNE algorithm is a dimensionality reduction technique that reduces high dimensionality objects to 2 or 3 dimensions so that they can be better visualized. We’re going to use it to create x and y coordinates for each embedding.
The following code snippet applies t-SNE to the embeddings and then creates a data frame containing each place, its country, as well as x and y coordinates.
X_embedded = TSNE(n_components=2, random_state=6).fit_transform(list(X.embedding))
places = X.place
df = pd.DataFrame(data = {
"place": places,
"country": X.country,
"x": [value[0] for value in X_embedded],
"y": [value[1] for value in X_embedded]
})
The content of the data frame is as follows:
place | country | x | y |
---|---|---|---|
Larne |
GB |
23.597162 |
-3.478853 |
Belfast |
GB |
23.132071 |
-4.331254 |
La Coruña |
E |
-6.959006 |
7.212301 |
Pontevedra |
E |
-6.563524 |
7.505499 |
Huelva |
E |
-11.583806 |
11.094340 |
We can run the following code to create a scatterplot of our embeddings:
alt.Chart(df).mark_circle(size=60).encode(
x='x',
y='y',
color='country',
tooltip=['place', 'country']
).properties(width=700, height=400)
From a quick visual inspection of this chart we can see that the embeddings seem to have clustered by country.
Next Steps
Visualizing embeddings is often only an intermediate step in our analysis. If we’re satisfied with the quality of the embeddings, we can use them for other tasks as well. The following are examples of other tasks that we can do with our embeddings:
-
Cluster nodes based on the similarity of their embeddings using a k-means clustering algorithm
-
Predict the country of town by using a nearest neighbors algorithm that takes embeddings as input
-
Use the embeddings as features for a machine learning algorithm
Was this page helpful?
User Contributed Notes