About this module

In this module you will learn how to build a Machine Learning classifier to predict co-authorships in the citation graph.

At the end of this module, you should be able to:

  • Describe what link prediction is.

  • Use the link prediction graph algorithms in Neo4j.

  • Understand the challenges when building Machine Learning models on graph data.

  • Build a link prediction classifier using scikit-learn with features derived from the Neo4j Graph Data Science library.

Link Prediction has been around for a long time, but it was popularised by a paper written by Jon Kleinberg and David Liben-Nowell in 2004, titled “The Link Prediction Problem for Social Networks”.

Link Prediction

Kleinberg and Liben-Nowell approached this problem from the perspective of social networks, asking this question:

“Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the Link Prediction problem, and develop approaches to Link Prediction based on measures for analyzing the ‘proximity’ of nodes in a network.”

For example, we could predict future associations between:

  • People in a terrorist network.

  • Molecules in a biology network.

  • Potential co-authorships in a citation network.

  • Interest in an artist or artwork.

In each of these examples, predicting a link means that we are predicting some future behaviour. For example, in a citation network, we’re actually predicting the action of two people collaborating on a paper.

Kleinberg and Liben-Nowell describe a set of methods that can be used for Link Prediction. These methods compute a score for a pair of nodes, where the score could be considered a measure of proximity or “similarity” between those nodes based on the graph topology. The closer two nodes are, the more likely there will be a relationship between them.
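
As a rough illustration of what such scores look like, the sketch below computes three common link prediction measures (Common Neighbors, Preferential Attachment and Adamic Adar) in plain Python. The toy graph, the author names and the function names are assumptions made for this example; in this module the scores come from the Neo4j Graph Data Science library rather than from code like this.

```python
import math

# Hypothetical toy co-authorship graph: author -> set of co-authors.
GRAPH = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}


def common_neighbors(graph, a, b):
    """Number of neighbours shared by a and b."""
    return len(graph[a] & graph[b])


def preferential_attachment(graph, a, b):
    """Product of the degrees of a and b."""
    return len(graph[a]) * len(graph[b])


def adamic_adar(graph, a, b):
    """Sum of 1 / log(degree) over shared neighbours.

    Shared neighbours with few connections contribute more to the score.
    """
    return sum(1.0 / math.log(len(graph[n])) for n in graph[a] & graph[b])


for pair in [("alice", "dave"), ("alice", "erin")]:
    print(
        pair,
        common_neighbors(GRAPH, *pair),
        preferential_attachment(GRAPH, *pair),
        round(adamic_adar(GRAPH, *pair), 3),
    )
```

In this toy graph, alice and dave share two co-authors and so score higher on every measure than alice and erin, who share none.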

You will gain some experience running the Link Prediction algorithms. In the query edit pane of Neo4j Browser, execute the browser command: :play gds-data-science-exercises and follow the instructions for the Link Prediction exercise.

Now that you have learned how to execute the link prediction algorithms, you will learn what to do with the results. There are two approaches:

  • Using the measures directly

  • Supervised learning

Using the measures directly

You can use the scores from the link prediction algorithms directly. With this approach, you set a threshold value above which the algorithm would predict that a pair of nodes will have a link.

For example, you might say that every pair of nodes that has a preferential attachment score above 3 would have a link, and any with 3 or less would not.
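
A minimal sketch of that thresholding rule is shown here. The node pairs and their scores are made-up values for illustration, and predict_links is a hypothetical helper, not part of any library.

```python
# Hypothetical preferential attachment scores for candidate node pairs.
scores = {
    ("alice", "dave"): 6,
    ("alice", "erin"): 2,
    ("bob", "erin"): 3,
}

THRESHOLD = 3  # pairs must score strictly above this value


def predict_links(scores, threshold):
    """Return the pairs whose score is strictly above the threshold."""
    return [pair for pair, score in scores.items() if score > threshold]


print(predict_links(scores, THRESHOLD))
```

Only the pair scoring 6 is predicted to have a link; the pair scoring exactly 3 falls on the "3 or less" side of the rule.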

Supervised learning

You can take a supervised learning approach where you use the scores as features to train a binary classifier. The binary classifier then predicts whether a pair of nodes will have a link.
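
To make the idea concrete, here is a small dependency-free sketch. It is not the scikit-learn model built in the exercise: it trains a decision stump (a classifier that tests one feature against one threshold) on made-up data, purely to show scores being used as features with known labels. The data, feature choice and function names are all assumptions for this example.

```python
# Hypothetical labelled data: one row per node pair.
# Features: (common neighbours, preferential attachment); label 1 = link exists.
X = [(3, 12), (2, 9), (2, 6), (0, 2), (0, 3), (1, 2)]
y = [1, 1, 1, 0, 0, 0]


def train_stump(X, y):
    """Return (accuracy, feature index, threshold) with the best training accuracy."""
    best = None
    for i in range(len(X[0])):
        values = sorted({row[i] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            hits = sum((row[i] > threshold) == (label == 1) for row, label in zip(X, y))
            accuracy = hits / len(X)
            # Keep the first candidate that strictly improves on the best so far.
            if best is None or accuracy > best[0]:
                best = (accuracy, i, threshold)
    return best


def predict(model, features):
    """Predict 1 (link) when the chosen feature exceeds the learned threshold."""
    _, index, threshold = model
    return int(features[index] > threshold)


model = train_stump(X, y)
print(model)
print(predict(model, (2, 8)))  # an unseen pair with two common neighbours
print(predict(model, (0, 4)))  # an unseen pair with no common neighbours
```

Unlike the fixed threshold in the direct approach, the cut-off here is learned from labelled examples. A real model would combine many features and be evaluated on held-out data.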

In the next part of this module you will use the supervised learning approach.

Exercise 2: Building a binary classifier

In this exercise, you will build a binary classifier to predict co-authorships using a notebook.

Launch the 04_Predictions.ipynb notebook and follow the steps in this exercise.

Check your understanding

Question 1

Which Link Prediction algorithm "captures the notion that two strangers who have a common friend may be introduced by that friend"?

Select the correct answer.

  • Adamic Adar

  • Common Neighbors

  • PageRank

  • Preferential Attachment

Question 2

Which of these challenges do we need to address when building a binary classifier for Link Prediction?

Select the correct answers.

  • Class Imbalance

  • Clustering cut-off

  • Data Leakage

  • Damping factor

Question 3

Which feature is the most important in our final model?

Select the correct answer.

  • Preferential Attachment

  • Triangles (min)

  • Common Neighbors

  • Louvain


You should now be able to:

  • Describe what Link Prediction is.

  • Use the Link Prediction algorithms in Neo4j.

  • Understand the challenges when building Machine Learning models on graph data.

  • Build a Link Prediction classifier using scikit-learn with features derived from the Neo4j Graph Data Science library.