Link prediction pipelines

This algorithm is in the beta tier. For more information on algorithm tiers, see Graph algorithms.

Link prediction is a common machine learning task applied to graphs: training a model to learn, between pairs of nodes in a graph, where relationships should exist. More precisely, the input to the machine learning model are examples of node pairs. During training, the node pairs are labeled as adjacent or not adjacent.

In GDS, we have Link prediction pipelines which offer an end-to-end workflow, from feature extraction to link prediction. The training pipelines reside in the pipeline catalog. When a training pipeline is executed, a prediction model is created and stored in the model catalog.

A training pipeline is a sequence of three phases:

  1. From the graph three sets of node pairs are derived: feature set, training set, test set. The latter two are labeled.

  2. The nodes in the graph are augmented with new properties by running a series of steps on the graph with only relationships from the feature set.

  3. The train and test sets are used for training a link prediction pipeline. Link features are derived by combining node properties of node pairs.

For the training and test sets, positive examples are selected from the relationships in the graph. The negative examples are sampled from non-adjacent nodes.

One can configure which steps should be included above. The steps execute GDS algorithms that create new node properties. After configuring the node property steps, one can define how to combine node properties of node pairs into link features. The training phase (III) trains multiple model candidates using cross-validation, selects the best one, and reports relevant performance metrics.

After training the pipeline, a prediction model is created. This model includes the node property steps and link feature steps from the training pipeline and uses them to generate the relevant features for predicting new relationships. The prediction model can be applied to infer the probability of the existence of a relationship between two non-adjacent nodes.

Prediction can only be done with a prediction model (not with a training pipeline).

This segment is divided into the following pages: