Machine learning pipeline

This example is a simplified version of the Link Prediction pipeline described in the Machine learning section.

Create the graph

The following Cypher query creates the graph of a small social network in the Neo4j database.

CREATE
  (alice:Person {name: 'Alice', age: 38}),
  (michael:Person {name: 'Michael', age: 67}),
  (karin:Person {name: 'Karin', age: 30}),
  (chris:Person {name: 'Chris', age: 52}),
  (will:Person {name: 'Will', age: 6}),
  (mark:Person {name: 'Mark', age: 32}),
  (greg:Person {name: 'Greg', age: 29}),
  (veselin:Person {name: 'Veselin', age: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin)

The graph looks as follows:

Figure: LP example data.

The next query creates an in-memory graph called 'friends' from the Neo4j graph. Since the Link Prediction pipeline must be trained on undirected relationships, the :KNOWS relationships are projected with the UNDIRECTED orientation, which discards their direction.

CALL gds.graph.project(
  'friends',
  {
    Person: {
      properties: ['age']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
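
As a quick check, you can list the projected graph and verify its node and relationship counts. Note that with the UNDIRECTED orientation each :KNOWS relationship is typically stored in both directions, so the reported relationship count is double the number of :KNOWS relationships in the database.

CALL gds.graph.list('friends')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount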

Configure the pipeline

You can configure a machine learning pipeline with a sequence of Cypher queries.

The following configuration is simplified for convenience, so the trained model is not expected to perform particularly well.

CALL gds.beta.pipeline.linkPrediction.create('pipe');  (1)

CALL gds.beta.pipeline.linkPrediction.addFeature(  (2)
  'pipe',
  'cosine',
  {
    nodeProperties: ['age']
  }
);

CALL gds.beta.pipeline.linkPrediction.configureSplit(  (3)
  'pipe',
  {
    testFraction: 0.25,
    trainFraction: 0.6,
    validationFolds: 3
  }
);

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe');  (4)
1 Create the pipeline and add it to the pipeline catalog.
2 Add the link features (only age here) and a feature type (cosine here).
3 Configure the train-test split and the number of folds for cross-validation.
4 Add a model candidate (a logistic regression with no further configuration here).
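
Before training, you can inspect the pipeline in the pipeline catalog. This is a minimal sketch assuming the GDS 2.x pipeline catalog procedures; the exact contents of pipelineInfo vary between versions.

CALL gds.beta.pipeline.list('pipe')
YIELD pipelineName, pipelineType, pipelineInfo
RETURN pipelineName, pipelineType, pipelineInfo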

Train a model

Once configured, the pipeline is ready to train a model. The training process selects the best-performing model candidate, evaluates it with the specified metrics, and stores the trained model in the model catalog.

CALL gds.beta.pipeline.linkPrediction.train(
  'friends',  (1)
  {
    pipeline: 'pipe',  (2)
    modelName: 'lp-pipeline-model',  (3)
    targetRelationshipType: 'KNOWS',  (4)
    metrics: ['AUCPR']  (5)
  }
)
YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,  (6)
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,  (7)
  modelInfo.metrics.AUCPR.validation.avg AS avgValidationScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore
1 Name of the projected graph to use for training.
2 Name of the configured pipeline.
3 Name of the model to train.
4 Name of the relationship to train the model on.
5 Metrics used to evaluate the models (AUCPR here).
6 Parameters of the best performing model returned by the training process.
7 Evaluated metrics (here for AUCPR) of the best performing model returned by the training process.
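
The trained model is stored in the model catalog under the name lp-pipeline-model. As a sketch (assuming the GDS 2.x model catalog procedures), you can verify that it exists before running predictions:

CALL gds.beta.model.exists('lp-pipeline-model')
YIELD modelName, modelType, exists
RETURN modelName, modelType, exists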

Use the model for prediction

You can use the trained model to predict the probability that a link exists between two nodes in a projected graph.

CALL gds.beta.pipeline.linkPrediction.predict.stream(  (1)
  'friends',  (2)
  {
    modelName: 'lp-pipeline-model',  (3)
    topN: 5  (4)
  }
)
YIELD node1, node2, probability
RETURN
  gds.util.asNode(node1).name AS person1,
  gds.util.asNode(node2).name AS person2,
  probability
ORDER BY probability DESC, person1
1 Run the prediction in stream mode (return the predicted links as query results).
2 Name of the projected graph to run the prediction on.
3 Name of the model to use for prediction.
4 Maximum number of predicted relationships to output.
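
Instead of streaming the results, the prediction can also run in mutate mode, which adds the top predicted relationships to the in-memory graph. The following is a minimal sketch; the relationship type KNOWS_PREDICTED and the threshold value are illustrative assumptions, not recommended settings.

CALL gds.beta.pipeline.linkPrediction.predict.mutate(
  'friends',
  {
    modelName: 'lp-pipeline-model',
    mutateRelationshipType: 'KNOWS_PREDICTED',
    topN: 5,
    threshold: 0.45
  }
)
YIELD relationshipsWritten
RETURN relationshipsWritten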

Next steps

Try to improve the training performance by adding different model candidates, using more node properties as link features, or configuring autotuning. A minimal sketch of such extensions follows.
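
For example, assuming GDS 2.x (where node property steps and tunable parameter ranges are available), you could add a FastRP embedding as a node property step, use it as an extra link feature, and register a second logistic regression candidate with a tunable penalty. The parameter values below are illustrative assumptions, not recommended settings.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
  embeddingDimension: 16,
  mutateProperty: 'embedding',
  randomSeed: 42
});

CALL gds.beta.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
  nodeProperties: ['embedding']
});

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe', {
  penalty: {range: [0.1, 1.0]}
});

Parameter ranges such as the penalty above are resolved by the autotuner during training.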