Machine learning pipeline

This example is a simplified version of the Link Prediction pipeline described in the Machine learning section.

Create the graph

The following Cypher query creates the graph of a small social network in the Neo4j database.

CREATE
  (alice:Person {name: 'Alice', age: 38}),
  (michael:Person {name: 'Michael', age: 67}),
  (karin:Person {name: 'Karin', age: 30}),
  (chris:Person {name: 'Chris', age: 52}),
  (will:Person {name: 'Will', age: 6}),
  (mark:Person {name: 'Mark', age: 32}),
  (greg:Person {name: 'Greg', age: 29}),
  (veselin:Person {name: 'Veselin', age: 3}),

  (alice)-[:KNOWS]->(michael),
  (michael)-[:KNOWS]->(karin),
  (michael)-[:KNOWS]->(chris),
  (michael)-[:KNOWS]->(greg),
  (will)-[:KNOWS]->(michael),
  (will)-[:KNOWS]->(chris),
  (mark)-[:KNOWS]->(michael),
  (mark)-[:KNOWS]->(will),
  (greg)-[:KNOWS]->(chris),
  (veselin)-[:KNOWS]->(chris),
  (karin)-[:KNOWS]->(veselin),
  (chris)-[:KNOWS]->(karin)

The graph looks as follows:

Figure: LP example data.

The next query creates an in-memory graph called 'friends' from the Neo4j graph. Since the Link Prediction pipeline must be trained on undirected relationships, the :KNOWS relationships are projected with the UNDIRECTED orientation, which discards their direction.

CALL gds.graph.project(
  'friends',
  {
    Person: {
      properties: ['age']
    }
  },
  {
    KNOWS: {
      orientation: 'UNDIRECTED'
    }
  }
)
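
As a quick check, you can list the projected graph and verify its node and relationship counts. Note that with the UNDIRECTED orientation each :KNOWS relationship is typically stored in both directions, so the reported relationship count is double the number of :KNOWS relationships in the database.

CALL gds.graph.list('friends')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount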

Configure the pipeline

You can configure a machine learning pipeline with a sequence of Cypher queries.

The following configuration is simplified for convenience, so the trained model is not expected to perform particularly well.

CALL gds.beta.pipeline.linkPrediction.create('pipe');  (1)

CALL gds.beta.pipeline.linkPrediction.addFeature(  (2)
  'pipe',
  'cosine',
  {
    nodeProperties: ['age']
  }
);

CALL gds.beta.pipeline.linkPrediction.configureSplit(  (3)
  'pipe',
  {
    testFraction: 0.25,
    trainFraction: 0.6,
    validationFolds: 3
  }
);

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe');  (4)
1 Create the pipeline and add it to the pipeline catalog.
2 Add the link features (only age here) and a feature type (cosine here).
3 Configure the train-test split and the number of folds for cross-validation.
4 Add a model candidate (a logistic regression with no further configuration here).
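
Before training, you can inspect the pipeline in the pipeline catalog. This is a minimal sketch assuming the GDS 2.x pipeline catalog procedures; the exact contents of pipelineInfo vary between versions.

CALL gds.beta.pipeline.list('pipe')
YIELD pipelineName, pipelineType, pipelineInfo
RETURN pipelineName, pipelineType, pipelineInfo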

Train a model

Once configured, the pipeline is ready to train a model. The training process selects the best-performing model candidate, evaluates it with the specified metrics, and stores the trained model in the model catalog.

CALL gds.beta.pipeline.linkPrediction.train(
  'friends',  (1)
  {
    pipeline: 'pipe',  (2)
    modelName: 'lp-pipeline-model',  (3)
    targetRelationshipType: 'KNOWS',  (4)
    metrics: ['AUCPR']  (5)
  }
)
YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,  (6)
  modelInfo.metrics.AUCPR.train.avg AS avgTrainScore,  (7)
  modelInfo.metrics.AUCPR.validation.avg AS avgValidationScore,
  modelInfo.metrics.AUCPR.outerTrain AS outerTrainScore,
  modelInfo.metrics.AUCPR.test AS testScore
1 Name of the projected graph to use for training.
2 Name of the configured pipeline.
3 Name of the model to train.
4 Name of the relationship to train the model on.
5 Metrics used to evaluate the models (AUCPR here).
6 Parameters of the best performing model returned by the training process.
7 Evaluated metrics (here for AUCPR) of the best performing model returned by the training process.
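
The trained model is stored in the model catalog under the name lp-pipeline-model. As a sketch (assuming the GDS 2.x model catalog procedures), you can verify that it exists before running predictions:

CALL gds.beta.model.exists('lp-pipeline-model')
YIELD modelName, modelType, exists
RETURN modelName, modelType, exists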

Use the model for prediction

You can use the trained model to predict the probability that a link exists between two nodes in a projected graph.

CALL gds.beta.pipeline.linkPrediction.predict.stream(  (1)
  'friends',  (2)
  {
    modelName: 'lp-pipeline-model',  (3)
    topN: 5  (4)
  }
)
YIELD node1, node2, probability
RETURN
  gds.util.asNode(node1).name AS person1,
  gds.util.asNode(node2).name AS person2,
  probability
ORDER BY probability DESC, person1
1 Run the prediction in stream mode (return the predicted links as query results).
2 Name of the projected graph to run the prediction on.
3 Name of the model to use for prediction.
4 Maximum number of predicted relationships to output.
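
Instead of streaming the results, the prediction can also run in mutate mode, which adds the top predicted relationships to the in-memory graph. The following is a minimal sketch; the relationship type KNOWS_PREDICTED and the threshold value are illustrative assumptions, not recommended settings.

CALL gds.beta.pipeline.linkPrediction.predict.mutate(
  'friends',
  {
    modelName: 'lp-pipeline-model',
    mutateRelationshipType: 'KNOWS_PREDICTED',
    topN: 5,
    threshold: 0.45
  }
)
YIELD relationshipsWritten
RETURN relationshipsWritten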

Next steps

Try to improve the training performance by adding different model candidates, using more node properties as link features, or configuring autotuning. A minimal sketch of such extensions follows.
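
For example, assuming GDS 2.x (where node property steps and tunable parameter ranges are available), you could add a FastRP embedding as a node property step, use it as an extra link feature, and register a second logistic regression candidate with a tunable penalty. The parameter values below are illustrative assumptions, not recommended settings.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
  embeddingDimension: 16,
  mutateProperty: 'embedding',
  randomSeed: 42
});

CALL gds.beta.pipeline.linkPrediction.addFeature('pipe', 'hadamard', {
  nodeProperties: ['embedding']
});

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe', {
  penalty: {range: [0.1, 1.0]}
});

Parameter ranges such as the penalty above are resolved by the autotuner during training.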