Split Relationships

This section describes the Split Relationships algorithm in the Neo4j Graph Data Science library.

1. Introduction

The Split Relationships algorithm is a utility algorithm used to pre-process a graph for model training. It splits the relationships into a holdout set and a remaining set. The holdout set contains two classes of relationships: positive, i.e., existing relationships, and negative, i.e., non-existing relationships. The class is indicated by a label property on the relationships. This enables the holdout set to be used for training or testing a machine learning model. Both the holdout and the remaining relationships are added to the in-memory graph.
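The split described above can be sketched in a few lines of Python. This is an illustrative approximation only, not the GDS implementation: the function name `split_relationships`, the rounding of the sample counts, and the uniform sampling of negative node pairs are all assumptions made for the sketch.

```python
import random
from itertools import combinations


def split_relationships(nodes, relationships, holdout_fraction,
                        negative_sampling_ratio, seed=None):
    """Illustrative split of undirected relationships (hypothetical helper).

    Returns (holdout, remaining). holdout is a list of (source, target, label)
    triples where label 1 marks an existing relationship and label 0 marks a
    sampled, non-existing one. remaining is a list of (source, target) pairs.
    """
    rng = random.Random(seed)
    # Normalize each undirected relationship to a sorted pair so (a, b) == (b, a).
    existing = sorted({tuple(sorted(rel)) for rel in relationships})
    existing_set = set(existing)

    # Positive samples: a random fraction of the existing relationships.
    # Rounding to the nearest integer is an assumption of this sketch.
    positive_count = round(len(existing) * holdout_fraction)
    positives = rng.sample(existing, positive_count)
    positive_set = set(positives)
    remaining = [rel for rel in existing if rel not in positive_set]

    # Negative samples: node pairs that are not connected in the input graph.
    candidates = [pair for pair in combinations(sorted(nodes), 2)
                  if pair not in existing_set]
    negative_count = min(round(positive_count * negative_sampling_ratio),
                         len(candidates))
    negatives = rng.sample(candidates, negative_count)

    holdout = [(s, t, 1) for s, t in positives] + [(s, t, 0) for s, t in negatives]
    return holdout, remaining


# Usage on a six-node path graph: one positive, two negatives, four remaining.
holdout, remaining = split_relationships(
    range(6), [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)],
    holdout_fraction=0.2, negative_sampling_ratio=2.0, seed=1337)
print(len(holdout), len(remaining))
```

Which relationships end up in each set depends on the random generator, so the concrete pairs chosen here will generally differ from what the library selects for the same seed.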

2. Syntax

This section covers the syntax used to execute the Split Relationships algorithm in each of its execution modes. We describe the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Example 1. Split Relationships syntax per mode
Run Split Relationships in mutate mode on a named graph.
CALL gds.alpha.ml.splitRelationships.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  createMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 1. Parameters
Name | Type | Default | Optional | Description
graphName | String | n/a | no | The name of a graph stored in the catalog.
configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering.

Table 2. General configuration for algorithm execution on a named graph.
Name | Type | Default | Optional | Description
nodeLabels | String[] | ['*'] | yes | Filter the named graph using the given node labels.
relationshipTypes | String[] | ['*'] | yes | Filter the named graph using the given relationship types.
concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm.

Table 3. Algorithm specific configuration
Name | Type | Default | Optional | Description
holdoutFraction | Float | n/a | no | The fraction of all relationships used as the holdout set.
negativeSamplingRatio | Float | n/a | no | The desired ratio of negative to positive samples in the holdout set.
holdoutRelationshipType | String | n/a | no | Relationship type used for the holdout set. Each relationship has a property label indicating whether it is a positive or negative sample.
remainingRelationshipType | String | n/a | no | Relationship type used for the remaining set.
nonNegativeRelationshipTypes | String[] | n/a | yes | Additional relationship types that are not used for negative sampling.
randomSeed | Integer | n/a | yes | An optional seed value for the random selection of relationships.

Table 4. Results
Name | Type | Description
createMillis | Integer | Milliseconds for loading data.
computeMillis | Integer | Milliseconds for running the algorithm.
mutateMillis | Integer | Milliseconds for adding the relationships to the in-memory graph.
relationshipsWritten | Integer | The number of relationships created by the algorithm.
configuration | Map | The configuration used for running the algorithm.

3. Examples

In this section we show examples of running the Split Relationships algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to use the algorithm in a real setting. We do this on a small graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph

Consider the graph created by the following Cypher statement:

CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n0)-[:TYPE]->(n1),
    (n1)-[:TYPE]->(n2),
    (n2)-[:TYPE]->(n3),
    (n3)-[:TYPE]->(n4),
    (n4)-[:TYPE]->(n5)

Given the above graph, we want to use 20% of the relationships as the holdout set. The holdout set contains two classes: positive and negative. Positive relationships are randomly selected from the existing relationships and marked with the property label: 1. Negative relationships are randomly generated, i.e., they do not exist in the input graph, and are marked with the property label: 0. With a negativeSamplingRatio of 2.0, the holdout set contains twice as many negative as positive relationships. We first project the graph into the catalog under the name 'graph', using UNDIRECTED orientation:

CALL gds.graph.create(
    'graph',
    'Label',
    { TYPE: { orientation: 'UNDIRECTED' } }
)

Now we can run the algorithm by specifying the holdout fraction, the negative sampling ratio, and the output relationship types. We use a random seed value in order to produce reproducible results.

CALL gds.alpha.ml.splitRelationships.mutate('graph', {
    holdoutRelationshipType: 'TYPE_HOLDOUT',
    remainingRelationshipType: 'TYPE_REMAINING',
    holdoutFraction: 0.2,
    negativeSamplingRatio: 2.0,
    randomSeed: 1337
}) YIELD relationshipsWritten
Table 5. Results
relationshipsWritten

11

The input graph consists of 5 relationships. We use 20% (1 relationship) of the relationships to create the TYPE_HOLDOUT relationship type (holdout set). This creates 1 relationship with a positive label. Because negativeSamplingRatio is 2.0, 2 relationships with a negative label are also created. Finally, the TYPE_REMAINING relationship type is formed from the remaining 80% (4 relationships). Because these are written with UNDIRECTED orientation, they count as 8 written relationships. In total, 1 + 2 + 8 = 11 relationships are written.
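The arithmetic behind this count can be reproduced with a short Python sketch. The helper name `expected_relationships_written` and the rounding to the nearest integer are assumptions for illustration; the counting rule (holdout relationships once, remaining undirected relationships twice) follows the explanation above.

```python
def expected_relationships_written(relationship_count, holdout_fraction,
                                   negative_sampling_ratio, undirected=True):
    """Estimate relationshipsWritten for a split (hypothetical helper)."""
    # Positive holdout samples are taken from the existing relationships.
    positives = round(relationship_count * holdout_fraction)
    # Negative holdout samples are generated relative to the positives.
    negatives = round(positives * negative_sampling_ratio)
    # Everything that is not a positive sample stays in the remaining set.
    remaining = relationship_count - positives
    # Each remaining undirected relationship counts as two written relationships.
    remaining_written = remaining * 2 if undirected else remaining
    return positives + negatives + remaining_written


print(expected_relationships_written(5, 0.2, 2.0))  # prints 11
```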

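The mutated graph will look similar to the following graph when filtered by the TYPE_HOLDOUT and TYPE_REMAINING relationship types. Note that the listing shows only one of the two negative holdout relationships.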
CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n2)-[:TYPE_HOLDOUT { label: 0 }]->(n5), // negative, non-existing
    (n3)-[:TYPE_HOLDOUT { label: 1 }]->(n2), // positive, existing

    (n0)<-[:TYPE_REMAINING]-(n1),
    (n1)<-[:TYPE_REMAINING]-(n2),
    (n3)<-[:TYPE_REMAINING]-(n4),
    (n4)<-[:TYPE_REMAINING]-(n5),
    (n0)-[:TYPE_REMAINING]->(n1),
    (n1)-[:TYPE_REMAINING]->(n2),
    (n3)-[:TYPE_REMAINING]->(n4),
    (n4)-[:TYPE_REMAINING]->(n5)
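The properties one would expect of such a split can be checked mechanically. The following Python sketch transcribes the relationships from the listing above by hand and verifies them against the input graph; the function name `check_split` is hypothetical and the checks are only a sanity test, not part of the library.

```python
def check_split(original, holdout, remaining_directed):
    """Sanity-check a split result (hypothetical helper, undirected semantics)."""
    original = {frozenset(pair) for pair in original}
    positives = {frozenset(pair) for pair, label in holdout if label == 1}
    negatives = {frozenset(pair) for pair, label in holdout if label == 0}
    remaining = {frozenset(pair) for pair in remaining_directed}

    # Positive samples exist in the input graph, negative samples do not.
    assert positives <= original
    assert negatives.isdisjoint(original)
    # Holdout positives and remaining relationships partition the input graph.
    assert remaining.isdisjoint(positives)
    assert remaining | positives == original
    # Undirected remaining relationships are stored in both directions.
    stored = set(remaining_directed)
    assert all((target, source) in stored for source, target in stored)
    return True


# Node indices correspond to n0..n5 in the listing.
original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
holdout = [((2, 5), 0), ((3, 2), 1)]
remaining_directed = [(1, 0), (2, 1), (4, 3), (5, 4),
                      (0, 1), (1, 2), (3, 4), (4, 5)]
print(check_split(original, holdout, remaining_directed))  # prints True
```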