Split Relationships

Introduction

The Split relationships algorithm is a utility algorithm that is used to pre-process a graph for model training. It splits the relationships into a holdout set and a remaining set. The holdout set is divided into two classes: positive, i.e., existing relationships, and negative, i.e., non-existing relationships. The class is indicated by a label property on the relationships. This enables the holdout set to be used for training or testing a machine learning model. Both, the holdout and the remaining relationships are added to the projected graph.

If the configuration option relationshipWeightProperty is specified, then the corresponding relationship property is preserved on the remaining set of relationships. Note however that the holdout set only has the label property; it is not possible to induce relationship weights on the holdout set as it also contains negative samples.

Syntax

This section covers the syntax used to execute the Split Relationships algorithm in each of its execution modes. We are describing the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Split Relationships syntax per mode

Run Split Relationships in mutate mode on a named graph.

CALL gds.splitRelationships.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map

Table 1. Parameters
Name	Type	Default	Optional	Description
graphName	String	`n/a`	no	The name of a graph stored in the catalog.
configuration	Map	`{}`	yes	Configuration for algorithm-specifics and/or graph filtering.

Table 2. Configuration
Name	Type	Default	Optional	Description
sourceNodeLabels	List of String	`['*']`	yes	Filter the relationships where the sourceNode has at least one of the sourceNodeLabels.
targetNodeLabels	List of String	`['*']`	yes	Filter the relationships where the targetNode has at least one of the targetNodeLabels.
relationshipTypes	List of String	`['*']`	yes	Filter the named graph using the given relationship types.
concurrency	Integer	`4 ^[1]`	yes	The number of concurrent threads used for running the algorithm.
jobId	String	`Generated internally`	yes	An ID that can be provided to more easily track the algorithm’s progress.
logProgress	Boolean	`true`	yes	If disabled the progress percentage will not be logged.
holdoutFraction	Float	`n/a`	no	The fraction of valid relationships being used as holdout set. The remaining `1 - holdoutFraction` of the valid relationships are added to the remaining set.
negativeSamplingRatio	Float	`n/a`	no	The desired ratio of negative to positive samples in holdout set.
holdoutRelationshipType	String	`n/a`	no	Relationship type used for the holdout set. Each relationship has a property `label` indicating whether it is a positive or negative sample.
remainingRelationshipType	String	`n/a`	no	Relationships where one node has none of the source or target labels will be omitted. All invalid relationship are added to the remaining set.
nonNegativeRelationshipTypes	List of String	`n/a`	yes	Additional relationship types that are not used for negative sampling.
relationshipWeightProperty	String	`null`	yes	Name of the relationship property that is inherited by the `remainingRelationshipType`.
randomSeed	Integer	`n/a`	yes	An optional seed value for the random selection of relationships.
1. In a GDS Session, the default is the number of available processors.

Table 3. Results
Name	Type	Description
`preProcessingMillis`	Integer	Milliseconds for preprocessing the data.
`computeMillis`	Integer	Milliseconds for running the algorithm.
`mutateMillis`	Integer	Milliseconds for adding properties to the projected graph.
`relationshipsWritten`	Integer	The number of relationships created by the algorithm.
`configuration`	Map	The configuration used for running the algorithm.

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we will show examples of running the Split Relationships algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small graph of a handful nodes connected in a particular pattern. The example graph looks like this:

Consider the graph created by the following Cypher statement:

CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n0)-[:TYPE { prop: 0} ]->(n1),
    (n1)-[:TYPE { prop: 1} ]->(n2),
    (n2)-[:TYPE { prop: 4} ]->(n3),
    (n3)-[:TYPE { prop: 9} ]->(n4),
    (n4)-[:TYPE { prop: 16} ]->(n5)

Given the above graph, we want to use 20% of the relationships as holdout set. The holdout set will be split into two same-sized classes: positive and negative. Positive relationships will be randomly selected from the existing relationships and marked with a property label: 1. Negative relationships will be randomly generated, i.e., they do not exist in the input graph, and are marked with a property label: 0.

MATCH (source:Label)-[r:TYPE]->(target:Label)
RETURN gds.graph.project(
  'graph',
  source,
  target,
  {
    sourceNodeLabels: ['Label'],
    targetNodeLabels: ['Label'],
    relationshipType: 'TYPE'
  },
  { undirectedRelationshipTypes: ['TYPE'] }
)

Now we can run the algorithm by specifying the appropriate ratio and the output relationship types. We use a random seed value in order to produce deterministic results.

CALL gds.splitRelationships.mutate('graph', {
    holdoutRelationshipType: 'TYPE_HOLDOUT',
    remainingRelationshipType: 'TYPE_REMAINING',
    holdoutFraction: 0.2,
    negativeSamplingRatio: 1.0,
    randomSeed: 1337
}) YIELD relationshipsWritten

Table 4. Results
relationshipsWritten
10

The input graph consists of 5 relationships. We use 20% (1 relationship) of the relationships to create the 'TYPE_HOLDOUT' relationship type (holdout set). This creates 1 relationship with positive label. Because of the negativeSamplingRatio, one relationship with negative label is also created. Finally, the TYPE_REMAINING relationship type is formed with the remaining 80% (4 relationships). These are written as orientation UNDIRECTED which counts as writing 8 relationships.

The mutated graph will look like the following graph when filtered by the TEST and TRAIN relationship.

CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n2)-[:TYPE_HOLDOUT { label: 0 } ]->(n5), // negative, non-existing
    (n3)-[:TYPE_HOLDOUT { label: 1 } ]->(n2), // positive, existing

    (n0)<-[:TYPE_REMAINING { prop: 0} ]-(n1),
    (n1)<-[:TYPE_REMAINING { prop: 1} ]-(n2),
    (n3)<-[:TYPE_REMAINING { prop: 9} ]-(n4),
    (n4)<-[:TYPE_REMAINING { prop: 16} ]-(n5),
    (n0)-[:TYPE_REMAINING { prop: 0} ]->(n1),
    (n1)-[:TYPE_REMAINING { prop: 1} ]->(n2),
    (n3)-[:TYPE_REMAINING { prop: 9} ]->(n4),
    (n4)-[:TYPE_REMAINING { prop: 16} ]->(n5)