K-Means Clustering
The following content is a preview of an upcoming release. For documentation covering current releases, see https://neo4j.com/docs. |
This section describes the K-Means algorithm in the Neo4j Graph Data Science library.
1. Introduction
K-Means clustering is an unsupervised learning algorithm that is used to solve clustering problems.
It follows a simple procedure of classifying a given data set into a number of clusters, defined by the parameter k
.
The clusters are then positioned as points and all observations or data points are associated with the nearest cluster, computed, adjusted and then the process starts over using the new adjustments until a desired result is reached.
For more information on this algorithm, see:
2. Syntax
- Stream mode
- Stats mode
- Mutate mode
CALL gds.alpha.kmeans.stream(
graphName: String,
configuration: Map
)
YIELD
nodeId: Integer,
communityId: Integer
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
List of String |
|
yes |
Filter the named graph using the given node labels. |
|
List of String |
|
yes |
Filter the named graph using the given relationship types. |
|
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
String |
|
yes |
An ID that can be provided to more easily track the algorithm’s progress. |
|
nodeProperty |
String |
|
no |
A node property to be used by the algorithm. |
k |
Integer |
|
yes |
Number of desired clusters. |
maxIterations |
Integer |
|
yes |
The maximum number of iterations of K-Means to run. |
deltaThreshold |
Float |
|
yes |
Value as a percentage to determine when to stop early. If fewer than 'deltaThreshold * |nodes|' nodes change their cluster , the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). |
Name | Type | Description |
---|---|---|
nodeId |
Integer |
Node ID. |
communityId |
Integer |
The community ID. |
CALL gds.alpha.kmeans.stats(
graphName: String,
configuration: Map
)
YIELD
preProcessingMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
communityDistribution: Map,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
List of String |
|
yes |
Filter the named graph using the given node labels. |
|
List of String |
|
yes |
Filter the named graph using the given relationship types. |
|
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
String |
|
yes |
An ID that can be provided to more easily track the algorithm’s progress. |
|
nodeProperty |
String |
|
no |
A node property to be used by the algorithm. |
k |
Integer |
|
yes |
Number of desired clusters. |
maxIterations |
Integer |
|
yes |
The maximum number of iterations of K-Means to run. |
deltaThreshold |
Float |
|
yes |
Value as a percentage to determine when to stop early. If fewer than 'deltaThreshold * |nodes|' nodes change their cluster , the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). |
Name | Type | Description |
---|---|---|
preProcessingMillis |
Integer |
Milliseconds for preprocessing the data. |
computeMillis |
Integer |
Milliseconds for running the algorithm. |
postProcessingMillis |
Integer |
Milliseconds for computing percentiles and community count. |
communityDistribution |
Map |
Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of community size for the last level. |
configuration |
Map |
The configuration used for running the algorithm. |
CALL gds.alpha.kmeans.mutate(
graphName: String,
configuration: Map
)
YIELD
preProcessingMillis: Integer,
computeMillis: Integer,
mutateMillis: Integer,
postProcessingMillis: Integer,
nodePropertiesWritten: Integer,
communityDistribution: Map,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
mutateProperty |
String |
|
no |
The node property in the GDS graph to which the cluster is written. |
List of String |
|
yes |
Filter the named graph using the given node labels. |
|
List of String |
|
yes |
Filter the named graph using the given relationship types. |
|
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
String |
|
yes |
An ID that can be provided to more easily track the algorithm’s progress. |
|
nodeProperty |
String |
|
no |
A node property to be used by the algorithm. |
k |
Integer |
|
yes |
Number of desired clusters. |
maxIterations |
Integer |
|
yes |
The maximum number of iterations of K-Means to run. |
deltaThreshold |
Float |
|
yes |
Value as a percentage to determine when to stop early. If fewer than 'deltaThreshold * |nodes|' nodes change their cluster , the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). |
Name | Type | Description |
---|---|---|
preProcessingMillis |
Integer |
Milliseconds for preprocessing the data. |
computeMillis |
Integer |
Milliseconds for running the algorithm. |
mutateMillis |
Integer |
Milliseconds for adding properties to the projected graph. |
postProcessingMillis |
Integer |
Milliseconds for computing percentiles and community count. |
nodePropertiesWritten |
Integer |
Number of properties added to the projected graph. |
communityDistribution |
Map |
Map containing min, max, mean as well as p50, p75, p90, p95, p99 and p999 percentile values of community size for the last level. |
configuration |
Map |
The configuration used for running the algorithm. |
3. Examples
In this section we will show examples of running the K-Means algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide in how to make use of the algorithm in a real setting. We will do this on a small cities graph of a handful nodes connected in a particular pattern. The example graph looks like this:
CREATE
(:City {name: 'Surbiton', coordinates: [51.39148, -0.29825]}),
(:City {name: 'Liverpool', coordinates: [53.41058, -2.97794]}),
(:City {name: 'Kingston upon Thames', coordinates: [51.41259, -0.2974]}),
(:City {name: 'Sliven', coordinates: [42.68583, 26.32917]}),
(:City {name: 'Solna', coordinates: [59.36004, 18.00086]}),
(:City {name: 'Örkelljunga', coordinates: [56.28338, 13.27773]}),
(:City {name: 'Malmö', coordinates: [55.60587, 13.00073]}),
(:City {name: 'Xánthi', coordinates: [41.13488, 24.888]});
This graph is composed of various City nodes, in three global locations - United Kingdom, Sweden and the Balkan region in Europe.
We can now project the graph and store it in the graph catalog.
We load the City
node label with coordinates
node property.
In the examples below we will use named graphs and native projections as the norm. However, Cypher projections can also be used. |
CALL gds.graph.project(
'cities',
{
City: {
properties: 'coordinates'
}
},
'*'
)
In the following examples we will demonstrate using the K-Means algorithm on this graph to find communities of cities that are close to each other geographically.
3.1. Stream
In the stream
execution mode, the algorithm returns the cluster for each node.
This allows us to inspect the results directly or post-process them in Cypher without any side effects.
For more details on the stream
mode in general, see Stream.
CALL gds.alpha.kmeans.stream('cities', {
nodeProperty: 'coordinates',
k: 3,
randomSeed: 42
})
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS name, communityId
ORDER BY communityId, name ASC
name | communityId |
---|---|
"Kingston upon Thames" |
0 |
"Liverpool" |
0 |
"Surbiton" |
0 |
"Sliven" |
1 |
"Xánthi" |
1 |
"Malmö" |
2 |
"Solna" |
2 |
"Örkelljunga" |
2 |
In the example above we can see that the cities are geographically clustered together.
3.2. Stats
In the stats
execution mode, the algorithm returns a single row containing a summary of the algorithm result.
This execution mode does not have any side effects.
It can be useful for evaluating algorithm performance by inspecting the computeMillis
return item.
In the examples below we will omit returning the timings.
The full signature of the procedure can be found in the syntax section.
For more details on the stats
mode in general, see Stats.
CALL gds.alpha.kmeans.stats('cities', {
nodeProperty: 'coordinates',
k: 3,
randomSeed: 42
})
YIELD communityDistribution
communityDistribution |
---|
{ "p99": 3, "min": 2, "max": 3, "mean": 2.6666666666666665, "p90": 3, "p50": 3, "p999": 3, "p95": 3, "p75": 3 } |
3.3. Mutate
The mutate
execution mode extends the stats
mode with an important side effect: updating the named graph with a new node property containing the cluster for that node.
The name of the new property is specified using the mandatory configuration parameter mutateProperty
.
The result is a single summary row, similar to stats
, but with some additional metrics.
The mutate
mode is especially useful when multiple algorithms are used in conjunction.
For more details on the mutate
mode in general, see Mutate.
cities
graph:CALL gds.alpha.kmeans.mutate('cities', {
nodeProperty: 'coordinates',
k: 3,
randomSeed: 42,
mutateProperty: 'kmeans'
})
YIELD communityDistribution
communityDistribution |
---|
{ "p99": 3, "min": 2, "max": 3, "mean": 2.6666666666666665, "p90": 3, "p50": 3, "p999": 3, "p95": 3, "p75": 3 } |
In mutate
mode, only a single row is returned by the procedure.
The result is written to the GDS in-memory graph instead of the Neo4j database.
Was this page helpful?