K-Means Clustering

Introduction

The K-Means Clustering algorithm is used to partition nodes into clusters based on their properties. The algorithm groups nodes such that nodes within the same cluster have similar property values, while nodes in different clusters are dissimilar.

K-Means operates by iteratively assigning nodes to the nearest cluster centroid and then recalculating the centroids based on the current cluster assignments. The algorithm requires specifying the number of clusters k in advance and works with numeric array properties such as coordinates, embeddings, or feature vectors.

The algorithm follows these steps until convergence or reaching the maximum number of iterations. First, it initializes k cluster centroids randomly or based on provided seed centroids. Then it assigns each node to the cluster with the nearest centroid based on Euclidean distance. Next, it recalculates each centroid as the mean of all nodes assigned to that cluster. The algorithm repeats the assignment and update steps until the cluster assignments stabilize or the maximum iterations are reached.

The number of iterations is limited by the configuration parameter maxIterations. The algorithm may stop earlier if the cluster assignments change by less than the threshold specified by the deltaThreshold parameter.

K-Means is particularly useful for customer segmentation, geographic clustering of locations, grouping similar entities in feature space, and as a preprocessing step for other machine learning tasks. The algorithm scales well to large datasets and provides interpretable results through the cluster centroids.

For more information on this algorithm, see:

Algorithm considerations

The K-Means algorithm has several important characteristics to consider. The number of clusters k must be specified in advance, which often requires domain knowledge or empirical methods like the elbow method to determine. Results can vary based on the initial centroid placement since the algorithm is sensitive to initialization. Running the algorithm multiple times with different random seeds via the numberOfRestarts parameter can help find better solutions.

The algorithm uses Euclidean distance as its metric, which assumes all dimensions are equally important and on comparable scales. Feature normalization or standardization may be necessary when dimensions have different ranges or units. K-Means does not always converge to the global optimum and instead finds local optima, which can differ between runs with different initializations.

The algorithm works best with spherical clusters of similar size and may struggle with elongated or irregularly shaped clusters. For evaluating cluster quality, the optional computeSilhouette parameter can be enabled, though this adds computational overhead as it requires calculating distances between all pairs of nodes.

Syntax

This section covers the syntax used to execute the K-Means algorithm.

Run K-Means Clustering.
CALL Neo4j_Graph_Analytics.graph.kmeans(
  'CPU_X64_XS',                    (1)
  {
    ['defaultTablePrefix': '...',] (2)
    'project': {...},              (3)
    'compute': {...},              (4)
    'write':   {...}               (5)
  }
);
1 Compute pool selector.
2 Optional prefix for table references.
3 Project config.
4 Compute config.
5 Write config.
Table 1. Parameters
Name Type Default Optional Description

computePoolSelector

String

n/a

no

The selector for the compute pool on which to run the K-Means job.

configuration

Map

{}

no

Configuration for graph project, algorithm compute and result write back.

The configuration map consists of the following three entries.

For more details on below Project configuration, refer to the Project documentation.
Table 2. Project configuration
Name Type

nodeTables

List of node tables.

relationshipTables

Map of relationship types to relationship tables.

Table 3. Compute configuration
Name Type Default Optional Description

nodeProperty

String

n/a

no

The name of the node property array to use for clustering. Must be a numeric array property representing coordinates or feature vectors.

k

Integer

10

yes

The number of clusters to create. Must be greater than 0 and less than or equal to the number of nodes in the graph.

maxIterations

Integer

10

yes

The maximum number of iterations to run. The algorithm will stop earlier if it converges before reaching this limit.

deltaThreshold

Float

0.05

yes

The convergence threshold as a percentage. If the relative change in cluster assignments falls below this value, the algorithm stops iterating.

numberOfRestarts

Integer

1

yes

The number of times to run K-Means with different random initializations. The best result with the lowest distortion is kept.

randomSeed

Integer

n/a

yes

The seed value to control the randomness of the algorithm. Setting this ensures reproducible results across multiple runs.

computeSilhouette

Boolean

false

yes

Whether to compute the silhouette score for evaluating cluster quality. Enabling this adds computational overhead.

seedCentroids

List

n/a

yes

Optional initial centroids as a list of coordinate arrays. When provided, k must match the number of provided centroids.

concurrency

Integer

4

yes

The number of concurrent threads used for computation.

For more details on below Write configuration, refer to the Write documentation.
Table 4. Write configuration
Name Type Default Optional Description

nodeLabel

String

n/a

no

Node label in the in-memory graph for nodes whose properties will be written back.

outputTable

String

n/a

no

Table in Snowflake database to which node properties with cluster assignments are written.

writeProperty

String

'community'

yes

The property name to use when writing the cluster assignment back to the database.

The K-Means algorithm operates on node properties only and does not use relationship information in the clustering computation. However, relationships can still be loaded as part of the graph projection if needed for other purposes.

Examples

In this section, we will show examples of running the K-Means algorithm on a concrete graph. K-Means operates on nodes with numeric array properties, clustering them based on their similarity in the feature space defined by those properties.

Consider the following graph of City nodes with geographic coordinates representing latitude and longitude.

Visualization of the example graph
CREATE OR REPLACE TABLE EXAMPLE_DB.DATA_SCHEMA.NODES_CITY (NODEID VARCHAR, COORDINATES ARRAY);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'London',       ARRAY_CONSTRUCT(51.39148, -0.29825);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Manchester',   ARRAY_CONSTRUCT(53.41058, -2.97794);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Richmond',     ARRAY_CONSTRUCT(51.41259, -0.2974);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Plovdiv',      ARRAY_CONSTRUCT(42.68583, 26.32917);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Stockholm',    ARRAY_CONSTRUCT(59.36004, 18.00086);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Kristianstad', ARRAY_CONSTRUCT(56.28338, 13.27773);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODES_CITY SELECT 'Malmo',        ARRAY_CONSTRUCT(55.60587, 13.00073);

In this example, we want to use the K-Means algorithm to group cities based on their geographic coordinates. We will cluster the cities into 2 groups to identify cities that are geographically close to each other.

With the node table in Snowflake we can now project it as part of an algorithm job. In the following examples, we will demonstrate using the K-Means algorithm on this graph.

Run job

Running a K-Means job involves the three steps of Project, Compute and Write.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The following will run a K-Means job:
CALL Neo4j_Graph_Analytics.graph.kmeans('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': ['NODES_CITY'],
        'relationshipTables': {}
    },
    'compute': {
        'nodeProperty': 'COORDINATES',
        'k': 2,
        'randomSeed': 26,
        'maxIterations': 20,
    },
    'write': [{
        'nodeLabel': 'nodes_city',
        'outputTable': 'CITY_CLUSTERS'
    }]
});
Table 5. Results
JOB_ID JOB_STATUS JOB_START JOB_END JOB_RESULT

job_kmeans_abc123def456

SUCCESS

2025-10-22 15:30:45.123000

2025-10-22 15:30:52.456000

{
  "kmeans_1": {
    "averageDistanceToCentroid": 4.930111539422151,
    "averageSilhouette": 0,
    "centroids": [
      [
        52.07155,
        -1.1911966666666667
      ],
      [
        53.48378,
        17.6521225
      ]
    ],
    "communityDistribution": {
      "max": 4,
      "mean": 3.5,
      "min": 3,
      "p1": 3,
      "p10": 3,
      "p25": 3,
      "p5": 3,
      "p50": 3,
      "p75": 4,
      "p90": 4,
      "p95": 4,
      "p99": 4,
      "p999": 4
    },
    "computeMillis": 19,
    "configuration": {
      "computeSilhouette": false,
      "concurrency": 6,
      "deltaThreshold": 0.001,
      "initialSampler": "UNIFORM",
      "k": 2,
      "maxIterations": 20,
      "nodeLabels": [
        "*"
      ],
      "nodeProperty": "COORDINATES",
      "numberOfRestarts": 1,
      "randomSeed": 26,
      "relationshipTypes": [
        "*"
      ],
      "resultProperty": "community",
      "seedCentroids": []
    },
    "nodePropertiesWritten": 7
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 7,
    "nodeLabels": {
      "NODES_CITY": {
        "count": 7,
        "nodeId": {
          "dataType": "STRING"
        },
        "properties": {
          "COORDINATES": {
            "dataType": "DOUBLE_ARRAY"
          }
        },
        "table": "EXAMPLE_DB.DATA_SCHEMA.NODES_CITY"
      }
    },
    "nodeMillis": 138,
    "relationshipCount": 0,
    "relationshipMillis": 0,
    "relationshipTypes": {},
    "totalMillis": 138
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 955,
    "nodeLabel": "NODES_CITY",
    "nodeProperty": "community",
    "outputTable": "EXAMPLE_DB.DATA_SCHEMA.OUTPUT",
    "rowsWritten": 7,
    "stageUploadMillis": 581,
    "writeMillis": 1743
  }
}

The returned result contains information about the job execution and clustering results. The ranIterations field shows that the algorithm ran for 3 iterations before converging, as indicated by the didConverge field being true. The communityDistribution provides statistics about the size of each cluster, showing that one cluster has 3 cities and the other has 4 cities.

The centroids field shows the final cluster center coordinates. The first centroid at coordinates [51.022935, 22.165015] represents the center of one cluster, while the second centroid at [53.62078, 4.540974] represents the center of the other cluster. These centroids can be interpreted as the average geographic location of cities in each cluster.

The cluster assignments for each city have been written back to the Snowflake database. We can query the results to see which cities belong to which cluster.

SELECT * FROM EXAMPLE_DB.DATA_SCHEMA.CITY_CLUSTERS ORDER BY COMMUNITY;
Table 6. Results
NODEID COMMUNITY

London

0

Manchester

0

Richmond

0

Plovdiv

1

Stockholm

1

Kristianstad

1

Malmo

1

The results show that the cities have been grouped into two clusters. Cities assigned to cluster 0 tend to be located in the United Kingdom , while cities in cluster 1 are all located in Eastern Europe and Scandinavia. This demonstrates how K-Means can effectively identify geographic patterns in the data based on coordinate similarity.

We use default values for most of the procedure configuration parameters. The randomSeed parameter is set to 26 to produce the same result on every invocation for reproducibility. The k parameter is set to 2 to create two distinct clusters, and maxIterations is set to 20 as an upper bound, though the algorithm converged earlier.

Using multiple restarts

Since K-Means can converge to different local optima depending on the initial centroid placement, running it multiple times with different random initializations can help find better solutions. The numberOfRestarts parameter controls how many times the algorithm runs with different initializations, keeping the result with the lowest total distortion.

The following will run K-Means with 5 different random initializations:
CALL Neo4j_Graph_Analytics.graph.kmeans('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': ['NODES_CITY'],
        'relationshipTables': {}
    },
    'compute': {
        'nodeProperty': 'COORDINATES',
        'k': 2,
        'maxIterations': 20,
        'numberOfRestarts': 5
    },
    'write': [{
        'nodeLabel': 'nodes_city',
        'outputTable': 'CITY_CLUSTERS_RESTARTS'
    }]
});
Table 7. Results
JOB_ID JOB_STATUS JOB_START JOB_END JOB_RESULT

job_kmeans_restarts_def789

SUCCESS

2025-10-22 15:45:22.123000

2025-10-22 15:45:30.456000

{
  "kmeans_1": {
    "averageDistanceToCentroid": 4.930111539422151,
    "averageSilhouette": 0,
    "centroids": [
      [
        52.07155,
        -1.1911966666666667
      ],
      [
        53.48378,
        17.6521225
      ]
    ],
    "communityDistribution": {
      "max": 4,
      "mean": 3.5,
      "min": 3,
      "p1": 3,
      "p10": 3,
      "p25": 3,
      "p5": 3,
      "p50": 3,
      "p75": 4,
      "p90": 4,
      "p95": 4,
      "p99": 4,
      "p999": 4
    },
    "computeMillis": 23,
    "configuration": {
      "computeSilhouette": false,
      "concurrency": 6,
      "deltaThreshold": 0.001,
      "initialSampler": "UNIFORM",
      "k": 2,
      "maxIterations": 20,
      "nodeLabels": [
        "*"
      ],
      "nodeProperty": "COORDINATES",
      "numberOfRestarts": 5,
      "randomSeed": 26,
      "relationshipTypes": [
        "*"
      ],
      "resultProperty": "community",
      "seedCentroids": []
    },
    "nodePropertiesWritten": 7
  },
  "project_1": {
    "graphName": "snowgraph",
    "nodeCount": 7,
    "nodeLabels": {
      "NODES_CITY": {
        "count": 7,
        "nodeId": {
          "dataType": "STRING"
        },
        "properties": {
          "COORDINATES": {
            "dataType": "DOUBLE_ARRAY"
          }
        },
        "table": "KRAKEN_DB.GDS.NODES_CITY"
      }
    },
    "nodeMillis": 179,
    "relationshipCount": 0,
    "relationshipMillis": 0,
    "relationshipTypes": {},
    "totalMillis": 179
  },
  "write_node_property_1": {
    "copyIntoTableMillis": 1254,
    "nodeLabel": "NODES_CITY",
    "nodeProperty": "community",
    "outputTable": "KRAKEN_DB.GDS.OUTPUT",
    "rowsWritten": 7,
    "stageUploadMillis": 515,
    "writeMillis": 2047
  }
}

The algorithm ran 5 times with different initializations and kept the best result with the lowest total distortion across all nodes. The computeMillis value of 23 milliseconds represents the time spent computing the K-Means algorithm itself, while the total job execution took 202 milliseconds (179 ms for projection + 23 ms for compute). This demonstrates that running multiple restarts doesn’t necessarily increase the compute time significantly, as the restarts are processed efficiently within the algorithm. The best clustering result from all restarts is what gets written to the output table.

Using more restarts increases computation time but can improve the quality of the clustering by avoiding poor local optima. This is especially useful when you don’t have domain knowledge to provide good initial centroids or when the data has complex structure that makes it difficult to find good clusters with a single run. For production use cases where cluster quality is critical, setting numberOfRestarts to 5 or 10 is a common practice.