GDS with Neo4j cluster

This feature is not available in AuraDS.

It is possible to run GDS as part of a Neo4j cluster deployment. Since GDS performs large computations using the full resources of the system, it is not suitable to run on instances that serve the transactional workload of the cluster.

Deployment

In Neo4j 5.x, we make use of a Secondary instance to deploy the GDS library and process analytical workloads. Calls to GDS write procedures are internally directed via server-side routing to the cluster instance that acts as the Writer for the database we work on.

Neo4j 5.x allows different databases on the same cluster instance to act as Primary or Secondary members of the cluster. In order for GDS to function, all databases on the instance where it is installed have to be Secondary, including the system database (see server.cluster.system_database_mode and initial.server.mode_constraint). GDS runs compute-intensive OLAP workloads that may disrupt cluster operations, so we recommend installing GDS on an instance that does not serve transactional load and does not participate in Leader elections.
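As a minimal sketch, assuming Neo4j 5.x setting names as mentioned above, the relevant constraints in neo4j.conf on the analytics instance would look something like the following; verify the exact names and values against the configuration reference for your version.

  # Constrain this server to host only Secondary copies of all databases,
  # including the system database.
  initial.server.mode_constraint=SECONDARY
  server.cluster.system_database_mode=SECONDARY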

Please refer to the official Neo4j documentation for details on how to set up a Neo4j analytics cluster. Note that the link points to the latest Neo4j version documentation and the configuration settings may differ in earlier versions.

  • The cluster must contain at least one Secondary instance

    • A single Primary and a single Secondary is a valid scenario.

    • GDS workloads are not load-balanced if there is more than one Secondary instance.

  • The cluster should be configured to use server-side routing.

  • The GDS plugin must be deployed on the Secondary.

    • A valid GDS Enterprise Edition license must be installed and configured on the Secondary.

    • The driver connection used to operate GDS should be made using the bolt:// protocol directly to the Secondary instance (see the sketch after this list).
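As an illustration of the last point, the sketch below uses the official Neo4j Python driver to connect directly to the Secondary over bolt:// and to call a GDS write procedure, whose write-back is then routed to the Writer as described above. The host name, credentials, database, graph name and algorithm are placeholders, not values prescribed by this section.

  from neo4j import GraphDatabase

  # Connect directly to the Secondary instance over bolt://
  # (placeholder host, port and credentials).
  driver = GraphDatabase.driver(
      "bolt://gds-secondary.example.com:7687",
      auth=("neo4j", "secret"),
  )

  with driver.session(database="neo4j") as session:
      # Assumes a graph named 'myGraph' has already been projected.
      # The write performed by this GDS procedure is directed to the
      # Writer for the database via server-side routing.
      session.run(
          "CALL gds.pageRank.write('myGraph', { writeProperty: 'pagerank' })"
      ).consume()

  driver.close()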

For more information on setting up, configuring and managing a Neo4j cluster, please refer to the documentation.

When working with cluster configuration, you should be aware of strict config validation in Neo4j.

When configuring GDS for a Secondary instance, you will introduce GDS-specific configuration into neo4j.conf. That is fine on the Secondary, because with the GDS plugin installed, Neo4j can validate those configuration items.

However, you might not be able to reuse that same configuration file verbatim on the other cluster members, because the GDS plugin will not be installed there, so Neo4j will not be able to validate the GDS-specific configuration items. A validation failure means Neo4j will refuse to start.

It is of course also possible to turn strict validation off.
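As a rough sketch, the GDS-specific part of neo4j.conf on the Secondary might contain entries such as the following. The license file path is a placeholder, and server.config.strict_validation.enabled is the Neo4j 5.x name of the strict validation setting; check the configuration reference for your version before relying on it.

  # GDS-specific setting, only recognised where the GDS plugin is installed.
  gds.enterprise.license_file=/path/to/gds.license

  # Optionally relax strict config validation on members without the plugin
  # (Neo4j 5.x setting name; assumed here, verify for your version).
  server.config.strict_validation.enabled=false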

In a Neo4j 4.x Causal Cluster, we make use of a Read Replica instance to deploy the GDS library and process analytical workloads. Calls to GDS write procedures are internally directed via server-side routing to the cluster LEADER instance.

Please refer to the official Neo4j documentation for details on how to set up a Neo4j Causal Cluster. Note that the link points to the latest Neo4j 4.x documentation and the configuration settings may differ in earlier versions.

  • The cluster must contain at least one Read Replica instance

    • A single Core member and a single Read Replica is a valid scenario.

    • GDS workloads are not load-balanced if there is more than one Read Replica instance.

  • The cluster should be configured to use server-side routing.

  • The GDS plugin must be deployed on the Read Replica (see the configuration sketch below).

    • A valid GDS Enterprise Edition license must be installed and configured on the Read Replica.

    • The driver connection used to operate GDS should be made using the bolt:// protocol, or be server-policy routed, to the Read Replica instance.

For more information on setting up, configuring and managing a Neo4j Causal Cluster, please refer to the documentation.
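As a rough sketch under the assumptions above, the relevant part of neo4j.conf on the Read Replica could look as follows; dbms.mode=READ_REPLICA is the Neo4j 4.x Causal Clustering mode setting, and the license file path is a placeholder.

  # Run this instance as a Read Replica (Neo4j 4.x Causal Clustering).
  dbms.mode=READ_REPLICA

  # GDS Enterprise Edition license, required on the instance running GDS.
  gds.enterprise.license_file=/path/to/gds.license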

GDS Configuration

The following optional settings can be used to control transaction size.

Property                    Default
gds.cluster.tx.min.size     10000
gds.cluster.tx.max.size     100000

The batch size for writing node properties is computed from both values, along with the configured concurrency and the total node count. The batch size for writing relationships uses the lower of the two values. Some procedures support a batch size configuration of their own, which takes precedence if present in the procedure call parameters.
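For example, to make GDS use smaller write transactions on the analytics instance, these settings could be lowered in neo4j.conf; the values below are purely illustrative, the defaults being 10000 and 100000 as listed above.

  # Illustrative values only.
  gds.cluster.tx.min.size=1000
  gds.cluster.tx.max.size=10000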