Common usage

The GDS library usage pattern is typically split in two phases: development and production.

Development phase

The goal of the development phase is to establish a workflow of useful algorithms and machine learning pipelines. This phase involves configuring the system, defining graph projections, selecting the appropriate algorithms, and running machine learning experiments. It is typical to make use of the memory estimation features of the library. This enables you to successfully configure your system to handle the amount of data to be processed. There are three kinds of resources to keep in mind: the projected graph, the algorithm data structures, and the machine learning setup.

Machine learning pipelines

Developing a successful machine learning pipeline with Neo4j Graph Data Science typically involves experimenting with the following steps:

  • selecting training methods

  • selecting algorithms to produce graph features

  • selecting embedding algorithms to produce node embeddings

  • tuning parameters of training methods

  • tuning parameters of embedding algorithms

  • configuring pipeline training parameters

  • using graph sampling to train model candidates on data subsets

Production phase

In the production phase, the system is configured to run the desired algorithms and pipelines successfully and reliably. The sequence of operations would normally be one of:

  • project a graph → run one or more algorithms on the projection → consume results

  • project a graph → configure a machine learning pipeline → train a machine learning model

  • project a graph → compute predictions using a previously trained machine learning model

General considerations

The below image illustrates an overview of standard operation of the GDS library:

projected graph model

In this image, machine learning pipelines are included in the Algorithms category.

The GDS library runs its procedures greedily in terms of system resources. That means that each procedure will try to use:

  • as much memory as it needs (see Memory estimation)

  • as many CPU cores as it needs (not exceeding the limits of the concurrency it’s configured to run with)

Concurrently running procedures share the resources of the system hosting the DBMS and as such may affect each other’s performance. To get an overview of the status of the system you can use the System monitor procedure.

For more detail on the core operations in GDS, see the corresponding section: