Logistic regression

This feature is in the beta tier. For more information on feature tiers, see Operations reference.

Logistic regression is a fundamental supervised machine learning classification method. It trains a model by minimizing a loss function that depends on a weight matrix and on the training data. The loss can be minimized using, for example, gradient descent. GDS uses the Adam optimizer, which is a gradient-descent-type algorithm.

The weights take the form of a matrix W of size [c, d] and a bias vector b of length c, where d is the feature dimension and c is the number of classes. The loss function is then defined as:

CE(softmax(Wx + b))

where CE is the cross entropy loss, softmax is the softmax function, and x is a feature vector training sample of length d.
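As a rough illustration of the formula above, the following pure-Python sketch evaluates CE(softmax(Wx + b)) for a single sample. The function names and the toy values are illustrative assumptions and are not part of the GDS API.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(probs[true_class])

def loss_for_sample(W, b, x, true_class):
    """CE(softmax(Wx + b)) for one sample; W has c rows of d weights each."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return cross_entropy(softmax(scores), true_class)

# Hypothetical toy values: c = 2 classes, d = 3 features.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [0.0, 0.0]
x = [1.0, 2.0, 3.0]
zero_weight_loss = loss_for_sample(W, b, x, true_class=1)
```

With all weights at zero, both classes receive probability 0.5, so the loss equals log(2). Training lowers this value by moving W and b so that the true class receives a higher probability.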

To avoid overfitting, one may also add a regularization term to the loss. Neo4j Graph Data Science supports L2 regularization, which can be configured using the penalty parameter.

1. Tuning the hyperparameters

In order to balance matters such as bias vs variance of the model, and speed vs memory consumption of the training, GDS exposes several hyperparameters that one can tune. Each of these is described below.

In gradient-descent-based training, we try to find the best weights for our model. In each epoch, we process all training examples to compute the loss and the gradient of the loss with respect to the weights. These gradients are then used to update the weights. For the update, we use the Adam optimizer as described in https://arxiv.org/pdf/1412.6980.pdf.
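For the cross-entropy-of-softmax loss, the gradient for a single sample has a simple closed form: the derivative with respect to the class scores is the softmax output minus the one-hot encoded label. The sketch below computes the resulting gradients for W and b in plain Python. The helper names are hypothetical, and the code does not reflect how GDS organizes the computation internally.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gradients(W, b, x, true_class):
    """Gradients of CE(softmax(Wx + b)) with respect to W and b for one sample."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(scores)
    # d(loss)/d(score_k) is p_k - 1 for the true class and p_k otherwise.
    delta = [p - (1.0 if k == true_class else 0.0)
             for k, p in enumerate(probs)]
    grad_W = [[d_k * xi for xi in x] for d_k in delta]  # same [c, d] shape as W
    grad_b = delta                                      # same length c as b
    return grad_W, grad_b

# Hypothetical toy values: c = 2 classes, d = 2 features, all weights zero.
grad_W, grad_b = sample_gradients(
    [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 2.0], true_class=0)
```

In this toy case the gradient row for the true class is negative, so a descent step increases the score of the true class and decreases the score of the other class.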

Statistics about the training are reported in the Neo4j debug log.

1.1. Max Epochs

This parameter defines the maximum number of epochs for the training. Independent of the model’s quality, the training will terminate after this many epochs. Note that the training can also stop earlier if the loss has converged (see Patience and Tolerance).

Setting this parameter can be useful to limit the training time for a model. Restricting the computational budget can serve the purpose of regularization and mitigate overfitting, which becomes a risk with a large number of epochs.

1.2. Min Epochs

This parameter defines the minimum number of epochs for the training. Independent of the model’s quality, the training will run for at least this many epochs.

Setting this parameter can be useful to avoid early stopping, but also increases the minimal training time of a model.

1.3. Patience

This parameter defines the maximum number of unproductive consecutive epochs. An epoch is unproductive if it does not improve the training loss by at least a tolerance fraction of the current loss.

Assuming the training ran for minEpochs, this parameter defines when the training converges.

Increasing this parameter can make the training more robust and, like minEpochs, helps avoid premature termination. However, a high patience can result in running more epochs than necessary.

In our experience, reasonable values for patience are in the range 1 to 3.

1.4. Tolerance

This parameter defines when an epoch is considered unproductive and together with patience defines the convergence criteria for the training. An epoch is unproductive if it does not improve the training loss by at least a tolerance fraction of the current loss.

A lower tolerance makes the training more sensitive to small improvements, so it is more likely to train longer. A higher tolerance makes the training less sensitive, so more epochs are counted as unproductive.
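The interplay of minEpochs, maxEpochs, patience and tolerance can be sketched as a small stopping rule. This is one plausible reading of the descriptions above, not the actual GDS implementation: it assumes the improvement is compared against a tolerance fraction of the previous epoch's loss and that unproductive epochs are also counted before minEpochs is reached. The function name and the loss values are hypothetical.

```python
def epochs_until_stop(losses, min_epochs, max_epochs, patience, tolerance):
    """Return how many epochs run before training stops.

    `losses` holds the training loss observed after each epoch.
    """
    unproductive = 0   # consecutive unproductive epochs seen so far
    previous = None    # loss of the preceding epoch
    epochs_run = 0
    for loss in losses[:max_epochs]:  # never run more than max_epochs
        epochs_run += 1
        if previous is not None:
            improvement = previous - loss
            # Assumption: "current loss" means the loss before this epoch.
            if improvement < tolerance * abs(previous):
                unproductive += 1
            else:
                unproductive = 0  # a productive epoch resets the counter
        previous = loss
        if epochs_run >= min_epochs and unproductive >= patience:
            break  # converged
    return epochs_run

# Hypothetical loss curve that flattens out after the third epoch.
losses = [1.0, 0.5, 0.4, 0.3999, 0.3998, 0.3997]
stop_patience_1 = epochs_until_stop(losses, 1, 100, patience=1, tolerance=0.001)
stop_patience_2 = epochs_until_stop(losses, 1, 100, patience=2, tolerance=0.001)
stop_min_6 = epochs_until_stop(losses, 6, 100, patience=2, tolerance=0.001)
stop_max_3 = epochs_until_stop(losses, 1, 3, patience=2, tolerance=0.001)
```

In this sketch, a higher patience or a higher minEpochs delays the stop, while maxEpochs caps the run regardless of how the loss behaves.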

1.5. Learning rate

When updating the weights, we move in the direction dictated by the Adam optimizer based on the loss function’s gradients. The learningRate parameter configures how far we move per weight update.
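To make the role of the learning rate concrete, here is a minimal sketch of a single Adam update following the update rule in the Adam paper (https://arxiv.org/pdf/1412.6980.pdf). The values of beta1, beta2 and eps are the defaults suggested in that paper; whether GDS uses the same values is not stated here, so treat them as assumptions. The function name and the flat-list representation of the weights are illustrative only.

```python
import math

def adam_step(weights, grads, m, v, t, learning_rate,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists; returns the new (weights, m, v).

    m and v are the running first and second moment estimates,
    t is the 1-based index of the update step.
    """
    new_w, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weights, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # moving average of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g   # moving average of squared gradients
        m_hat = m_i / (1 - beta1 ** t)            # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w - learning_rate * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# First update from weight 1.0 with a small and a large gradient.
small, _, _ = adam_step([1.0], [2.0], [0.0], [0.0], t=1, learning_rate=0.05)
large, _, _ = adam_step([1.0], [200.0], [0.0], [0.0], t=1, learning_rate=0.05)
```

On the first step, the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1, so both calls move the weight by roughly learning_rate even though the gradients differ by a factor of 100. This is why the learning rate, rather than the raw gradient magnitude, largely determines the step size.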

1.6. Batch size

This parameter defines how many training examples are grouped in a single batch.

The gradients are computed concurrently on the batches, using as many threads as specified by the concurrency parameter. At the end of an epoch, the gradients are summed and scaled before updating the weights. The batchSize does not affect the model quality, but can be used to tune the training speed. A larger batchSize increases the memory consumption of the computation.
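The following sketch illustrates why the batch size does not change the result in this scheme: per-batch gradients are computed on a thread pool, summed, and scaled once per epoch. To keep it short, it uses a toy one-weight squared-error model instead of logistic regression, and it assumes that "scaled" means division by the number of examples. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_gradient(batch, weight):
    """Sum of per-example gradients of 0.5 * (weight * x - y)^2 over one batch."""
    return sum((weight * x - y) * x for x, y in batch)

def epoch_gradient(examples, weight, batch_size, concurrency):
    """Split the examples into batches, process them on threads, then combine."""
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        partial_sums = list(
            pool.map(lambda batch: batch_gradient(batch, weight), batches))
    # Assumption: the summed gradient is scaled by the number of examples.
    return sum(partial_sums) / len(examples)

# Hypothetical (x, y) training examples.
examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
grad_batch_1 = epoch_gradient(examples, 1.0, batch_size=1, concurrency=2)
grad_batch_2 = epoch_gradient(examples, 1.0, batch_size=2, concurrency=2)
grad_batch_4 = epoch_gradient(examples, 1.0, batch_size=4, concurrency=1)
```

All three calls produce the same epoch gradient, so only the way the work is split across threads, and the memory held per batch, depends on the batch size.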

1.7. Penalty

This parameter defines the influence of the regularization term in the loss function. While regularization helps avoid overfitting, too high a value can lead to underfitting. The minimal value is zero, in which case the regularization term has no effect at all.
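A minimal sketch of how an L2 term could enter the loss, assuming the common form in which the penalty multiplies the sum of squared entries of W. Whether GDS also penalizes the bias, or applies an additional scaling factor, is not stated here, so those details are assumptions. The function names and values are hypothetical.

```python
def l2_penalty_term(W, penalty):
    """penalty times the sum of squared weights (bias excluded by assumption)."""
    return penalty * sum(w * w for row in W for w in row)

def regularized_loss(data_loss, W, penalty):
    """Data loss, for example the cross entropy, plus the L2 term."""
    return data_loss + l2_penalty_term(W, penalty)

# Hypothetical weights whose squared entries sum to 25.
W = [[3.0, 4.0], [0.0, 0.0]]
no_penalty = regularized_loss(0.7, W, penalty=0.0)    # regularization disabled
with_penalty = regularized_loss(0.7, W, penalty=0.5)  # adds 0.5 * 25 to the loss
```

With a penalty of zero the loss is unchanged. With a large penalty the L2 term dominates the data loss, which pushes the optimizer toward small weights and can cause the underfitting mentioned above.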