Multilayer Perceptron

This feature is in the alpha tier. For more information on feature tiers, see Operations reference.

A Multilayer Perceptron (MLP) is a type of feed-forward neural network. It consists of multiple layers of connected neurons. The value of a neuron is computed by applying an activation function to the aggregated, weighted inputs from the previous layer. For classification, the size of the output layer is based on the number of classes. To optimize the weights of the network, GDS uses gradient descent with a cross-entropy loss.
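
As an illustration only (a Python sketch of the general idea, not GDS internals; the ReLU activation and the helper names are assumptions made for this example), the following shows a forward pass through a small MLP and the cross-entropy loss for a single training example:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(z):
        e = np.exp(z - z.max())          # shift for numerical stability
        return e / e.sum()

    def forward(features, weights, biases):
        # Each neuron applies an activation function to the aggregated,
        # weighted inputs coming from the previous layer.
        activation = features
        for W, b in zip(weights[:-1], biases[:-1]):
            activation = relu(W @ activation + b)
        # The output layer has one neuron per class; softmax turns the
        # raw values into class probabilities.
        return softmax(weights[-1] @ activation + biases[-1])

    def cross_entropy(class_probabilities, true_class):
        # The loss that gradient descent minimizes during training.
        return -np.log(class_probabilities[true_class])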

1. Tuning the hyperparameters

In order to balance matters such as the bias versus variance of the model, and the speed versus memory consumption of the training, GDS exposes several hyperparameters that can be tuned. Each of these is described below.

In gradient descent based training, we try to find the best weights for our model. In each epoch we process all training examples to compute the loss and the gradient of the weights. These gradients are then used to update the weights. For the update we use the Adam optimizer as described in https://arxiv.org/pdf/1412.6980.pdf.
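
As a rough sketch of this procedure (Python for illustration only; loss_and_gradient is a hypothetical callback, and the Adam constants follow the defaults in the paper linked above, not necessarily those used by GDS):

    import numpy as np

    def adam_train(weights, loss_and_gradient, max_epochs, learning_rate,
                   beta1=0.9, beta2=0.999, eps=1e-8):
        m = np.zeros_like(weights)       # first-moment estimate
        v = np.zeros_like(weights)       # second-moment estimate
        for epoch in range(1, max_epochs + 1):
            # One epoch: process all training examples to compute the loss
            # and the gradient of the weights.
            loss, gradient = loss_and_gradient(weights)
            # Adam update: bias-corrected moment estimates, scaled by the
            # learning rate, determine how far the weights move.
            m = beta1 * m + (1 - beta1) * gradient
            v = beta2 * v + (1 - beta2) * gradient ** 2
            m_hat = m / (1 - beta1 ** epoch)
            v_hat = v / (1 - beta2 ** epoch)
            weights = weights - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
        return weights

For example, adam_train(np.array([5.0]), lambda w: ((w ** 2).sum(), 2 * w), 200, 0.1) drives the single weight towards the minimum at zero.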

Statistics about the training are reported in the Neo4j debug log.

1.1. Max Epochs

This parameter defines the maximum number of epochs for the training. Independent of the model’s quality, the training will terminate after this many epochs. Note that the training can also stop earlier if the loss converges (see Patience and Tolerance).

Setting this parameter can be useful to limit the training time for a model. Restricting the computational budget can serve the purpose of regularization and mitigate overfitting, which becomes a risk with a large number of epochs.

1.2. Min Epochs

This parameter defines the minimum number of epochs for the training. Independent of the model’s quality, the training will run for at least this many epochs.

Setting this parameter can be useful to avoid early stopping, but also increases the minimal training time of a model.

1.3. Patience

This parameter defines the maximum number of unproductive consecutive epochs. An epoch is unproductive if it does not improve the training loss by at least a tolerance fraction of the current loss.

Assuming the training has run for at least minEpochs epochs, this parameter defines when the training converges.

Setting this parameter can lead to more robust training and avoid early termination, similar to minEpochs. However, a high patience value can result in running more epochs than necessary.

In our experience, reasonable values for patience are in the range 1 to 3.

1.4. Tolerance

This parameter defines when an epoch is considered unproductive and together with patience defines the convergence criteria for the training. An epoch is unproductive if it does not improve the training loss by at least a tolerance fraction of the current loss.

A lower tolerance results in more sensitive training, with a higher probability of training longer. A higher tolerance means less sensitive training, where more epochs are counted as unproductive.
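
To make the interplay of maxEpochs, minEpochs, patience and tolerance concrete, here is a sketch of such a stopping criterion (Python for illustration only, not the exact GDS implementation):

    def should_stop(epoch, losses, max_epochs, min_epochs, patience, tolerance):
        # `losses` holds the training loss recorded after each epoch so far.
        if epoch >= max_epochs:
            return True                          # never exceed maxEpochs
        if epoch < min_epochs:
            return False                         # always run at least minEpochs
        # Count the trailing streak of unproductive epochs. An epoch is
        # unproductive if it did not improve the loss by at least a
        # tolerance fraction of the current loss.
        unproductive = 0
        for previous, current in zip(losses, losses[1:]):
            if previous - current < tolerance * current:
                unproductive += 1                # improvement too small
            else:
                unproductive = 0                 # productive epoch resets the streak
        return unproductive >= patience          # converged

For example, with tolerance set to 0.001, an epoch that lowers the loss from 100.0 to 99.95 is still counted as unproductive, because the improvement of 0.05 is less than 0.1% of the loss.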

1.5. Learning rate

When updating the weights, we move in the direction dictated by the Adam optimizer based on the loss function’s gradients. How far we move per weight update is configured via the learningRate parameter.

1.6. Batch size

This parameter defines how many training examples are grouped in a single batch.

The gradients are computed concurrently on the batches, using concurrency many threads. At the end of an epoch, the gradients are summed and scaled before updating the weights. The batchSize does not affect the model quality, but it can be used to tune for training speed. A larger batchSize increases the memory consumption of the computation.
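
A sketch of this per-epoch batching (Python for illustration only; batch_gradient is a hypothetical callback returning the summed gradient of one batch, and the scaling shown is an assumption of the example):

    from concurrent.futures import ThreadPoolExecutor

    def epoch_gradient(weights, examples, batch_size, concurrency, batch_gradient):
        # Split the training examples into groups of batchSize examples.
        batches = [examples[i:i + batch_size]
                   for i in range(0, len(examples), batch_size)]
        # Compute the per-batch gradients concurrently, on up to
        # `concurrency` threads.
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            gradients = list(pool.map(lambda batch: batch_gradient(weights, batch),
                                      batches))
        # Sum the batch gradients and scale by the total number of examples
        # before the weights are updated.
        return sum(gradients) / len(examples)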

1.7. Penalty

This parameter defines the influence of the regularization term in the loss function. While regularization can help avoid overfitting, a value that is too high can lead to underfitting. The minimal value is zero, where the regularization term has no effect at all.
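
In other words, the loss minimized during training can be thought of as the cross-entropy loss plus penalty times a regularization term, along the lines of the following sketch (Python for illustration; the squared L2 norm of the weights is an assumption of this example, not a statement about the exact GDS formula):

    import numpy as np

    def regularized_loss(cross_entropy_loss, weights, penalty):
        # With penalty = 0 the regularization term has no effect;
        # larger values shrink the weights more aggressively.
        return cross_entropy_loss + penalty * np.sum(weights ** 2)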

1.8. HiddenLayerSizes

This parameter defines the shape of the neural network. Each entry represents the number of neurons in a hidden layer, and the length of the list defines the number of hidden layers. Deeper and larger networks can in theory approximate more complex decision surfaces, at the expense of having more weights (and biases) that need to be trained.
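
For example (Python sketch, illustrative names only): with 10 input features, 3 classes and hiddenLayerSizes set to [64, 16], the network has two hidden layers, and its weight matrices have the shapes computed below.

    def layer_shapes(feature_dimension, hidden_layer_sizes, class_count):
        # One weight matrix per layer:
        # (neurons in this layer, neurons in the previous layer).
        sizes = [feature_dimension] + list(hidden_layer_sizes) + [class_count]
        return [(current, previous) for previous, current in zip(sizes, sizes[1:])]

    print(layer_shapes(10, [64, 16], 3))   # [(64, 10), (16, 64), (3, 16)]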