# Logistic regression

This feature is in the beta tier. For more information on feature tiers, see API Tiers.

Logistic regression is a fundamental supervised machine learning classification method. It trains a model by minimizing a loss function which depends on a weight matrix and on the training data. The loss can be minimized, for example, using gradient descent. In GDS we use the Adam optimizer, which is a gradient-descent-type algorithm.

The weights are in the form of a `[c,d]` sized matrix `W` and a bias vector `b` of length `c`, where `d` is the feature dimension and `c` is equal to the number of classes.
The loss function is then defined as:

`CE(softmax(Wx + b))`

where `CE` is the cross entropy loss, `softmax` is the softmax function, and `x` is a feature vector training sample of length `d`.
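As a rough sketch of the loss above (not the GDS implementation, which operates on matrices and batches), the forward pass `CE(softmax(Wx + b))` for a single training sample could look like:

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def sample_loss(W, b, x, true_class):
    # Logits Wx + b, where W is a c x d matrix and b a length-c bias vector.
    logits = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
              for i in range(len(b))]
    probs = softmax(logits)
    # Cross entropy reduces to the negative log-probability of the true class.
    return -math.log(probs[true_class])
```

A correctly classified sample (high probability on the true class) yields a small loss, a misclassified one a large loss.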

To avoid overfitting, one may also add a regularization term to the loss.
Neo4j Graph Data Science supports the option of `l2` regularization, which can be configured using the `penalty` parameter.
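As an illustrative sketch (the function name is an assumption, not a GDS internal), the `l2` term scaled by `penalty` could be added to a base loss like this:

```python
def l2_regularized_loss(base_loss, W, penalty):
    # l2 regularization adds penalty * (sum of squared weights) to the loss.
    # With penalty = 0 the regularization term has no effect at all.
    return base_loss + penalty * sum(w * w for row in W for w in row)
```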

## Tuning the hyperparameters

In order to balance matters such as bias vs variance of the model, and speed vs memory consumption of the training, GDS exposes several hyperparameters that one can tune. Each of these is described below.

In gradient-descent-based training, we try to find the best weights for our model. In each epoch we process all training examples to compute the loss and the gradient of the weights. These gradients are then used to update the weights. For the update we use the Adam optimizer as described in https://arxiv.org/pdf/1412.6980.pdf.
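A minimal sketch of a single Adam update for one scalar weight, following the linked paper (default coefficients assumed; GDS applies this to the whole weight matrix):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single weight (Kingma & Ba, 2015).
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments,
    v_hat = v / (1 - beta2 ** t)                # t is the step count (1-based)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

The moment estimates `m` and `v` persist across steps; a positive gradient moves the weight downwards, scaled by the learning rate.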

Statistics about the training are reported in the neo4j debug log.

### Max Epochs

This parameter defines the maximum number of epochs for the training. Independent of the model’s quality, the training will terminate after at most this many epochs. Note that the training can also stop earlier if the loss converges (see Patience and Tolerance).

Setting this parameter can be useful to limit the training time for a model. Restricting the computational budget can serve the purpose of regularization and mitigate overfitting, which becomes a risk with a large number of epochs.

### Min Epochs

This parameter defines the minimum number of epochs for the training. Independent of the model’s quality, the training will run at least this many epochs.

Setting this parameter can be useful to avoid early stopping, but also increases the minimal training time of a model.

### Patience

This parameter defines the maximum number of unproductive consecutive epochs.
An epoch is unproductive if it does not improve the training loss by at least a `tolerance` fraction of the current loss.

Assuming the training ran for at least `minEpochs`, this parameter defines when the training converges.

Setting this parameter can lead to a more robust training and avoid early termination, similar to `minEpochs`.
However, a high patience can result in running more epochs than necessary.

In our experience, reasonable values for `patience` are in the range `1` to `3`.

### Tolerance

This parameter defines when an epoch is considered unproductive and, together with `patience`, defines the convergence criterion for the training.
An epoch is unproductive if it does not improve the training loss by at least a `tolerance` fraction of the current loss.

A lower tolerance results in more sensitive training, with a higher probability of training longer. A higher tolerance means less sensitive training, since more epochs are counted as unproductive.
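The interplay of `patience` and `tolerance` can be sketched as follows (an illustrative convergence check, not the exact GDS implementation):

```python
def has_converged(losses, patience, tolerance):
    # An epoch is unproductive if it fails to improve the best loss so far
    # by at least a `tolerance` fraction of that loss. Training converges
    # after `patience` consecutive unproductive epochs.
    best = losses[0]
    unproductive = 0
    for loss in losses[1:]:
        if loss < best * (1 - tolerance):
            best = loss
            unproductive = 0
        else:
            unproductive += 1
            if unproductive >= patience:
                return True
    return False
```

With a high tolerance, small improvements no longer count as progress, so the unproductive-epoch counter reaches `patience` sooner.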

### Learning rate

When updating the weights, we move in the direction dictated by the Adam optimizer, based on the loss function’s gradients.
How far we move per weight update can be configured via the `learningRate` parameter.

### Batch size

This parameter defines how many training examples are grouped in a single batch.

The gradients are computed concurrently on the batches using `concurrency` many threads.
At the end of an epoch the gradients are summed and scaled before updating the weights.
The `batchSize` does not affect the model quality, but can be used to tune for training speed.
A larger `batchSize` increases the memory consumption of the computation.
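A sequential sketch of the sum-and-scale step above (GDS computes the per-batch gradients on `concurrency` threads; the helper names here are assumptions):

```python
def epoch_gradient(examples, batch_size, grad_fn):
    # Split the training examples into batches, accumulate each batch's
    # gradient contribution, then scale by the total number of examples.
    # The result is independent of batch_size, which is why batchSize
    # affects speed and memory but not model quality.
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]
    total = 0.0
    for batch in batches:
        total += sum(grad_fn(x) for x in batch)
    return total / len(examples)
```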

### Penalty

This parameter defines the influence of the regularization term in the loss function. While the regularization can avoid overfitting, a high value can even lead to underfitting. The minimal value is zero, where the regularization term has no effect at all.

### Class weights

This parameter introduces the concept of *class weights*, studied in 'Focal Loss for Dense Object Detection' by T. Lin et al.
It is often called *balanced cross entropy*.
It assigns a weight to each class in the cross-entropy loss function, thus allowing the model to treat different classes with varying importance.
It is defined for each example as:

`-a_t * log(p_t)`

where `a_t` denotes the class weight of the true class and `p_t` denotes the probability of the true class.

For class-imbalanced problems, the class weights are often set to the inverse of class frequencies to improve the inductive bias of the model on minority classes.
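The inverse-frequency heuristic and the balanced cross-entropy term can be sketched as follows (illustrative helpers, not GDS internals):

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    # One common heuristic: weight each class by the inverse of its
    # relative frequency, so minority classes contribute more to the loss.
    counts = Counter(labels)
    return {c: len(labels) / n for c, n in counts.items()}

def balanced_cross_entropy(p_true, weight_true):
    # Balanced cross entropy for one example: -a_t * log(p_t).
    return -weight_true * math.log(p_true)
```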

#### Usage in link prediction

For link prediction, it must be a list of length 2 where the first weight is for negative examples (missing relationships) and the second for positive examples (actual relationships).

#### Usage in node classification

For node classification, the `i`-th weight is for the `i`-th class, ordered by the class values (which must be integers). For example, if your node classification dataset has three classes 0, 1, and 42, then the class weights must be of length 3, and the third weight is applied to class 42.

### Focus weight

This parameter introduces the concept of *focal loss*, again studied in 'Focal Loss for Dense Object Detection' by T. Lin et al.
When `focusWeight` is a value greater than zero, the loss function changes from standard cross-entropy loss to focal loss.
It is defined for each example as:

`-(1 - p_t)^g * log(p_t)`

where `p_t` denotes the probability of the true class.
The `focusWeight` parameter is the exponent noted as `g`.

Increasing `focusWeight` will guide the model towards trying to fit "hard" misclassified examples.
A hard misclassified example is an example for which the model has a low predicted probability for the true class.
In the above equation, the loss will be exponentially higher for low-true-class-probability examples, thus tuning the model towards trying to fit them, at the expense of potentially being less confident on "easy" examples.
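The focal loss term for a single example can be sketched directly from its definition (with `focusWeight` as the exponent `g`; at `g = 0` it reduces to standard cross entropy):

```python
import math

def focal_loss(p_true, focus_weight):
    # Focal loss for one example: -(1 - p_t)^g * log(p_t).
    # Easy examples (p_t close to 1) are down-weighted as g grows,
    # so hard examples dominate the loss.
    return -((1.0 - p_true) ** focus_weight) * math.log(p_true)
```

Comparing the loss of a hard example (`p_t = 0.1`) to an easy one (`p_t = 0.9`) shows that a positive `focusWeight` widens the gap between them.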

In class-imbalanced datasets, the minority class(es) are typically harder to classify correctly. Read more about class imbalance for Link Prediction in Class Imbalance.