Essential metrics

This chapter describes some essential metrics to monitor in Neo4j.

To ensure your applications are running smoothly, it is good to monitor:

  • The server load — the strain on the machine hosting Neo4j.

  • The Neo4j load — the strain on Neo4j.

  • The cluster health — to ensure the cluster is working as expected.

  • The workload of a Neo4j instance.

Reading the Performance section is recommended to better understand the metrics.

1. Server load metrics

Monitoring the hardware resources shows the strain on the server running Neo4j.

You can use utilities, such as the collectd daemon or systemd on Linux, to gather information about the system. These metrics can help with capacity planning as your workload grows.

Metric name Description

CPU usage

If this is reaching 100%, you may need additional CPU capacity.

Used memory

This metric tells you if you are close to using all available memory on the server. Make sure your peaks are at 95% or below to reduce the risk of running out of memory. For more information, see Memory configuration.

Free disk space

Observe the rate of your data growth so you can plan for additional storage before you run out. This applies to all disks that Neo4j is writing to. You might also choose to write the log files to a different disk.
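The server load checks above can be sketched in a few lines of Python. This is only an illustration, not part of Neo4j: the function names, the 30-day headroom default, and the way the readings are passed in are all assumptions, and the CPU and memory percentages are expected to come from whatever collector you already use.

```python
import shutil
from typing import List


def free_bytes_on(path: str) -> int:
    """Return the free space, in bytes, of the disk that holds `path`."""
    return shutil.disk_usage(path).free


def days_until_full(free_bytes: int, growth_bytes_per_day: float) -> float:
    """Estimate how many days remain before the disk runs out of space."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # no observed growth, so no projected exhaustion
    return free_bytes / growth_bytes_per_day


def server_load_warnings(
    cpu_percent: float,
    peak_memory_percent: float,
    free_bytes: int,
    growth_bytes_per_day: float,
    min_days_of_headroom: float = 30.0,  # hypothetical planning horizon
) -> List[str]:
    """Collect a warning for each server load metric that is out of range."""
    warnings = []
    if cpu_percent >= 100.0:
        warnings.append("CPU usage is reaching 100%")
    if peak_memory_percent > 95.0:
        warnings.append("memory peaks are above 95%")
    if days_until_full(free_bytes, growth_bytes_per_day) < min_days_of_headroom:
        warnings.append("disk may fill up within the planning horizon")
    return warnings
```

Run the disk check once per disk that Neo4j writes to, for example with free_bytes_on() pointed at the data directory and again at the log directory if the logs live on a different disk.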

2. Neo4j load metrics

The Neo4j load metrics monitor the strain that Neo4j is being put under. They can help with capacity planning.

Metric name Metric Description

Heap usage

vm.heap.used and vm.heap.total

If Neo4j consistently uses 100% of the heap, increase the initial and max heap size. For more information, see Memory configuration.

Page cache

page_cache.hit_ratio and page_cache.usage_ratio

When a request misses the page cache, the data must be fetched from disk, which is much slower. Ideally, the hit_ratio should be above 98% most of the time. The usage_ratio shows how much of the memory allocated to the page cache is in use. If it is at 100%, consider increasing the page cache size.

JVM garbage collection

vm.gc.time.%s

The proportion of time the JVM spends reclaiming the heap instead of doing other work. This metric can spike when the database is running low on memory, which can halt processing and cause query execution errors. If this appears to be the case, consider increasing the memory available to Neo4j.

Checkpoint time

neo4j.<db>.check_point.duration

Monitor the checkpoint duration to ensure it does not start to approach the interval between checkpoints. If it does, checkpointing performance needs to be improved.
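To illustrate how the Neo4j load metrics in this section might be evaluated, here is a minimal Python sketch. All function names are hypothetical, and the metric samples are assumed to have been read already from wherever you export them. It also assumes that the garbage collection metric is a cumulative time in milliseconds, so the proportion is derived from the difference between two samples.

```python
from typing import Sequence


def ratio(part: float, whole: float) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def heap_usage_ratio(heap_used: float, heap_total: float) -> float:
    """Fraction of the heap in use (1.0 means the heap is full)."""
    return ratio(heap_used, heap_total)


def fraction_above(samples: Sequence[float], threshold: float) -> float:
    """Fraction of samples above the threshold, e.g. hit_ratio above 0.98."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if s > threshold) / len(samples)


def gc_time_fraction(
    gc_ms_start: float, gc_ms_end: float, wall_ms_elapsed: float
) -> float:
    """Share of wall-clock time spent in GC between two samples.

    Assumes the GC metric is a cumulative millisecond counter.
    """
    return ratio(gc_ms_end - gc_ms_start, wall_ms_elapsed)


def checkpoint_interval_fraction(duration_ms: float, interval_ms: float) -> float:
    """How much of the checkpoint interval one checkpoint consumed."""
    return ratio(duration_ms, interval_ms)
```

A value from checkpoint_interval_fraction() that keeps moving toward 1.0 means checkpoints are close to taking as long as the interval between them. The interval is passed in as a parameter because it depends on your configuration.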

3. Neo4j cluster health metrics

The cluster health metrics indicate the health of a cluster member at a glance. It is important to know which instance is the leader. The leader has a different load pattern from the followers, which should exhibit similar load patterns.

Metric name Metric Description

Leader

neo4j.causal_clustering.core.is_leader

Track this for each Core cluster member. It reports 1 if the member is the leader and 0 if it is not. The sum across all members should normally be 1. However, there can be transient periods in which the sum is more than 1 because more than one member thinks it is the leader.

Transaction workload

neo4j.transaction.last_committed_tx_id

The ID of the last committed transaction. Track this for each Neo4j instance. In a cluster setup, track this for each Core cluster member and Read Replica, possibly on separate charts. Each member should show a single, ever-increasing line. If one of the lines levels off or falls behind the others, that member is no longer replicating data, and action is needed to rectify the situation.
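The two cluster health checks described in this section can be expressed as a short Python sketch. It assumes that the latest value of each metric has already been collected per member into a dictionary. The member names, the function names, and the max_lag threshold are hypothetical and would need tuning for a real workload.

```python
from typing import Dict, List


def leader_status(is_leader_by_member: Dict[str, int]) -> str:
    """Classify the sum of the is_leader values across Core members."""
    total = sum(is_leader_by_member.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no-leader"
    # More than one member currently believes it is the leader.
    return "multiple-leaders"


def lagging_members(
    last_tx_id_by_member: Dict[str, int], max_lag: int
) -> List[str]:
    """Return members whose last committed transaction ID trails the newest
    one in the cluster by more than max_lag transactions."""
    if not last_tx_id_by_member:
        return []
    newest = max(last_tx_id_by_member.values())
    return sorted(
        name
        for name, tx_id in last_tx_id_by_member.items()
        if newest - tx_id > max_lag
    )
```

Because a sum above 1 can occur transiently, an alert based on leader_status() is probably best raised only when the same result persists across several consecutive samples.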

4. Workload metrics

These metrics are useful for monitoring the workload of a Neo4j instance. The absolute values of these depend on the sort of workload you expect.

Metric name Metric Description

Bolt connections

<prefix>.bolt.connections_running

The number of connections that are currently executing Cypher and returning results.

Total nodes/relationships

neo4j.count.node and neo4j.count.relationship

(Not enabled by default) Total number of distinct relationship types. Total number of distinct property names. Total number of relationships. Total number of nodes.

Throughput

<db>.db.query.execution.latency.millis

This metric produces a histogram of 99th and 95th percentile transaction latencies. Useful for identifying spikes or increases in the data load.
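As a rough illustration of what the 95th and 99th percentiles represent, the following Python sketch computes them from a list of raw latency samples. It is not how Neo4j builds its histogram. The function name and the input format are assumptions, and the interpolation method is simply the default of the standard library's statistics.quantiles.

```python
import statistics
from typing import Sequence, Tuple


def latency_percentiles(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Return the (95th, 99th) percentile of the given latencies in ms."""
    if len(samples_ms) < 2:
        raise ValueError("at least two samples are required")
    # n=100 yields 99 cut points; index 94 is the 95th percentile
    # and index 98 is the 99th percentile.
    cut_points = statistics.quantiles(samples_ms, n=100)
    return cut_points[94], cut_points[98]
```

Comparing these values between time windows is one way to spot the spikes mentioned above. A 99th percentile that rises while the 95th stays flat suggests that only a small share of requests is slowing down.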

For the full list of all available metrics in Neo4j, see Metrics reference.