3.5.3. Statistics

3.5.3.1. Introduction

When you issue a Cypher query, it gets compiled to an execution plan (see Section 3.7, “Execution plans”) that can run and answer your question. To produce an efficient plan for your query, Neo4j needs information about your database, such as the schema — what indexes and constraints do exist? Neo4j will also use statistical information it keeps about your database to optimize the execution plan. With this information, Neo4j can decide which access pattern leads to the best performing plans.

The statistical information that Neo4j keeps is:

  1. The number of nodes with a certain label.
  2. Selectivity per index.
  3. The number of relationships by type.
  4. The number of relationships by type, ending or starting from a node with a specific label.

Neo4j keeps the statistics up to date in two different ways. For label counts for example, the number is updated whenever you set or remove a label from a node. For indexes, Neo4j needs to scan the full index to produce the selectivity number. Since this is potentially a very time-consuming operation, these numbers are collected in the background when enough data on the index has been changed.

3.5.3.2. Configuration options

Execution plans are cached and will not be replanned until the statistical information used to produce the plan has changed. The following configuration options allows you to control how sensitive replanning should be to updates of the database.

dbms.index_sampling.background_enabled

Controls whether indexes will automatically be re-sampled when they have been updated enough. The Cypher query planner depends on accurate statistics to create efficient plans, so it is important it is kept up to date as the database evolves.

If background sampling is turned off, make sure to trigger manual sampling when data has been updated.

dbms.index_sampling.update_percentage
Controls how large portion of the index has to have been updated before a new sampling run is triggered.
cypher.statistics_divergence_threshold
Controls how much the above statistical information is allowed to change before an execution plan is considered stale and has to be replanned. If the relative change in any of statistics is larger than this threshold, the plan will be thrown away and a new one will be created. A threshold of 0.0 means always replan, and a value of 1.0 means never replan.

3.5.3.3. Manual index sampling

Index resampling can be triggered using two built-in procedures db.resampleIndex() and db.resampleOutdatedIndexes().

Here is an example of using cypher-shell to trigger resampling.

> cypher-shell 'CALL db.resampleIndex(":Person(name)");'

> cypher-shell 'CALL db.resampleOutdatedIndexes();'