24.6. Caches in Neo4j

For how to provide custom configuration to Neo4j, see Section 24.1, “Introduction”.

Neo4j utilizes two different types of caches: A file buffer cache and an object cache. The file buffer cache caches the storage file data in the same format as it is stored on the durable storage media. The object cache caches the nodes, relationships and properties in a format that is optimized for high traversal speeds and transactional writes.

File buffer cache

The file buffer cache caches the Neo4j data in the same format as it is represented on the durable storage media. The purpose of this cache layer is to improve both read and write performance. The file buffer cache improves write performance by writing to the cache and deferring durable write until the logical log is rotated. This behavior is safe since all transactions are always durably written to the logical log, which can be used to recover the store files in the event of a crash. It also improves write performance by batching up many small writes into fewer page-sized writes.

Since the operation of the cache is tightly related to the data it stores, a short description of the Neo4j durable representation format is necessary background. Neo4j stores data in multiple files and relies on the underlying file system to handle this efficiently. Each Neo4j storage file contains uniform fixed size records of a particular type:

Store file Record size Contents

neostore.nodestore.db

15 B

Nodes

neostore.relationshipstore.db

34 B

Relationships

neostore.propertystore.db

41 B

Properties for nodes and relationships

neostore.propertystore.db.strings

128 B

Values of string properties

neostore.propertystore.db.arrays

128 B

Values of array properties

For strings and arrays, where data can be of variable length, data is stored in one or more 120B chunks, with 8B record overhead. The sizes of these blocks can actually be configured when the store is created using the string_block_size and array_block_size parameters. The size of each record type can also be used to calculate the storage requirements of a Neo4j graph or the appropriate cache size for each file buffer cache. Note that some strings and arrays can be stored without using the string store or the array store respectively, see Section 24.9, “Compressed storage of short strings” and Section 24.10, “Compressed storage of short arrays”.

Neo4j uses multiple file buffer caches, one for each different storage file. Each file buffer cache divides its storage file into a number of equally sized windows. Each cache window contains an even number of storage records. The cache holds the most active cache windows in memory and tracks hit vs. miss ratio for the windows. When the hit ratio of an uncached window gets higher than the miss ratio of a cached window, the cached window gets evicted and the previously uncached window is cached instead.

[Important]Important

Note that the block sizes can only be configured at store creation time.

Configuration

Parameter Possible values Effect

mapped_memory_total_size

The maximum amount of memory to use for the file buffer cache, either in bytes (or greater byte-like units, such as 100M for 100 mega-bytes, or 4G for 4 giga-bytes) or a percentage of the available amount of memory.

The amount of memory to use for mapping the store files, either in bytes or as a percentage of available memory. This will be clipped at the amount of free memory observed when the database starts, and automatically be rounded down to the nearest whole page. For example, if 500MB is configured, but only 450MB of memory is free when the database starts, then the database will map at most 450MB. If 50% is configured, and the system has a capacity of 4GB, then at most 2GB of memory will be mapped, unless the database observes that less than 2GB of memory is free when it starts.

mapped_memory_page_size

The number of bytes per page. Preferably a power-of-two that is close to the native sector size or page size of the durable medium.

The size of the pages, in bytes, that the database will use for mapping the store files into memory. Don’t change this without carefully measuring the performance implications.

string_block_size

The number of bytes per block.

Specifies the block size for storing strings. This parameter is only honored when the store is created, otherwise it is ignored. Note that each character in a string occupies two bytes, meaning that a block size of 120 (the default size) will hold a 60 character long string before overflowing into a second block. Also note that each block carries an overhead of 8 bytes. This means that if the block size is 120, the size of the stored records will be 128 bytes.

array_block_size

Specifies the block size for storing arrays. This parameter is only honored when the store is created, otherwise it is ignored. The default block size is 120 bytes, and the overhead of each block is the same as for string blocks, i.e., 8 bytes.

dump_configuration

true or false

If set to true the current configuration settings will be written to the default system output, mostly the console or the logfiles.

When configuring the amount of memory allowed for the file buffers and the JVM heap, make sure to also leave room for the operating systems page cache, and other programs and services the system might want to run. It is important to configure the memory usage, such that the Neo4j JVM process won’t need to use any swap memory, as this will cause a significant drag on the performance of the system.

When reading the configuration parameters on startup Neo4j will automatically configure the parameters that are not specified. The cache sizes will be configured based on the available memory on the computer.

Object cache

The object cache caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph. There are two different categories of object caches in Neo4j.

Firstly, there are the reference caches. With these caches, Neo4j will utilize as much of the allocated JVM heap memory as it can to hold nodes and relationships. It relies on garbage collection for eviction from the cache in an LRU manner. Note however that Neo4j is “competing” for the heap space with other objects in the same JVM, such as a your application (if deployed in embedded mode) or intermediate objects produced by Cypher queries, and Neo4j will yield to the application or query by using less memory for caching.

[Note]Note

The High-Performance Cache described below is only available in the Neo4j Enterprise Edition.

The other is the High-Performance Cache which gets assigned a certain maximum amount of space on the JVM heap and will purge objects whenever it grows bigger than that. Objects are evicted from the high performance cache when the maximum size is about to be reached, instead of relying on garbage collection (GC) to make that decision. With the high-performance cache, GC-pauses can be better controlled. The overhead of the High-Performance Cache is also much smaller as well as insert/lookup times faster than for reference caches.

[Tip]Tip

The use of heap memory is subject to the Java Garbage Collector — depending on the cache type some tuning might be needed to play well with the GC at large heap sizes. Therefore, assigning a large heap for Neo4j’s sake isn’t always the best strategy as it may lead to long GC-pauses. Instead leave some space for Neo4j’s filesystem caches. These are outside of the heap and under under the kernel’s direct control, thus more efficiently managed.

The content of this cache are objects with a representation geared towards supporting the Neo4j object API and graph traversals. Reading from this cache may be 5 to 10 times faster than reading from the file buffer cache. This cache is contained in the heap of the JVM and the size is adapted to the current amount of available heap memory.

Nodes and relationships are added to the object cache as soon as they are accessed. The cached objects are however populated lazily. The properties for a node or relationship are not loaded until properties are accessed for that node or relationship. String (and array) properties are not loaded until that particular property is accessed. The relationships for a particular node is also not loaded until the relationships are accessed for that node.

Configuration

The main configuration parameter for the object cache is the cache_type parameter. This specifies which cache implementation to use for the object cache. Note that there will exist two cache instances, one for nodes and one for relationships. The available cache types are:

cache_type Description

none

Do not use a high level cache. No objects will be cached.

soft

Provides optimal utilization of the available memory. Suitable for high performance traversal. May run into GC issues under high load if the frequently accessed parts of the graph does not fit in the cache.

This is the default cache type in Neo4j Community Edition.

weak

Provides short life span for cached objects. Suitable for high throughput applications where a larger portion of the graph than what can fit into memory is frequently accessed.

strong

This cache will hold on to all data that gets loaded to never release it again. Provides good performance if your graph is small enough to fit in memory.

hpc

The High-Performance Cache. Provides means of assigning a specific amount of memory to dedicate to caching loaded nodes and relationships. Small footprint and fast insert/lookup. Should be the best option for most scenarios. See below on how to configure it. <<<<<<< HEAD Note that this option is only available in Neo4j Enterprise Edition where it is the default implementation. ======= This cache type is only available in the Neo4j Enterprise Edition, and is the default in that edition. >>>>>>> upstream/2.1-maint

High-Performance Cache

How much memory to allocate to the High Performance Cache can be fine tuned to suit your use case. There are two ways to configure memory usage for it.

Standard configuration

For most use cases, simply specifying a percentage of the memory available for caching to use is enough to tune the High-Performance Cache.

Allocating more memory to the cache gives faster querying speed, but takes memory away from other components and may put strain on the JVM garbage collector.

The max amount of memory available for caching depends on which garbage collector you are using. For CMS and G1 collectors (CMS is the Neo4j default), it will be equal to the max size of the old generation. How big the old generation is is platform-dependent, but can for CMS and G1 be configured using the "NewRatio" JVM configuration option. For other collectors, it will be half of the total heap size.

configuration option Description (what it controls) Example value

cache.memory_ratio

Percentage, 0-100, of memory available for caching to use for caching. Default is 50%.

50.0

Advanced configuration

The advanced configuration gives more fine-grained control of how much memory to allocate specific parts of the cache. There are two aspects to this configuration.

One is the size of the array referencing the objects that are put in the cache. It is specified as a fraction of the heap, for example specifying 5 will let that array itself take up 5% out of the entire heap. Increasing this figure (up to a maximum of 10) will reduce the chance of hash collisions at the expense of more heap used for it. More collisions means more redundant loading of objects from the low level cache.

configuration option Description (what it controls) Example value

node_cache_array_fraction

Fraction of the heap to dedicate to the array holding the nodes in the cache (max 10).

7

relationship_cache_array_fraction

Fraction of the heap to dedicate to the array holding the relationships in the cache (max 10).

5

The other aspect is the maximum size of all the objects in the cache. It is specified as size in bytes, for example 500M for 500 megabytes or 2G for two gigabytes. Right before the maximum size is reached a purge is performed where (currently) random objects are evicted from the cache until the cache size gets below 90% of the maximum size. Optimal settings for the maximum size depends on the size of your graph. The configured maximum size should leave enough room for other objects to coexist in the same JVM, but at the same time large enough to keep loading from the low level cache at a minimum. Predicted load on the JVM as well as layout of domain level objects should also be take into consideration.

configuration option Description (what it controls) Example value

node_cache_size

Maximum size of the heap memory to dedicate to the cached nodes.

2G

relationship_cache_size

Maximum size of the heap memory to dedicate to the cached relationships.

800M

You can read about references and relevant JVM settings for Sun HotSpot here: