11.5. Compressed storage

This section explains Neo4j property value compression and disk usage.

Neo4j can in many cases compress and inline the storage of property values, such as short arrays and strings, with the purpose of saving disk space and possibly an I/O operation.

Compressed storage of short arrays

Neo4j will try to store your primitive arrays in a compressed way. To do that, it employs a "bit-shaving" algorithm that tries to reduce the number of bits required for storing the members of the array. In particular:

  1. For each member of the array, it determines the position of the leftmost set bit.
  2. It determines the largest such position among all members of the array.
  3. It reduces all members to that number of bits.
  4. It stores those values, prefixed by a small header.

This means that if even a single negative value is included in the array, the original size of the primitives will be used, because a negative number has its leftmost (sign) bit set.

Depending on its size after compression, the result may be inlined in the property record.

For example, the array long[] {0L, 1L, 2L, 4L} will be inlined: the largest entry (4) requires 3 bits to store, so the whole array will be stored in 4 × 3 = 12 bits.

However, the array long[] {-1L, 1L, 2L, 4L} requires the full 64 bits for the -1 entry, so it needs 64 × 4 = 256 bits (32 bytes) and will end up in the dynamic store.
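The bit counting in the steps and examples above can be reproduced with a small Java sketch. This is an illustration only, not Neo4j's actual implementation: the class and method names are hypothetical, the header is ignored, and treating a zero value as needing 1 bit is an assumption of the sketch.

```java
// Illustrative sketch of the "bit-shaving" size calculation; not Neo4j source code.
final class BitShavingSketch {

    // Position of the leftmost set bit, i.e. the bits needed for one value.
    // Assumption: a zero value is counted as needing 1 bit.
    public static int bitsRequired(long value) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    }

    // Steps 1 and 2: the largest bit requirement among all members.
    public static int bitsPerMember(long[] values) {
        int max = 1;
        for (long v : values) {
            max = Math.max(max, bitsRequired(v));
        }
        return max;
    }

    // Step 3: payload size once every member is reduced to that width.
    // The small header from step 4 is not included.
    public static int payloadBits(long[] values) {
        return values.length * bitsPerMember(values);
    }

    public static void main(String[] args) {
        long[] small = {0L, 1L, 2L, 4L};
        long[] withNegative = {-1L, 1L, 2L, 4L};
        // prints "3 bits each, 12 bits total"
        System.out.println(bitsPerMember(small) + " bits each, "
                + payloadBits(small) + " bits total");
        // prints "64 bits each, 256 bits total"
        System.out.println(bitsPerMember(withNegative) + " bits each, "
                + payloadBits(withNegative) + " bits total");
    }
}
```

The -1 entry has every bit set, so Long.numberOfLeadingZeros returns 0 and the whole array falls back to 64 bits per member.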

Compressed storage of short strings

Neo4j will try to classify your strings into one of the short string classes, and if it succeeds, it will treat the string accordingly. In that case, the string is stored without indirection: it is inlined in the property record in the property store. This means that the dynamic string store is not involved in storing that value, which reduces the disk footprint. Additionally, because no string record is needed to store the property, it can be read and written in a single lookup, which improves performance.

Table 11.6. Various classes for short strings

String class       Description
Numerical          Consists of digits 0..9, and the punctuation space, period, dash, plus, comma, and apostrophe.
Date               Consists of digits 0..9, and the punctuation space, dash, colon, slash, plus, and comma.
Hex (lower case)   Consists of digits 0..9, and lower case letters a..f.
Hex (upper case)   Consists of digits 0..9, and upper case letters A..F.
Upper case         Consists of upper case letters A..Z, and the punctuation space, underscore, period, dash, colon, and slash.
Lower case         Like upper case, but with lower case letters a..z.
E-mail             Consists of lower case letters a..z, and the punctuation comma, underscore, period, dash, plus, and the @ symbol.
URI                Consists of lower case letters a..z, digits 0..9, and most available punctuation.
Alpha-numerical    Consists of upper case letters A..Z, lower case letters a..z, digits 0..9, and punctuation space and underscore.
Alpha-symbolical   Consists of upper case letters A..Z, lower case letters a..z, and the punctuation space, underscore, period, dash, colon, slash, plus, comma, apostrophe, @ symbol, pipe, and semicolon.
European           Consists of most accented European characters and digits, plus punctuation space, dash, underscore, and period (like Latin1, but with less punctuation).
Latin1
UTF-8

In addition to the string’s contents, the number of characters also determines if the string can be inlined or not. Each class has its own character count limits, as described below:

Table 11.7. Character count limits

String class                                 Character count limit
Numerical, Date, and Hex                     54
Upper case, Lower case, and E-mail           43
URI, Alpha-numerical, and Alpha-symbolical   36
European                                     31
Latin1                                       27
UTF-8                                        14

This means that the longest inline-able string is 54 characters long and must be of the Numerical, Date, or Hex class. Additionally, any string of 14 characters or fewer will always be inlined.
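To illustrate how the character set and the character count combine, here is a minimal Java sketch that checks a string against a few of the classes in Table 11.6 and their limits from Table 11.7. It is not Neo4j's classifier: the class, enum, and method names are hypothetical, only six classes are modelled, and the order in which classes are tried is an assumption.

```java
import java.util.Optional;

// Illustrative sketch of short string classification; not Neo4j source code.
final class ShortStringSketch {

    // A subset of the classes from Table 11.6, with limits from Table 11.7.
    enum StringClass {
        NUMERICAL("0123456789 .-+,'", 54),
        DATE("0123456789 -:/+,", 54),
        HEX_LOWER("0123456789abcdef", 54),
        HEX_UPPER("0123456789ABCDEF", 54),
        UPPER_CASE("ABCDEFGHIJKLMNOPQRSTUVWXYZ _.-:/", 43),
        LOWER_CASE("abcdefghijklmnopqrstuvwxyz _.-:/", 43);

        final String allowed; // characters permitted in this class
        final int limit;      // character count limit for inlining

        StringClass(String allowed, int limit) {
            this.allowed = allowed;
            this.limit = limit;
        }

        // True if the string is within the limit and uses only allowed characters.
        boolean accepts(String s) {
            if (s.length() > limit) {
                return false;
            }
            for (int i = 0; i < s.length(); i++) {
                if (allowed.indexOf(s.charAt(i)) < 0) {
                    return false;
                }
            }
            return true;
        }
    }

    // Returns the first modelled class that accepts the string, in declaration
    // order (an assumed order), or empty if none of them does.
    public static Optional<StringClass> classify(String s) {
        for (StringClass c : StringClass.values()) {
            if (c.accepts(s)) {
                return Optional.of(c);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(classify("2011-10-25 13:45")); // prints "Optional[DATE]"
        System.out.println(classify("deadbeef"));         // prints "Optional[HEX_LOWER]"
        System.out.println(classify("HELLO_WORLD"));      // prints "Optional[UPPER_CASE]"
    }
}
```

An empty result only means that none of the six modelled classes matched; the string may still fit one of the classes left out of the sketch.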

The limits described in this section are for the default 41-byte PropertyRecord layout. If that parameter is changed by editing the source and recompiling, the limits above will have to be recalculated.