11.6. Linux file system tuning

This section covers Neo4j I/O behavior, and how to optimize for operations on disk.

Databases often produce many small and random reads when querying data, and few sequential writes when committing changes.

By default, most Linux distributions schedule I/O requests using the Completely Fair Queuing (CFQ) algorithm, which provides a good balance between throughput and latency. The particular I/O workload of a database, however, is better served by the Deadline scheduler. The Deadline scheduler gives preference to read requests, and processes them as soon as possible. This tends to decrease the latency of reads, while the latency of writes goes up. Since the writes are usually sequential, their lingering in the I/O queue increases the chance of overlapping or adjacent write requests being merged together. This effectively reduces the number of writes that are sent to the drive.

On Linux, the I/O scheduler for a drive (sda in this example) can be changed at runtime. Writing to the scheduler file requires root privileges:

$ echo 'deadline' > /sys/block/sda/queue/scheduler
$ cat               /sys/block/sda/queue/scheduler
noop [deadline] cfq
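
To check which scheduler is active on every drive rather than just one, the bracketed entry in each scheduler file can be extracted with a small script. The following is a minimal sketch and not part of Neo4j; the function names active_scheduler and list_schedulers are hypothetical. The scheduler names offered depend on the kernel; newer kernels may list mq-deadline rather than deadline.

```shell
#!/bin/sh
# Extract the active scheduler (the name in square brackets) from the
# contents of a /sys/block/<device>/queue/scheduler file.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<device>: <active scheduler>" for every block device that
# exposes a readable scheduler file.
list_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}   # strip the leading path
        dev=${dev%%/*}         # keep only the device name
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Calling list_schedulers prints one line per drive. A scheduler set through /sys applies only until the next reboot, so the setting has to be reapplied at boot if it should persist.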

Another recommended practice is to disable file and directory access time updates. This way, the file system won’t have to issue writes that update this metadata, thus improving write performance. This can be accomplished by setting the noatime and nodiratime mount options, either in /etc/fstab or when issuing the mount command.
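
As an illustration, the options could be set and then verified as sketched below. The device /dev/sdb1, the mount point /var/lib/neo4j, and the ext4 file system type are placeholders, and the helper functions has_noatime and mount_options are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Illustrative /etc/fstab entry (device, mount point, and type are placeholders):
#   /dev/sdb1  /var/lib/neo4j  ext4  defaults,noatime,nodiratime  0  2
#
# Or apply the options to an already mounted file system (requires root):
#   mount -o remount,noatime,nodiratime /var/lib/neo4j

# Succeed if the comma-separated mount option string contains noatime.
has_noatime() {
    case ",$1," in
        *,noatime,*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Print the options of the file system mounted at the given mount point,
# as listed in /proc/mounts.
mount_options() {
    awk -v mp="$1" '$2 == mp { print $4 }' /proc/mounts
}
```

For example, `has_noatime "$(mount_options /var/lib/neo4j)" && echo "noatime is set"` reports whether the option is in effect for that mount point.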

Since databases can put a high and consistent load on a storage system for a long time, it is recommended to use a file system that has good aging characteristics. The EXT4 and ZFS file systems generally cope well with aging, and are therefore recommended as a first choice.

XFS can have a slightly higher write throughput than EXT4, and unlike EXT4 supports files that are larger than 32 TiB. However, it needs additional tuning to improve its aging characteristics. If you wish to run Neo4j on XFS, read the xfs and mkfs.xfs man pages carefully, and make sure that at least the crc, finobt, and sparse XFS options are enabled.
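
The three options named above map to mkfs.xfs flags as sketched below. Because formatting destroys all data on the target, the function only prints the command instead of running it. The function name xfs_format_command and the device /dev/sdX1 are hypothetical; check the mkfs.xfs man page for your version, since the defaults for these options vary between xfsprogs releases.

```shell
#!/bin/sh
# Print (but do not run) an mkfs.xfs command line that enables metadata
# checksums (crc), the free inode B-tree (finobt), and sparse inode
# allocation (sparse) for the given device.
xfs_format_command() {
    printf 'mkfs.xfs -m crc=1,finobt=1 -i sparse=1 %s\n' "$1"
}

# Example: show the command for a placeholder device.
xfs_format_command /dev/sdX1
```

On an existing XFS file system, running xfs_info against the mount point shows whether crc, finobt, and sparse are enabled.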

Running Neo4j on Btrfs is not advised, since it can behave badly when it is close to running out of storage space.

A high read and write I/O load can also degrade SSD performance over time. The first line of defense against SSD wear is to ensure that the working dataset fits in RAM. A database with a high write workload will, however, still cause wear on SSDs. The simplest way to combat this is to over-provision: use SSDs that are at least 20% larger than you strictly need them to be. A larger drive gives the wear-leveling algorithm more room to work with. Enterprise drives generally have higher endurance than consumer drives. V-NAND has higher endurance than planar NAND, and TLC NAND lasts longer than QLC. Likewise, MLC lasts longer than TLC, but at the cost of reduced drive capacity.
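
The 20% over-provisioning guideline amounts to simple arithmetic, sketched here with integer math. The function name overprovisioned_gb is hypothetical, and the result is a lower bound on drive size, not a sizing recommendation for any particular workload.

```shell
#!/bin/sh
# Print the smallest whole number of GB that is at least 20% larger
# than the required capacity (first argument, in GB), rounding up.
overprovisioned_gb() {
    echo $(( ($1 * 120 + 99) / 100 ))
}

# Example: a store that needs 500 GB calls for a drive of at least 600 GB.
overprovisioned_gb 500
```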