Neo4j 1.8.M06 – Rolling Upgrades

July 13, 2012

Neo4j 1.8 Milestone 6 covers all major improvements on the 1.8 roadmap. Alongside the usual tweaks and updates, this milestone delivers a welcome feature for operations engineers: rolling upgrades across a cluster.

Rolling Upgrades

There is a subtle operational challenge when managing database upgrades over a cluster. We chatted with the ever clever Chris Gioran about rolling upgrades:

ABK: So Chris, what prompted the development of rolling upgrades?
CG: What we’re trying to achieve is, when you have an HA cluster that runs on a capable version — starting from 1.5.3 onwards, including the 1.6 and 1.7 series — the exercise is to upgrade everything without disturbing the operation of the cluster. The cluster will upgrade, while continuing to serve requests from either slaves or masters.
ABK: Can’t this be done today by just upgrading one instance at a time, leaving the rest running?
CG: Not necessarily.
ABK: What’s the problem with that?
CG: The problem is when we have breaking changes in the protocol used to communicate between instances. For example, going from 1.5.3 to 1.7, it’s not possible to have a slave on 1.7 talking to a 1.5 master (or vice versa) because we’ve made changes for performance and stability to the protocol itself.
ABK: With rolling upgrades, each of these different versions, though speaking different protocols, will gracefully coordinate?
CG: Yes.
ABK: Describe how that actually happens.
CG: So the rolling upgrade, actually, works exactly as you’d expect an upgrade would work. If there are no breaking changes between versions, you normally begin with the slaves: powering each one down, copying the store, migrating configuration if needed, then bringing that server back up. The new version would take over, communicate with the rest of the cluster and you wouldn’t notice anything.
A rolling upgrade offers that with versions that have incompatible protocols. Each slave, as it is brought up, detects the version running in the cluster and gracefully falls back into a compatibility mode that doesn’t allow it to become master, but allows it to continue to execute transactions.
ABK: Does order matter?
CG: Ordering does matter. It won’t break things, but it is better to start with the slaves. We’ve defined the point where the cluster as a whole has an upgraded version: the moment the master switch happens, the cluster switches from the old version to the new version. You leave the master as the last machine running the old version. When you bring that down, an instance running the new version will become master. The rest of the slaves will detect that, then will roll forward to the new version, and continue operating.
ABK: That sounds great. And all the way back to 1.5.3. This is fantastic. Thanks so much for explaining this, Chris.
CG: Happy to make things work.
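
The ordering Chris describes can be sketched as a small simulation. This is only an illustrative model, not Neo4j code: the `Instance` class, the `compat_mode` flag, and the `rolling_upgrade` function are hypothetical names, and picking the first upgraded slave as the new master is an assumption standing in for the real election.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    is_master: bool = False
    # Hypothetical flag: the instance may execute transactions
    # but is not allowed to become master.
    compat_mode: bool = False


def rolling_upgrade(cluster, new_version):
    """Upgrade slaves first and the master last; return the steps taken."""
    master = next(i for i in cluster if i.is_master)
    slaves = [i for i in cluster if not i.is_master]
    if not slaves:
        raise ValueError("a rolling upgrade needs at least one slave to take over")

    steps = []
    # 1. Upgrade each slave while the old master keeps serving.
    for slave in slaves:
        slave.version = new_version
        slave.compat_mode = True  # old-version master detected
        steps.append(f"upgraded slave {slave.name} (compatibility mode)")

    # 2. Bring down the old master, the last instance on the old version.
    master.is_master = False
    steps.append(f"stopped old master {master.name}")

    # 3. An upgraded instance takes over (assumption: simply the first slave),
    #    and the remaining slaves roll forward out of compatibility mode.
    new_master = slaves[0]
    new_master.is_master = True
    for slave in slaves:
        slave.compat_mode = False
    steps.append(f"{new_master.name} became master on {new_version}")

    # 4. The old master is upgraded and rejoins as a slave.
    master.version = new_version
    steps.append(f"upgraded {master.name}, rejoined as slave")
    return steps


if __name__ == "__main__":
    cluster = [
        Instance("a", "1.5.3", is_master=True),
        Instance("b", "1.5.3"),
        Instance("c", "1.5.3"),
    ]
    for step in rolling_upgrade(cluster, "1.8.M06"):
        print(step)
```

The point of the sketch is the ordering: the master switch is the single moment at which the cluster as a whole moves to the new version.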

Notable Changes

Kernel:

Deprecated AbstractGraphDatabase.transactionRunning()
Changed synchronization of applying transactions to prevent a deadlock scenario
Original cause can be extracted from a transaction RollbackException

Server:

Fixed issue that stopped the server from starting without the UDC jars

Cypher:

Fixed problem when graph elements are deleted multiple times in the same query
Fixed #625: Values passed through WITH have the wrong type
Fixed #654: Some paths are returned the wrong way

HA:

Added a transaction push factor, configured as the number of slaves to which each transaction should be pushed. The master optimistically pushes each transaction to that many slaves before tx.finish() completes, reducing the risk of branched data.
Added the ability for rolling upgrades from versions 1.5.3 onwards.
Changed the way master election notification and data gathering work, leading to far fewer writes to the ZooKeeper service and a corresponding performance increase.
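
To make the transaction push factor above more concrete, here is a minimal sketch of the idea, not the actual HA implementation. The `Slave` class and `finish_transaction` function are hypothetical, and the behaviour on a failed push (skip that slave and try the next one, never failing the commit) is an assumption about what "optimistically" means here.

```python
class Slave:
    """Hypothetical stand-in for a slave instance in the cluster."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.applied = []  # transaction ids this slave has received

    def receive(self, tx_id):
        if not self.reachable:
            raise ConnectionError(self.name)
        self.applied.append(tx_id)


def finish_transaction(tx_id, slaves, push_factor):
    """Push tx_id to up to push_factor slaves; return how many received it."""
    pushed = 0
    for slave in slaves:
        if pushed >= push_factor:
            break
        try:
            slave.receive(tx_id)
            pushed += 1
        except ConnectionError:
            # Optimistic push (assumption): a failed push is skipped
            # and does not fail the commit on the master.
            continue
    return pushed


if __name__ == "__main__":
    slaves = [Slave("a"), Slave("b", reachable=False), Slave("c")]
    count = finish_transaction(1, slaves, push_factor=2)
    print(f"transaction 1 pushed to {count} slave(s)")
```

The more slaves that hold a transaction by the time the commit returns, the less likely it is that a master failure leaves that transaction stranded on a single machine, which is how branched data arises.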