Emil Eifrem

NOSQL – Scaling to Size and Scaling to Complexity

CEO & Co-Founder, Neo4j, Inc.

November 15, 2009

About a week ago, following NOSQL East in Atlanta, Jonathan Ellis from the Cassandra project published a fantastic overview of the current NOSQL ecosystem. He analyzes 10 popular NOSQL databases along three axes: horizontal scalability, data model and internal persistence design. It’s a great read.

The third axis (internal persistence design) may not be terribly relevant for users of NOSQL systems ¹ but a project's position on the first two axes reveals some important underlying assumptions. In particular, it reveals a focus: is this NOSQL project oriented around scaling to size or scaling to complexity? ²

The four main NOSQL data models

Now, there are four main categories of NOSQL databases today. Before we get into how they differ in focus, let me just quickly run through them and outline a few key characteristics:

Key-Value Stores

Lineage: Amazon’s Dynamo paper and Distributed HashTables.
Data model: A global collection of key-value pairs.
Example: Dynomite

BigTable Clones (aka ‘ColumnFamily’)

Lineage: Google’s BigTable paper.
Data model: Column family, i.e. a tabular model where each row at least in theory can have an individual configuration of columns.
Example: HBase, Hypertable, Cassandra ³

Document Databases

Lineage: Inspired by Lotus Notes.
Data model: Collections of documents, where each document is itself a collection of key-value pairs.
Example: CouchDB and MongoDB

Graph Databases

Lineage: Draws from Euler and graph theory.
Data model: Nodes & relationships, both of which can hold key-value pairs.
Example: AllegroGraph, InfoGrid, Neo4j
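
To make the four data models above concrete, here is a minimal sketch that expresses one small fact ("alice follows bob") in each of them using plain Python structures. The dataset, the key naming scheme and the `followed_by` helper are illustrative assumptions; none of this is the API of any of the products listed.

```python
# One small fact ("alice follows bob") expressed in each of the four models.
# All names and the dataset are illustrative assumptions, not a product API.

# Key-value store: one global mapping from opaque keys to opaque values.
key_value = {
    "user:alice:name": "Alice",
    "user:bob:name": "Bob",
    "user:alice:follows": "bob",
}

# Column family: rows keyed by id; each row may carry its own set of columns.
column_family = {
    "alice": {"name": "Alice", "follows": "bob"},
    "bob": {"name": "Bob"},  # this row simply has no "follows" column
}

# Document database: a collection of self-contained key-value documents.
documents = [
    {"_id": "alice", "name": "Alice", "follows": ["bob"]},
    {"_id": "bob", "name": "Bob", "follows": []},
]

# Graph database: nodes and relationships, both of which hold properties.
nodes = {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}}
relationships = [
    {"start": "alice", "end": "bob", "type": "FOLLOWS", "since": 2009},
]


def followed_by(user_id):
    """Return the ids that user_id follows, read from the graph form."""
    return [
        rel["end"]
        for rel in relationships
        if rel["start"] == user_id and rel["type"] == "FOLLOWS"
    ]
```

Note how the relationship is just another string in the key-value form, while in the graph form it is a first-class record with its own properties.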

Scalability focus

How then do these data models scale to size and complexity? Check out this slide from my presentation at NOSQL East:

The exact positions in the picture above are obviously debatable but I think it serves to illustrate my point: the key-value stores and BigTable clones of the world handle size really well. This is because they have data models that can easily be partitioned horizontally, which is great for scaling out simple two-column data, like a whole bunch of username/password pairs.
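
As a rough illustration of why such a model partitions so easily, the sketch below routes each key to a shard using nothing but a hash of the key itself. `shard_for` and `ShardedStore` are hypothetical names, and the plain modulo is chosen for brevity; Dynamo-style systems use consistent hashing so that adding a machine does not reshuffle most keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from the key alone, using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store spread over in-memory 'machines' (plain dicts)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        # No other record needs to be consulted to decide where this goes.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Because no pair refers to any other pair, every read and write touches exactly one shard, which is the property that richer, more connected models give up.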

The drawback, however, is that by constraining themselves to simpler data models, they’ve pushed complexity up the stack. So if you have data with a non-trivial structure, then you have to compensate for a simple data model by adding more complex functionality in the upper layers. ⁴

Document databases and graph databases, on the other hand, have opted for richer data models. This means that they have more powerful abstractions that make it easy to model both simple and complex domains. But these richer data models introduce more coupling of data and therefore it’s more challenging to get them to scale to size.
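
The trade-off described in the last two paragraphs can be sketched with a "friends of friends" query. On the key-value side the store only offers get and put on opaque strings, so the application has to invent a key scheme and a value encoding for relationships. On the graph side relationships are part of the model. Everything here (the `friends:` key prefix, the comma-separated encoding, the toy `Graph` class) is an assumption made up for illustration, not how any particular database works.

```python
# Key-value side: relationships are encoded by the application, in strings.
kv = {}


def kv_add_friend(a, b):
    """Record a mutual friendship as comma-separated lists under two keys."""
    for src, dst in ((a, b), (b, a)):
        key = "friends:" + src
        current = kv.get(key, "")
        names = current.split(",") if current else []
        if dst not in names:
            names.append(dst)
        kv[key] = ",".join(names)


def kv_friends_of_friends(user):
    """Parse the encoded lists by hand and walk two hops."""
    direct = [n for n in kv.get("friends:" + user, "").split(",") if n]
    result = set()
    for friend in direct:
        for fof in kv.get("friends:" + friend, "").split(","):
            if fof and fof != user and fof not in direct:
                result.add(fof)
    return result


# Graph side: relationships are first-class, so the query is a traversal.
class Graph:
    def __init__(self):
        self.adjacent = {}

    def relate(self, a, b):
        self.adjacent.setdefault(a, set()).add(b)
        self.adjacent.setdefault(b, set()).add(a)

    def neighbors(self, node):
        return self.adjacent.get(node, set())


def graph_friends_of_friends(graph, user):
    direct = graph.neighbors(user)
    two_hops = {fof for friend in direct for fof in graph.neighbors(friend)}
    return two_hops - direct - {user}
```

Both versions return the same answer. The difference is where the knowledge about relationships lives: in string-handling application code, or in the data model itself.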

Size matters (but you’re not Google so complexity matters more)

Now, size gets a lot of attention because scaling out to hundreds of machines is very sexy. But here’s the kicker: the majority of the use cases out there don’t need to store hundreds of billions of objects and scale out to truckloads of machines.

At the end of the day, there are only so many projects of Amazon and Google scale out there. A lot of projects fit within a couple of BILLION objects. For most people, it’s a lot more important to have a rich data model that lends itself to easily representing their domain.

Ben Scofield of Viget Labs expresses it eloquently in NoSQL Misconceptions:

‘… there’s a lot more to NoSQL than just performance and scaling. Most importantly (for me, at least) is that NoSQL DBs often provide better substrates for modeling business domains. I’ve spent more than two years struggling to map just part of the comic book business onto MySQL, for instance, where something like a graph database would be a vastly better fit.’

Choose your hammer wisely

It’s important to note that these data models are all isomorphic, which is a fancy way of saying that you can express any dataset in any one of them. For example, you can decompose any data into a collection of key-value pairs.

But that’s a bit like claiming you can write any program in any Turing-complete programming language: sure, it’s true in theory but just because you can doesn’t mean that you should. In practice, there’s a bunch of programming languages that are a poor fit for many use cases and the same is true of data models.
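
The decomposition into key-value pairs mentioned above can be sketched in a few lines. The `flatten` function, the `/`-separated path scheme and the sample `order` are assumptions made for illustration. The sketch is deliberately naive: empty containers disappear and keys that contain `/` become ambiguous, which already hints at the work the application takes on.

```python
def flatten(value, prefix=""):
    """Decompose nested dicts and lists into a flat dict of path -> scalar."""
    pairs = {}
    if isinstance(value, dict):
        for key, child in value.items():
            pairs.update(flatten(child, f"{prefix}/{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            pairs.update(flatten(child, f"{prefix}/{index}"))
    else:
        pairs[prefix] = value
    return pairs


order = {
    "customer": {"name": "Alice"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
flat = flatten(order)
# flat now holds five pairs, for example "/items/0/sku" -> "A1"
```

The data survives the trip, but the structure now lives in the key strings, and any question such as "which items does this order contain?" becomes string matching on keys in application code.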

I think it’s clear that we’re rapidly moving beyond the era of the One Size Fits All database. Whereas in the past you could always trust that any decent-sized app had a relational database as backend, it’s now increasingly about matching your dataset to whatever data model fits best. NOSQL is not No To SQL. NOSQL means Not Only SQL, as in: in the future, our backends will consist of Not Only SQL databases but also key-value stores, graph databases and more.

NOSQL is about choice and picking the right tool for the job. When you look at adding a NOSQL database to your current project, consider your requirements both for scaling to size and for scaling to complexity.

¹ Few developers care whether their RDBMS implementation uses hash joins or nested loop joins.

² The notions of scaling to size and scaling to complexity were introduced (at least to me) in O’Reilly’s Beautiful Data by Toby Segaran and Jeff Hammerbacher. The graph of the various NOSQL data models was first visualized by my friend and colleague Peter Neubauer.

³ Cassandra is actually the first of the ‘second-generation’ NOSQL databases and it combines the decentralized scale-out architecture of the Dynamo clones with the data model of BigTable.

⁴ As an analogy, imagine writing any piece of software and the *only* construct you had for storing state was a single global hashtable. No linked lists, no arrays, no structs, no objects. Imagine how much code you’d have to add just to work around that hashtable! Now, a key-value store is basically a distributed hashtable. This is why they have problems with scaling to complexity.