By Bryce Merkl Sasaki, Editor-in-Chief, Neo4j | September 11, 2015
The world of NoSQL databases is far from uniform.
Unlike relational databases (RDBMS), which are fairly uniform, the ecosystem of NoSQL technologies brims with diversity. In order to understand the trade-offs of each NoSQL database, it’s essential that we tour its largest category: aggregate stores.
In this “Graph Databases for Beginners” blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve tackled why graphs are the future, why data relationships matter, the basics (and pitfalls) of data modeling, why a query language matters, why we need NoSQL databases and the trade-offs of ACID versus BASE consistency models.
This week, we’ll discuss the category of NoSQL databases known as aggregate stores.
Aggregate Stores and the World of NoSQL Databases
The group of NoSQL databases collectively known as aggregate stores (term coined by Martin Fowler) includes key-value stores, column family stores and document stores, which are all highlighted in blue below. (Next week, we’ll examine the various types of graph technologies, which are another facet of NoSQL.)
It’s worth noting that aggregate stores eschew connections between aggregates; only graph databases fully capitalize on data relationships.
Key-Value Stores
Key-value stores are large, distributed hashmap data structures that store and retrieve values organized by identifiers known as keys. Here’s a diagram of an example key-value store.
A basic key-value store. Source: Jorge Stolfi.
As you can see, a bucket contains a specific number of values, and for fault-tolerance reasons, each bucket is replicated onto several machines. However, machines should never be exact copies of one another, not only for data replication purposes but also for better load balancing.
An application wishing to store or retrieve data in a key-value store only needs to know (or compute) the corresponding key, which can be as natural as a username, an email address, Cartesian coordinates, a Social Security number or a ZIP code. With a sensibly designed system, the chance of losing data due to a missing key is low.
In theory, key-value stores simply concern themselves with efficient storage and retrieval of data, unencumbered by its nature or usage. But this approach has its downsides: When extracting data from a stored value, applications often have to retrieve the entire value (which can be quite large) and then filter out any unwanted elements, which can be inefficient.
Although simple, the key-value model doesn’t offer much insight into data relationships. In order to retrieve sets of information across several records, you typically need to conduct external processing with an algorithm like MapReduce, which often means high-latency results.
However, key-value stores do have certain advantages. Since many of them are descended from Amazon’s Dynamo database, they tend to be optimized for high availability and scale. Or, as the Amazon team puts it, they should work even “if disks are failing, network routes are flapping or data centers are being destroyed by tornados.”
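The bucket-and-replica idea described above can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than how any particular product works: the class name `TinyKeyValueStore`, the bucket and machine counts, and the "place each bucket on consecutive machines" rule are all assumptions made for the example.

```python
import hashlib
import json


class TinyKeyValueStore:
    """Toy key-value store: keys hash to buckets, buckets are replicated."""

    def __init__(self, num_buckets=4, num_machines=3, replicas=2):
        self.num_buckets = num_buckets
        self.num_machines = num_machines
        self.replicas = replicas
        # Each "machine" is a dict of bucket number -> {key: serialized value}.
        self.machines = [dict() for _ in range(num_machines)]

    def _bucket_for(self, key):
        # A stable hash, so the same key always maps to the same bucket.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def _machines_for(self, bucket):
        # Hypothetical placement rule: `replicas` consecutive machines.
        return [(bucket + i) % self.num_machines for i in range(self.replicas)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        blob = json.dumps(value)  # the store treats the value as opaque
        for machine in self._machines_for(bucket):
            self.machines[machine].setdefault(bucket, {})[key] = blob

    def get(self, key, failed=()):
        # Try each replica in turn, skipping machines marked as failed.
        bucket = self._bucket_for(key)
        for machine in self._machines_for(bucket):
            if machine in failed:
                continue
            blob = self.machines[machine].get(bucket, {}).get(key)
            if blob is not None:
                return json.loads(blob)
        return None


store = TinyKeyValueStore()
store.put("alice@example.com", {"name": "Alice", "zip": "94105", "orders": [101, 102]})

# The application fetches the whole value, then filters out what it needs.
profile = store.get("alice@example.com")
zip_code = profile["zip"]
```

Note how the last two lines mirror the downside mentioned above: the store hands back the entire value, and picking out the ZIP code is the application's job.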
Column Family Stores
Column family stores (also known as wide-column stores) are based on a sparsely populated table whose rows can contain arbitrary columns and where keys provide for natural indexing. (Note: In the explanation below, we’ll use terminology from Apache Cassandra since it is one of the most popular column family stores.)
In the diagram below, you can see the four building blocks of a column family database.
The four building blocks of a column family database.
The simplest unit of storage is the column itself, consisting of a name-value pair. Any number of columns can then be combined into a super column, which gives a name to a particular set of columns. Columns are stored in rows: when a row contains columns only, it is known as a column family, but when a row contains super columns, it is known as a super column family.
At first it might seem odd to include rows when the data is mostly organized via columns, but in fact, rows are vital since they provide a nested hashmap for columnar data. Consider the diagram below of a super column family mapping out a recording artist and his albums.
Storing data in a super column family.
In a column family database, each row in the table represents a particular overarching entity (e.g., everything about an artist). The column families within a row are containers for related pieces of data, such as the artist’s name and discography. Within the column families we find actual key-value data, such as album release dates and the artist’s date of birth.
Here’s the kicker: This row-oriented view can also be turned 90 degrees to arrive at a column-oriented view. Where each row gives a complete view of one entity, the column view naturally indexes particular aspects across the whole dataset. For example, let’s look at the figure below:
Keys form a natural index through rows in a column family database.
As you can see, by “lining up” keys we can find all the rows where the artist is English. From there it’s easy to extract complete artist data from each row. It’s not the same as the connected data we’d find in a graph, but it does provide some insight into related entities.
Column family databases are distinguished from document and key-value stores not only by their more expressive data model, but also by their architecture built for distribution, scale and failover. And yet they’re still aggregate stores and as such lack joins.
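To make the nested-hashmap idea and the “lining up” of keys more concrete, here is a minimal Python sketch. It models a super column family as plain nested dictionaries, which is only an approximation of what a real column family store does on disk. The artist names, album titles, dates and the helper functions `read_column` and `column_view` are all made up for illustration.

```python
from collections import defaultdict

# Super column family: row key -> super column name -> column name -> value.
# Every name and value below is invented for this example.
artists = {
    "artist-1": {
        "profile": {"name": "Artist One", "country": "England"},
        "albums": {"First Album": "1970", "Second Album": "1973"},
    },
    "artist-2": {
        "profile": {"name": "Artist Two", "country": "USA"},
        "albums": {"Debut": "1985"},
    },
    "artist-3": {
        "profile": {"name": "Artist Three", "country": "England"},
        "albums": {},
    },
    "artist-4": {
        "profile": {"name": "Artist Four"},  # sparse row: no "country" column
        "albums": {"Only Album": "2001"},
    },
}


def read_column(table, row_key, super_column, column_name):
    """Row-oriented access: walk the nested hashmap for one entity."""
    return table.get(row_key, {}).get(super_column, {}).get(column_name)


def column_view(table, super_column, column_name):
    """Column-oriented access: group row keys by the value of one column."""
    view = defaultdict(list)
    for row_key, super_columns in table.items():
        columns = super_columns.get(super_column, {})
        if column_name in columns:
            view[columns[column_name]].append(row_key)
    return dict(view)


# "Line up" the country column to find every row where the artist is English,
# then pull the complete row for each matching key.
by_country = column_view(artists, "profile", "country")
english_keys = by_country.get("England", [])
english_artists = [artists[key] for key in english_keys]
```

The sparse row (`artist-4`) simply doesn't show up in the country view, which is the behavior you'd expect from a sparsely populated table. Nothing here joins one artist to another, though; relating rows is still left to the application.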
Document Stores
Put simply, document databases store and retrieve documents just like an electronic filing cabinet. Documents can include maps and lists, allowing for natural hierarchies. In fact, document stores are most familiar to developers who are used to working with hierarchically structured documents.
At the most basic level, documents are stored and retrieved by ID. If an application remembers the IDs it’s most interested in (such as usernames), then a document store acts much like a key-value store (see above).
The document model usually involves having a hierarchical JSON document as the primary data structure, and any field inside of the hierarchy can then be indexed. For example, in the diagram below the embedded sub-documents are part of the larger user document.
Embedded data in a document store. Source: MongoDB Data Model Design documentation.
Because document stores have a data model around disconnected entities, their major advantage is horizontal scaling. However, most document databases require developers to explicitly plan for sharding of data across instances to support this horizontal scale, while key-value stores and column family stores don’t require this extra step.
To see an example of how MongoDB, one of the most popular document stores, integrates with Neo4j, check out the Wanderu case study.
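The document model described above (lookup by ID, plus indexes on fields inside the hierarchy) can be sketched with a small in-memory class. This is not MongoDB's API or any other product's; `TinyDocumentStore`, its method names, the dotted field paths and the sample user documents are all assumptions for the sake of the example.

```python
class TinyDocumentStore:
    """Toy document store: lookup by ID plus indexes on nested fields."""

    def __init__(self):
        self.docs = {}     # document ID -> document (a nested dict)
        self.indexes = {}  # dotted field path -> {value: set of document IDs}

    @staticmethod
    def _lookup(doc, path):
        # Follow a dotted path such as "contact.email" into the hierarchy.
        current = doc
        for part in path.split("."):
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current

    def _index_doc(self, index, path, doc_id, doc):
        value = self._lookup(doc, path)
        if value is not None:
            index.setdefault(value, set()).add(doc_id)

    def create_index(self, path):
        index = {}
        for doc_id, doc in self.docs.items():
            self._index_doc(index, path, doc_id, doc)
        self.indexes[path] = index

    def insert(self, doc_id, doc):
        # Sketch only: overwriting an existing ID does not remove stale
        # index entries, and indexed values must be hashable.
        self.docs[doc_id] = doc
        for path, index in self.indexes.items():
            self._index_doc(index, path, doc_id, doc)

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def find(self, path, value):
        return sorted(self.indexes[path].get(value, set()))


store = TinyDocumentStore()
store.create_index("contact.email")
store.insert("user-1", {
    "username": "alice",
    "contact": {"email": "alice@example.com", "phone": "555-0100"},
    "access": {"level": 5, "group": "dev"},
})
store.insert("user-2", {
    "username": "bob",
    "contact": {"email": "bob@example.com"},
    "access": {"level": 2, "group": "ops"},
})

by_id = store.get("user-1")                                # key-value style
by_email = store.find("contact.email", "bob@example.com")  # indexed field
```

Retrieving `user-1` by ID behaves just like a key-value lookup, while the index on `contact.email` reaches into the embedded sub-document. Each document is still a self-contained aggregate: nothing in the store connects one user to another.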
Query Versus Processing in NoSQL Aggregate Stores
On balance, the similarities between NoSQL aggregate stores are greater than the differences. While each has a different storage strategy, they all share similar characteristics when it comes to querying data.
For simple queries, aggregate stores use indexing, basic document linking or a query language. However, for more complex queries, aggregate stores cannot generate deeper insights simply by examining individual data points. To compensate, an application typically has to identify and extract a subset of data and run it through an external processing infrastructure such as the MapReduce framework (often in the form of Apache Hadoop).
MapReduce is a parallel programming model that splits data and operates on it in parallel before gathering it back together and aggregating it to provide focused information. For example, if we wanted to use MapReduce to count the American artists in a recording artists database, we’d need to extract all artist data and discard the non-American records in the map phase. Then, we’d count the remaining records in the reduce phase.
But even with a lot of machines and a fast network infrastructure, MapReduce can have high latency. So high, in fact, that a development team often needs to introduce new indexes or ad hoc queries in order to focus (and trim) the dataset for better MapReduce speeds.
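The American-artists example can be sketched as a tiny, single-process imitation of the MapReduce pattern. It uses a thread pool from Python's standard library in place of a real cluster, so it only shows the shape of the computation, not Hadoop's API or performance. The sample records and the function names `split`, `map_phase`, `reduce_phase` and `count_americans` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Made-up records standing in for an extracted artist dataset.
artists = [
    {"name": "Artist One", "country": "USA"},
    {"name": "Artist Two", "country": "England"},
    {"name": "Artist Three", "country": "USA"},
    {"name": "Artist Four", "country": "Canada"},
    {"name": "Artist Five", "country": "USA"},
]


def split(records, size):
    """Split the dataset into chunks that can be processed in parallel."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_phase(chunk):
    """Emit a 1 for each American artist and discard everyone else."""
    return [1 for artist in chunk if artist.get("country") == "USA"]


def reduce_phase(left, right):
    """Combine two partial counts."""
    return left + right


def count_americans(records, chunk_size=2):
    chunks = split(records, chunk_size)
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    # Gather the emitted values back together before reducing.
    emitted = [value for part in mapped for value in part]
    return reduce(reduce_phase, emitted, 0)


american_count = count_americans(artists)
```

Even in this toy version, every record has to be read and shipped through the map phase just to answer a single counting question, which hints at why the approach gets slow on large datasets.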
Conclusion
Aggregate stores are good at storing big sets of discrete data, but they do that by sacrificing a data model, language and functionality for handling data relationships.
The spectrum of databases for discrete versus connected data.
If you try to use aggregate stores for interrelated data, it results in a disjointed development experience since you have to add a lot of code to fill in where the underlying aggregate store leaves off. And as the number of hops (or “degree” of the query) increases, aggregate stores slow down significantly.
Graph databases, on the other hand, embrace relationships in order to solve problems that involve context and connectedness. Consequently, they have very different design principles and a different architectural foundation.
Do aggregate stores have their perfect use cases? Certainly. But they aren’t for dealing with problems that require an understanding of how things are connected.
Want to learn more about the world of NoSQL graph technologies? Click below to get your free copy of the O’Reilly Graph Databases ebook and discover how to apply graph solutions to mission-critical problems at your enterprise.
About the Author
Bryce Merkl Sasaki, Editor-in-Chief, Neo4j
Bryce Merkl Sasaki is the Editor-in-Chief at Neo4j. He studied professional and creative writing for undergrad and has been freelancing for 7 years. Recently, he worked at an inbound marketing agency in Philadelphia as a copywriter before moving to California. When not working, he likes to spend his time working on his novel, looking for pickup soccer games and reading voraciously.