Graph Databases for Beginners: Why We Need NoSQL Databases

Neo4j Blog Contributor: Bryce Merkl Sasaki

Editor-in-Chief, Neo4j

October 25, 2018

Learn why NoSQL databases are needed to face some of today's biggest data challenges that SQL can't handle.

NoSQL databases are one of those things in life that are unhelpfully defined only by what they are not rather than by what they are, i.e., an anti-definition.

NoSQL is a cheeky acronym for Not Only SQL – or more confrontationally – No to SQL. This anti-definition tells you a lot about why the NoSQL movement began: SQL-based relational databases aren’t always enough.

Relational databases (RDBMS) still have their perfect use cases, and RDBMS often work well alongside NoSQL databases to tap the strengths of both technologies. (This is why Neo4j officially prefers Not only SQL as the definition of NoSQL, because SQL still has its place in any backend.) But it’s still abundantly clear that the relational data model can’t meet every data need.

So, once other data stores – and their accompanying data models – became available, there was (and continues to be) a meteoric rise in the popularity of NoSQL database technologies. Today, we’re going to define NoSQL databases and explain why we need them now more than ever.

In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve tackled why graph technology is the future, why connected data matters, the basics (and pitfalls) of data modeling, why a database query language matters, the differences between imperative and declarative query languages, predictive modeling using graph theory and the basics of graph search algorithms.

This week, we’ll discuss the diverse and sundry world of NoSQL databases – and why they’ve become so popular.

The Many & Motley World of NoSQL Databases

NoSQL databases are a spectrum of data storage technologies that are more different than they are alike, so it’s difficult to make sweeping generalizations about their characteristics.

In the following weeks, we’ll explore a few types of NoSQL databases and other important NoSQL definitions. Our tour will encompass the group collectively known as aggregate stores (highlighted in blue below), including key-value stores, column family stores and document stores, as well as the various types of graph technologies (in green), which include property graphs, hypergraphs and RDF triple stores.

An overview of the NoSQL database space. Quadrants in blue are collectively known as aggregate stores.

Historically, most enterprise-grade web applications ran on top of a relational database (RDBMS). But in the past decade alone, the data landscape has shifted significantly and in a way that traditional RDBMS deployments simply can’t manage.

The NoSQL database movement has emerged particularly in response to four of these data challenges:

Data volume
Data velocity
Data variety
Data valence

We’ll explore each of these challenges in further detail below.

Data Volume

It’s no surprise that as data storage has increased dramatically, data volume (i.e., the size of stored data) has become the principal driver behind the enterprise adoption of NoSQL databases.

Large datasets simply become too unwieldy when stored in relational databases. In particular, query execution times increase as the size of tables and the number of JOINs grow (so-called JOIN pain).

This isn’t always the fault of the relational databases themselves though. Rather, it has to do with the underlying data model.

In order to avoid JOIN pain, the NoSQL world has several alternatives to the relational model. While these NoSQL data models are better at handling today’s larger datasets, most of them are simply not as expressive as the relational model. The only exception is the graph model, which is actually more expressive. (More on that in the weeks to come.)
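
To make JOIN pain a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `person` and `friendship` tables, the sample names and the `friends_at_depth` helper are all hypothetical and exist only for this illustration. The point is the shape of the query: every extra hop through the connections costs one more JOIN.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical person/friendship schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE friendship (
            person_id INTEGER NOT NULL REFERENCES person(id),
            friend_id INTEGER NOT NULL REFERENCES person(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave")],
    )
    # Directed connections: Alice -> Bob -> Carol -> Dave
    conn.executemany("INSERT INTO friendship VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])
    return conn


def friends_at_depth(conn, name, depth):
    """Return the names reachable from `name` in exactly `depth` hops.

    Each additional hop adds one more JOIN against the friendship table.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    joins = ["JOIN friendship f1 ON f1.person_id = p0.id"]
    for i in range(2, depth + 1):
        joins.append(f"JOIN friendship f{i} ON f{i}.person_id = f{i - 1}.friend_id")
    sql = (
        "SELECT DISTINCT target.name FROM person p0 "
        + " ".join(joins)
        + f" JOIN person target ON target.id = f{depth}.friend_id"
        + " WHERE p0.name = ? ORDER BY target.name"
    )
    return [row[0] for row in conn.execute(sql, (name,))]
```

With only four rows this runs instantly. On large tables, though, each added JOIN multiplies the work the database has to do, which is the pattern described above.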

Data Velocity

But volume isn’t the only problem modern enterprise systems have to deal with. Besides being big, today’s data often changes rapidly.

Thus, data velocity (i.e., the rate at which data changes over time) is the next major challenge that NoSQL databases are designed to overcome.

Velocity is rarely a static metric. A lot of velocity measurements depend on the context of both internal and external changes to an application, some of which have considerable system-wide impact.

Coupled with high volume, variations in data velocity require a database to not only handle high levels of edits (tech lingo: write loads), but also deal with surging peaks of database activity. Relational databases simply aren’t prepared to handle sustained high write loads and can crash during peak activity if not properly tuned.

But there’s also another aspect of data velocity NoSQL technology helps us overcome: the rate at which the data structure changes. In other words, it’s not just about the rapid change of specific data points but also the rapid change of the data model itself.

Data structures commonly shift for two major reasons. First is the fast-moving nature of business. As an enterprise changes, so do its data needs.

Second is that data acquisition is often experimental. Sometimes your application captures certain data points just in case you might need them later on. The data that proves valuable to your business usually sticks around, but if it isn’t worthwhile, then those data points often fall by the wayside. Consequently, these experimental additions and eliminations affect your data model on a regular basis.

Both forms of data velocity are problematic for relational databases to handle. Frequent high write loads come with expensive processing costs, and regular data structure changes come with high operational costs (just ask your DBA).

NoSQL databases address both data velocity challenges by optimizing for high write loads and by having more flexible data models.
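
As a rough sketch of the second kind of velocity (a changing data structure), compare what happens when an application starts capturing a new data point. The `customer` table, the `loyalty_tier` field and the `demo_schema_change` function are made up for this example, and plain Python dicts stand in for a schema-flexible store rather than any particular NoSQL product's API.

```python
import sqlite3


def demo_schema_change():
    """Add a new field in a relational table and in schema-flexible records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customer (name) VALUES ('Alice')")

    # Relational: the shared table definition must change first,
    # and every existing row gains the new column (as NULL).
    conn.execute("ALTER TABLE customer ADD COLUMN loyalty_tier TEXT")
    conn.execute("INSERT INTO customer (name, loyalty_tier) VALUES ('Bob', 'gold')")
    relational_rows = conn.execute(
        "SELECT id, name, loyalty_tier FROM customer ORDER BY id"
    ).fetchall()

    # Schema-flexible: a new record simply carries the extra field;
    # existing records are left untouched.
    documents = [{"id": 1, "name": "Alice"}]
    documents.append({"id": 2, "name": "Bob", "loyalty_tier": "gold"})

    return relational_rows, documents
```

In the relational half, the change touches the whole table. In the dict half, only the records that need the new field know about it, which is also why dropping an experimental data point later is cheap.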

Data Variety

The third challenge in today’s data landscape is data variety – that is, data can be dense or sparse, connected or disconnected, regularly or irregularly structured.

Today’s data is far more varied than what relational databases were originally designed for. In fact, that’s why many of today’s RDBMS deployments have a number of nulls in their tables and null checks in their code – it’s all a workaround to adjust to today’s data variety.

On the other hand, NoSQL databases are designed from the bottom up to adjust for a wide diversity of data and flexibly address future data needs, each adopting its own strategy for handling the variety of data.
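
Here is a small, hedged illustration of where those nulls come from. The catalog below is entirely hypothetical (the column names and values are invented): books and shoes are forced into one wide table, so each row leaves the other kind's columns empty. Converting the rows to per-record dicts shows the alternative of storing only the fields a record actually has.

```python
# Hypothetical wide table holding two very different kinds of items.
COLUMNS = ("id", "kind", "title", "author", "isbn", "shoe_size", "color")

WIDE_ROWS = [
    (1, "book", "Example Book", "A. Author", "isbn-0001", None, None),
    (2, "shoe", None, None, None, 42, "red"),
]


def null_fraction(rows):
    """Return the fraction of cells that are None across all rows."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


def to_documents(rows, columns=COLUMNS):
    """Drop None cells so each record keeps only the fields it really has."""
    return [
        {column: value for column, value in zip(columns, row) if value is not None}
        for row in rows
    ]
```

Every `None` in `WIDE_ROWS` is a cell that application code has to null-check. The records returned by `to_documents` have no such placeholders, at the cost of a less uniform structure.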

Data Valence

Whenever you talk about data, there are always a lot of “V”s thrown around (I’ve chosen three above, but there’s like a million to choose from). But there’s almost always one powerful “V” missing: data valence.

The Latin root of valence is the same as value, valere, which means to be strong, powerful, influential or healthy.

In chemistry, valence is the combining power of an element; in psychology, it is the intrinsic attractiveness of an object; and in linguistics, it’s the number of elements a word combines with. In the context of big data, valence is the tendency of individual data to connect as well as the overall connectedness of datasets.

The valence of a dataset is measured as the ratio of connections to the total number of possible connections. The more connections within your dataset, the higher its valence.
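
The ratio above can be sketched in a few lines of Python. This version assumes the simplest case, which the post itself does not specify: connections are undirected, there is at most one connection per pair of items, and nothing connects to itself. Under those assumptions, n items have n(n - 1)/2 possible connections. The function name `valence` is just a label for this example.

```python
def valence(nodes, edges):
    """Ratio of actual connections to possible connections.

    Assumes an undirected graph with no self-loops and no duplicate edges,
    so n nodes allow n * (n - 1) / 2 possible connections.
    """
    n = len(set(nodes))
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    # frozenset makes ("a", "b") and ("b", "a") count as the same connection.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique_edges) / possible
```

For example, four items joined in a simple chain have three of six possible connections, while three items that are all connected to each other use every possible connection.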

Data valence increases over time but not uniformly. Network scientists (i.e., super nerds) have described preferential attachment (for example, the rich get richer) as leading to power-law distributions and scale-free networks with hub and spoke structures. Literally nothing in that previous sentence can be analyzed using a relational database.
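
For the curious, preferential attachment is easy to simulate. The sketch below is one simple variant, chosen for brevity rather than taken from the ebook or any specific paper: each new node links to exactly one existing node, picked with probability proportional to how many connections that node already has. The function names and the fixed random seed are assumptions of this example.

```python
import random
from collections import Counter


def grow_network(num_nodes, seed=0):
    """Grow a network one node at a time using preferential attachment.

    Each new node connects to one existing node, chosen with probability
    proportional to that node's current number of connections.
    """
    if num_nodes < 2:
        raise ValueError("num_nodes must be at least 2")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per connection it has, so a
    # uniform choice from the list is a degree-weighted choice of node.
    endpoints = [0, 1]
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend((new_node, target))
    return edges


def degree_counts(edges):
    """Count how many connections each node has."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees
```

Running `degree_counts(grow_network(1000)).most_common(5)` lists the best-connected nodes. In a run like this, a handful of early nodes typically collect far more connections than the rest, which is the hub-and-spoke pattern described above.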

Over time, highly dense and lumpy data networks tend to develop, in effect growing both your big data and its complexity. This is significant because densely yet unevenly connected data is difficult to unpack and explore with traditional analytics (such as those based on RDBMS data stores). Thus, the need for NoSQL technologies where relational databases aren’t enough.

(If you’re interested in learning more about data valence in particular, check out this ebook by Amy Hodler and Mark Needham, portions of which were used in this blog post.)

Conclusion

Relational databases on their own can no longer handle the challenges posed by today’s data volume, velocity, variety or valence. Yet, understanding how NoSQL databases overcome these challenges is only the prelude to finding the right database for your enterprise use case.

In the coming weeks, we’ll explore the strengths and weaknesses of various NoSQL technologies so you can make the most informed decision possible.

Now that you’ve learned about NoSQL in general, it’s time to look closer at graph technology in particular: Get your copy of the O’Reilly Graph Databases book and start using graph technology to solve real-world problems.

Catch up with the rest of the Graph Databases for Beginners series:

Why Graph Technology Is the Future

Why Connected Data Matters

The Basics of Data Modeling

Data Modeling Pitfalls to Avoid

Why a Database Query Language Matters (More Than You Think)

Imperative vs. Declarative Query Languages: What’s the Difference?

Graph Theory & Predictive Modeling

Graph Search Algorithm Basics

ACID vs. BASE Explained

A Tour of Aggregate Stores

Other Graph Data Technologies

Native vs. Non-Native Graph Technology