Graph Data Science

Graph Databases for Beginners: Why Connected Data Matters

Bryce Merkl Sasaki

Editor-in-Chief, Neo4j

July 17, 2018

6 min read

Learn the critical business value of data relationships and why they matter more than discrete data

“It’s not what you know, it’s who you know.”

Sound familiar?

Depressingly, we all know it’s true: The boss’s son has a better shot at that corner office than you do. That’s because in business – and in life – relationships matter a whole lot more than individual skills or competencies.

Here’s the kicker: Why don’t we think about our data in the same way?

Even as the volume of discrete, individual data points increases (and will continue to do so), the real value – the real bottom-line, business-defining ROI – comes from the connections between the data that’s collected.

Got 2 million Facebook likes? Great. Got 2000 committed shoppers to your ecommerce site? Super. But are you connecting who likes your Facebook page with who’s most likely to shop next? Are you drawing the relationships (or even able to draw the relationships) between Facebook promotions and purchased items from your most loyal shoppers?

2k, 2M or 2B – the number doesn’t matter if you’re not making the connection.

We live in an ever-more-connected world, and these critical data relationships (a.k.a. connected data), will only increase in years to come. If your team is going to succeed, you’ve got to leverage those data connections for all they’re worth – but you’ll need the right technology.

With so many systems built on relational databases or aggregate NoSQL stores, you may not know of a third option that outperforms them both: graph databases.

In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. Last week, we tackled why graph technology is the future.

This week, we’ll discuss why data relationships matter – and how that realization affects your next choice of database.

The Irony of Relational Databases

Relational databases (RDBMS) were originally designed to codify paper forms and tabular structures, and they still do this exceedingly well. (There’s a reason they continue to dominate the database market for that use case.)

Ironically, however, relational databases aren’t effective at handling data relationships, especially when those relationships are added or adjusted on an ad hoc basis.

The greatest weakness of relational databases is that their model is too inflexible. Your business needs are constantly changing and evolving, but the schema of a relational database can’t efficiently keep up with those dynamic and uncertain variables.

To compensate, your development team can try to leave certain columns empty (tech lingo: nullable), but this approach requires more code to handle the greater number of exceptions in your data.

But the relational data model isn’t the only challenge: Performance matters too. Even worse, as your data multiplies in complexity and diversity (and it always will), your relational database becomes burdened with large JOIN tables which disrupt performance and hinder further development.

JOINs aren’t too bad if you’re only making two to three hops across tables, but once you start to make multiple hops (don’t think four, but fourteen or forty), then your RDBMS is doomed. In fact, the results may never fully calculate.

Unfortunately, your end-users can’t wait for never. They’re probably going to click or swipe away after more than two seconds (a lot less than never). Your database needs to meet – or exceed – their expectations. For connected data, an RDBMS isn’t up to the task.

Consider the example relational database model below.

An example relational database (RDBMS) data model

An example relational database model where some queries are inefficient-yet-doable (e.g., “What items did a customer buy?”) and other queries are prohibitively slow (e.g., “Which customers bought this product?”).

In order to discover what products a customer bought, your developers would need to write several JOIN tables which significantly slow the performance of the application.

Furthermore, asking a reciprocal question like, “Which customers bought this product?” or “Which customers buying this product also bought that product?” becomes prohibitively expensive. Yet, questions like these are essential if you want to build a proper recommendation engine for your transactional application.

At a certain point, your business needs will entirely outgrow this current database schema. The problem, however, is that migrating your data to a new RDBMS schema becomes incredibly effort-intensive.

Now imagine your business 10, 20 or 50 years from now. How often do you imagine your data model will need to evolve in order to match the changing needs of your business? (Hint: A lot.)

Why NoSQL Databases Don’t Fix the Problem Either

NoSQL (or Not only SQL) databases store sets of disconnected documents, values and columns, which in some ways gives them a performance advantage over relational databases. However, their disconnected construction makes it harder to harness connected data properly.

Some developers add data relationships to NoSQL databases by embedding aggregate identifying information inside the field of another aggregate (tech lingo: they use foreign keys). But joining aggregates at the application level later becomes just as prohibitively expensive as in a relational database (i.e., there’s no free lunch).

These foreign keys have another weak point too: they only “point” in one direction, making reciprocal queries too time-consuming to run. If you can’t imagine a scenario where you would never want to know a reciprocal query, then you’re not thinking big enough.

Developers usually work around this reciprocal-query problem by inserting backward-pointing relationships or by exporting the dataset to an external compute structure, like Hadoop, and computing the result with brute force. Either way, the results are slow and latent.

Graph Technology Puts Data Relationships at the Center

When you want a cohesive picture of your big data, including the connections between elements, you need a graph database. In contrast to relational and NoSQL databases, graph databases store data relationships as relationships. This explicit storage of relationship data means fewer disconnects between your evolving schema and your actual database.

In fact, the flexibility of a graph data model allows you to add new nodes and relationships without compromising your existing network or expensively migrating your data. All of your original data (and its original relationships) remain intact.

With data relationships at their center, graph databases are incredibly efficient when it comes to query speeds, even for deep and complex queries. In their book Neo4j in Action, Partner and Vukotic performed an experiment between a relational database and a graph database (Neo4j!).

Their experiment used a basic social network to find friends-of-friends connections to a depth of five degrees. Their dataset included 1,000,000 people each with approximately 50 friends. The results of their experiment are listed in the table below.

A performance experiment run between relational databases (RDBMS) and the Neo4j graph database

A performance experiment run between relational databases (RDBMS) and Neo4j shows that graph databases handle data relationships with high efficiency.

At the friends-of-friends level (depth two), both the relational database and graph database performed adequately. However, as the depth of connectedness increased, the performance of the graph database quickly outstripped that of the relational database. It turns out data relationships are vitally important.

This comparison isn’t to say NoSQL stores or relational databases don’t have a role to play (they certainly do), but they fall short when it comes to connected data relationships. Graph technology, however, is extremely effective at handling connected data.

And for mission-critical insights and nimble business agility, connected data matters.

Want to dive deeper into the world of graph database technology? Click below to get your free copy of the O’Reilly Graph Databases book and learn how to apply graph thinking to your biggest connected data challenges.

Get My Copy of the Book

Catch up with the rest of the Graph Databases for Beginners series:

Why Graph Technology Is the Future

The Basics of Data Modeling

Data Modeling Pitfalls to Avoid

Why a Database Query Language Matters

Imperative vs. Declarative Query Languages: What’s the Difference?

Graph Theory & Predictive Modeling

Graph Search Algorithm Basics

Why We Need NoSQL Databases

ACID vs. BASE Explained

A Tour of Aggregate Stores

Other Graph Data Technologies

Native vs. Non-Native Graph Technology