Will It Graph? Identifying a Good Fit for Graph Databases  –  Part 1


Graph Databases vs. Relational Databases: What’s the Difference?


How do graph databases work under the hood and how is this different from relational databases?

In this Will It Graph blog series, we’re going to explore the differences between native graph databases and relational databases and help you understand how to identify the right database system for your application. (Or listen to the GraphStuff.FM podcast!)

Relational Databases: Normalization and JOIN

A lot of developers are familiar with the traditional relational database, where data is stored in tables within a well-defined schema. Each row in the table is a discrete entity of data. One of these elements in the row is typically used to define its uniqueness: the primary key. It could be a unique ID or maybe something like a social security number for a person.

We then go through a process called normalization to reduce data repetition. In normalization, we’re moving references, something like an address for a person, into another table. So we get a reference from the row representing the entity to the row representing the address for that person.

If, for example, somebody changes their address, you wouldn’t want multiple versions of that person’s addresses everywhere and have to try and remember all the different instances of where that person’s addresses exist. Normalization makes sure you have one version of the data, so you can make the updates in one place.

Then when we query, we want to reconstitute this normalized data. We do what’s called a JOIN operation. In our main entity row, we have the primary key that identifies the ID for the entity, let’s say the person. We also have what’s called a foreign key that represents a row in our address table. We join the two tables through their primary and foreign keys, and use that to look up the address in the address table. This is called a JOIN and these JOINs are done at query time and at read time.

When we’re doing a JOIN in a relational database, it’s a set comparison operation where we’re looking to see where our two sets of data overlap (in this case, the sets are the person table and the address table). At a high level, that’s how traditional relational databases work.

Native Graph Databases: Connections and Index-Free Adjacency

Let’s have a quick peek at a native graph database and how it works. We spoke about the discrete entity in a relational database being a row within a table. In a native graph database, that row would be the equivalent of a node. It’s still a discrete entity, so we still have this element of normalization.

A node would be an entity. If we were having person nodes, we would have one node for one person. And we would have some degree of uniqueness in it, let’s say the social security number. The key difference, however, is when we are connecting this person node to another discrete entity, for example, an address, we create a physical connection (aka relationship) between those two points.

The address would have a pointer that says, what is the outbound part of the relationship that connects to the node? We then have another pointer for the inbound part of the relationship pointing to the other node. So, effectively, we’re collecting a set of pointers, and this is a manifestation of the physical connection between those two entities. That is a big difference.

In a relational database, how you would reconstitute the data is it joins on read, which means at query time, it would go off to try and figure out how things map together.

In a graph database, since we already know these two elements are connected, we don’t need to look up the mapping at query time. All we’re doing is follow the stored relationships to the other nodes. This is something we call index-free adjacency.

This concept of index-free adjacency is key to understanding the performance optimizations of a native graph database compared to other database systems.

Index-free adjacency means that during a local graph traversal, following these pointers (relationships) that connect the nodes in my graph, the performance of the operation is not dependent on the overall size of the graph. It depends on the number of relationships connected to the nodes that you’re traversing.

When we talk of a JOIN being a set operation (intersection), we’re using an index in a relational database to see where those two sets overlap. This means that the performance of the JOIN operation starts to slow down as the tables get bigger. In big O notation terms, this is something like logarithmic growth using an index — something like O(log n) and also grows exponentially with the number of JOINs in your query.

Whereas traversing relationships in the graph is more of linear growth based on the number of relationships in the nodes that we’re actually traversing, not the overall size of the graph.

This is the fundamental query time optimization that graph databases make that give us index-free adjacency. From a performance perspective, that is really the most important thing to think about when we think of a native graph database.

What does this really mean? There’s a number of really powerful things in this paradigm shift of how we’re storing the data. The key one here is that we don’t need to hypothesize about how connections are being made between discrete entities. Either they exist or they don’t exist.

This is really powerful because we don’t have to try and predict the number of different JOINs or traversals we’d need to make between different elements, which we would have to do in a traditional relational database system.

This gives us a number of interesting opportunities. One of them is it makes it a lot easier for us to query patterns.

  • We could be looking for a specific start point, such as looking for a specific person based on a specific social security number.
  • But what if we’re looking at patterns such as a person node being connected to an address that is connected to a series of other nodes? We can declare a pattern and just search for that pattern. That’s a lot easier in this setup.

If we’re doing something that would be very JOIN-heavy in a traditional database system, it is a lot faster in a graph database.

But at What Cost?

That sounds good, but obviously there has to be some trade-off, right? We don’t get anything for free in the world of computing. Building a database is really an exercise in optimizing for different trade-offs. So it’s important to understand the trade-offs that you’re making when you’re optimizing for this idea of graph-native index-free adjacency performance.

Read vs. Write

One of those trade-offs is the optimizations for read time versus write time. Think of a relationship (pointer, connection, edge) in the property graph model as a first-class citizen. They’re explicitly part of the data model. When we compare it to a relational database, a relationship is like a materialized JOIN.

In a relational database, we materialize JOINs on read at query time, which means we’re taking that performance hit to see where the JOINs are with each query.

Whereas in a graph database, we’re just basically following pointers, going to offsets in memory to find the other nodes that are connected as we traverse the graph.

This means that, at write time, we have to materialize and store these relationships. So in a graph database, we may have different write performance than we might expect in a relational database where we’re not materializing these JOINs at write time.

Why Haven’t Graph Databases Been Around for Forever?

Graph databases are a relatively new thing. The reason why graph databases have very performant systems that can hold billions or even trillions of nodes and relationships and run at high speed is due to the advancement in hardware.

Graph databases like to have memory, and RAM is cheap and plentiful these days. This allows us to tap into the real power of graph databases.

Conclusion

Now you know how relational databases and graph databases work differently. So if your application is JOIN-heavy in a traditional relational database and suffering from SQL strain, perhaps it’s time to consider switching to graph databases.

In the next part of the blog series, we’ll show you how to identify graph-shaped problems and some examples of the good and poor fits for graph databases.

Will It Graph? Identifying A Good Fit For Graph Databases | GraphStuff.FM: The Neo4j Graph Database Developer Podcast



Watch this demo to see how the biggest graph database in the world with more than a trillion relationships performs.

Watch Now

Will It Graph? Identifying a Good Fit for Graph Databases — Part 1 was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.