The heart behind the phrase is that if you’re trying to be good at everything, you normally end up being sort of mediocre at a large number of things and not being good at any one thing in particular. Software, technology and – you guessed it – graph databases are no exception.
Databases serve all kinds of functions: batch and transactional workloads, memory access and disk access, SQL and XML access, and graph and document data storage.
When building a database management system (DBMS), development teams must decide early on what cases to optimize for, which will dictate how well the DBMS will handle the tasks it is dealt (i.e., what the DBMS will be amazing at, what it will be ok at and what it may not do so well).
As a result, the graph database world is populated with both technology designed to be “graph first,” known as native graph technology, and databases where graphs are a bolted-on afterthought, classified as non-native graph technology.
There’s a considerable difference when it comes to the native architecture of both graph storage and processing. Unsurprisingly, native technologies tend to perform queries faster, scale bigger (retaining their hallmark query speed as the dataset grows in size) and run more efficiently, calling for much less hardware. Non-native graphs, not so much.
It’s critical to understand the differences – especially if you’re about to consider purchasing a new database license.
In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve tackled why graph technology is the future, why connected data matters, the basics (and pitfalls) of data modeling, why a database query language matters, the differences between imperative and declarative query languages, predictive modeling using graph theory, the basics of graph search algorithms, why we need NoSQL databases, the differences between ACID and BASE consistency models, a (brief) tour of aggregate stores and a survey of other graph technologies.
This week, we’ll discuss the key characteristics that distinguish native graph database technology – and why they matter for database performance.
Overview: What “Graph First” Means in Native Graph Technology
There are two main elements that distinguish native graph technology: storage and processing.
Graph storage commonly refers to the underlying structure of the database that contains graph data. When built specifically for storing graph-like data, it is known as native graph storage. Graph databases with native graph storage are optimized for graphs in every aspect, ensuring that data is stored efficiently by writing nodes and relationships close to each other.
Graph storage is classified as non-native when the storage comes from an outside source, such as a relational, wide-column or other NoSQL database. These databases store data about nodes and relationships, which may end up far apart in actual storage. This non-native approach leads to slower results because the storage layer is not optimized for graphs.
Native graph processing is another key element of graph technology, referring to how a graph database processes database operations, including both storage and queries. Index-free adjacency is the key differentiator of native graph processing.
At write time, index-free adjacency speeds up processing by ensuring that each node is stored with direct references to its adjacent nodes and relationships. Then, during query processing (i.e., read time), index-free adjacency ensures lightning-fast retrieval without a heavy reliance on indexes. Non-native graph processing often uses a large number of indexes in order to complete a read or write transaction, significantly slowing down the operation.
Another important consideration is ACID writes. Connected data has an uncommonly strict need for data integrity, beyond that of other NoSQL models. In order to store a connection between two things, we must not only write a relationship record but update the node at each end of the relationship as well. If any one of these three write operations fails, it will result in a corrupted graph (literally, the worst).
The only way to ensure that graphs aren’t corrupted over time is to carry out writes as fully ACID compliant transactions. Systems with native graph processing include the proper internal guard rails to ensure that data quality remains impervious to network blips, server failures, competing transactions and the like.
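As a minimal sketch of why those three writes must succeed or fail together, the Python below models a toy in-memory graph. The `TinyGraph` class, its `connect` method and the `fail_at` switch are hypothetical names invented for illustration, and the snapshot-and-restore rollback is only a stand-in for a real transaction mechanism, not how any particular database implements ACID.

```python
import copy


class TinyGraph:
    """Toy in-memory graph (illustrative only, not a real database API)."""

    def __init__(self):
        self.nodes = {}          # node id -> list of attached relationship ids
        self.relationships = {}  # relationship id -> (start, type, end)

    def add_node(self, node_id):
        self.nodes[node_id] = []

    def connect(self, rel_id, start, rel_type, end, fail_at=None):
        """Write the relationship record and update both end nodes, all or nothing.

        fail_at is a hypothetical test hook that simulates a crash before
        write 2 or write 3.
        """
        snapshot = (copy.deepcopy(self.nodes), copy.deepcopy(self.relationships))
        try:
            self.relationships[rel_id] = (start, rel_type, end)  # write 1
            if fail_at == 2:
                raise OSError("simulated failure before updating the start node")
            self.nodes[start].append(rel_id)                     # write 2
            if fail_at == 3:
                raise OSError("simulated failure before updating the end node")
            self.nodes[end].append(rel_id)                       # write 3
        except Exception:
            # Roll back so no half-written relationship survives.
            self.nodes, self.relationships = snapshot
            raise


graph = TinyGraph()
graph.add_node("alice")
graph.add_node("bob")
graph.connect("r1", "alice", "KNOWS", "bob")

try:
    graph.connect("r2", "bob", "KNOWS", "alice", fail_at=3)
except OSError:
    pass  # r2 was rolled back; the graph still contains only r1
```

Without the rollback, the failed call would leave `r2` recorded in the relationship store and on one node but not the other, which is exactly the kind of corruption described above.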
Let’s take a closer look at why native graph storage and native graph processing are so critical.
Native Graph Storage
What makes graph storage distinctively native is the architecture of the graph database from the ground up. Graph databases with native graph storage have underlying storage designed specifically for the storage and management of graphs. They are designed to maximize the speed of traversals during arbitrary graph algorithms.
For example, let’s take a look at the way Neo4j – a native graph database – is structured for native graph storage. Every layer of this architecture – from the Cypher query language to the files on disk – is optimized for storing graph data, and not a single part is bolted on, Frankenstein-style, from other non-graph technologies.
Graph data is kept in store files, each of which contains data for a specific part of the graph, such as nodes, relationships, labels and properties. Dividing the storage in this way facilitates highly performant graph traversals (as detailed above).
In a native graph database, a node record’s main purpose is to simply point to lists of relationships, labels and properties, making it lightweight.
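To illustrate what a lightweight, pointer-only node record could look like, here is a small Python sketch of a store made of fixed-size records. The field layout (an in-use flag plus the ids of the first relationship, first property and a label), the 13-byte record size and the function names are all assumptions made up for this example; they are not Neo4j's actual on-disk format.

```python
import struct

# Hypothetical fixed-size node record:
#   in-use flag, first relationship id, first property id, label id.
# "<" means little-endian with no padding, so the record is 1 + 4 + 4 + 4 bytes.
NODE_RECORD = struct.Struct("<?iii")
NO_ID = -1  # sentinel meaning "points to nothing"


def write_node(store, node_id, first_rel=NO_ID, first_prop=NO_ID, label=NO_ID):
    """Write a node record at the offset computed directly from its id."""
    offset = node_id * NODE_RECORD.size
    needed = offset + NODE_RECORD.size
    if len(store) < needed:
        store.extend(bytes(needed - len(store)))  # grow the store with zeros
    NODE_RECORD.pack_into(store, offset, True, first_rel, first_prop, label)


def read_node(store, node_id):
    """Read a node record back; no index is consulted, only id * record size."""
    in_use, first_rel, first_prop, label = NODE_RECORD.unpack_from(
        store, node_id * NODE_RECORD.size
    )
    return {
        "in_use": in_use,
        "first_rel": first_rel,
        "first_prop": first_prop,
        "label": label,
    }


node_store = bytearray()
write_node(node_store, 2, first_rel=7, first_prop=3, label=1)
record = read_node(node_store, 2)
```

Because every record has the same size, finding node 2 is simple arithmetic on its id, and the record itself holds nothing but pointers into the other stores.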
So, what makes non-native graph storage different from storage in a native graph database?
Non-native graph storage uses a relational database, a columnar database or some other general-purpose data store rather than being specifically engineered for the uniqueness of graph data. While the typical operations team might be more familiar with a non-graph backend (like MySQL or Cassandra), the disconnect between graph data with non-graph storage results in a number of performance and scalability concerns.
Non-native graph databases are not optimized for storing graphs, so the algorithms utilized for writing data may store nodes and relationships all over the place. This causes performance problems at the time of retrieval because all these nodes and relationships then have to be reassembled for every single query. In a 24×7 production scenario, that could be thousands of queries a minute.
On the other hand, native graph storage is built to handle highly interconnected datasets from the ground up and is therefore the most efficient when it comes to the storage and retrieval of graph data.
Native Graph Processing
A graph database has native processing capabilities if it uses index-free adjacency. This means that each node directly references its adjacent nodes, acting as a micro-index for all nearby nodes. Index-free adjacency is cheaper and more efficient than doing the same task with indexes, because query times are proportional to the amount of the graph searched, rather than increasing with the overall size of the data.
That last sentence may have made your eyes glaze over, but I can’t emphasize it enough: Without index-free adjacency, a large graph dataset will be crushed under its own weight because queries will take longer and longer as the dataset grows. On the flip side, the time a native graph query takes depends only on the part of the graph it touches, so it stays steady no matter the total size of your data.
Since graph databases store relationship data as first-class entities, relationships are easier to traverse in any direction with native graph processing. With processing that is specifically built for graph datasets, relationships – rather than over-reliance on indexes – are used to maximize the efficiency of traversals.
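To make the contrast concrete, the sketch below runs the same friends-of-friends query two ways in Python. The `Node` class and both function names are hypothetical, and the sorted edge list searched with `bisect` is just one simple stand-in for a global index; real non-native systems use their own index structures.

```python
from bisect import bisect_left


class Node:
    """A node that holds direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []


def friends_of_friends_native(start):
    """Follow direct references two hops out.

    The work done depends only on the size of the neighborhood visited.
    """
    found = set()
    for friend in start.neighbors:
        for fof in friend.neighbors:
            if fof is not start:
                found.add(fof.name)
    return found


def neighbors_via_index(sorted_edges, name):
    """Look up neighbors by binary-searching a global sorted (source, target) list.

    Every hop pays a search whose cost grows with the total number of edges.
    """
    i = bisect_left(sorted_edges, (name, ""))
    result = []
    while i < len(sorted_edges) and sorted_edges[i][0] == name:
        result.append(sorted_edges[i][1])
        i += 1
    return result


def friends_of_friends_indexed(sorted_edges, start_name):
    found = set()
    for friend in neighbors_via_index(sorted_edges, start_name):
        for fof in neighbors_via_index(sorted_edges, friend):
            if fof != start_name:
                found.add(fof)
    return found


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.neighbors.append(bob)
bob.neighbors.append(carol)
edges = sorted([("alice", "bob"), ("bob", "carol")])

native_result = friends_of_friends_native(alice)
indexed_result = friends_of_friends_indexed(edges, "alice")
```

Both versions return the same answer. The difference is that the indexed version repeats a global lookup at every hop, while the native version only ever follows the references already sitting on each node.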
See the image below of a basic social network where queries are performed on the natively processed relationship data (who is connected to whom?), without the need of further index lookups.
On the other hand, non-native graph databases use many types of indexes to link nodes together. This method is more costly, as the indexes add another layer to each read and write, which slows processing considerably.
Queries with more than one layer of connection (i.e., the very type of query you’d want or need from a graph database) further reduce traversal performance with non-native graph processing.
The image below illustrates an example of a non-native graph query looking up just one layer of connection – imagine how much the processing time multiplies as you query across more hops.
In addition, reversing the direction of a traversal is extremely difficult with non-native graph processing.
To reverse a query’s direction, you must either create a costly reverse-lookup index for each traversal, or perform a brute-force search through the original index. Both of these workarounds will get you the result you’re looking for – eventually – but they defeat the purpose of using a graph database to begin with: to efficiently query relationships in your data.
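The two workarounds, and the native alternative, can be sketched in a few lines of Python. All names here (`build_forward_index`, `followers_brute_force`, `build_reverse_index`, `Person`, `follow`) are hypothetical, and plain dictionaries stand in for whatever index a real non-native system would use.

```python
from collections import defaultdict


def build_forward_index(edges):
    """Forward-only index: source -> list of targets."""
    index = defaultdict(list)
    for source, target in edges:
        index[source].append(target)
    return index


def followers_brute_force(forward_index, person):
    """Workaround 1: scan every entry of the forward index to go backwards."""
    return sorted(
        source for source, targets in forward_index.items() if person in targets
    )


def build_reverse_index(edges):
    """Workaround 2: a second index that must be kept in sync on every write."""
    index = defaultdict(list)
    for source, target in edges:
        index[target].append(source)
    return index


class Person:
    """Native-style node: each relationship is reachable from both ends."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []
        self.incoming = []


def follow(source, target):
    source.outgoing.append(target)
    target.incoming.append(source)


edges = [("alice", "carol"), ("bob", "carol"), ("carol", "alice")]
forward = build_forward_index(edges)
reverse = build_reverse_index(edges)

people = {name: Person(name) for name in ("alice", "bob", "carol")}
for source, target in edges:
    follow(people[source], people[target])

scan_result = followers_brute_force(forward, "carol")
reverse_result = sorted(reverse["carol"])
native_result = sorted(p.name for p in people["carol"].incoming)
```

All three approaches find Carol's followers, but the scan touches the whole index, the reverse index doubles the bookkeeping on every write, and the native node simply reads the incoming references it already holds.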
The Bottom Line: Why Native vs. Non-Native Matters
When deciding between a native and a non-native graph database, it is important to understand the tradeoffs of working with each.
Non-native graph technology most likely has a persistence layer that your development team is already familiar with (such as Cassandra, MySQL or another relational database), and when your dataset is small or less connected, choosing non-native graph technology isn’t likely to significantly affect the performance of your application.
Ask yourself though, is your data likely to remain small and less connected over time? Probably not.
Datasets tend to grow over time, and today’s datasets are more unstructured, interconnected and interrelated than ever before. Even if your dataset is small to begin with, it’s important to plan for the future if your data is likely to grow alongside your business. In this case, a native graph database serves you better over the long term, since the performance of non-native graph processing degrades badly under larger datasets.
One of the biggest drivers behind moving to a native graph architecture is that it scales. As you add more data to the database, many queries that would slow with size in a non-native graph database remain lithe and speedy in a native context.
Native graph scaling takes advantage of a large number of optimizations in storage and processing to yield a highly efficient approach, whereas non-native uses brute force to solve the problem, requiring more hardware (usually two to four times the amount of hardware or more) and resulting in higher latencies, especially for larger graphs.
Not all applications require low latency or processing efficiency, and in those use cases, a non-native graph database might just do the job. (Really though, why are you using a graph database for these sorts of tasks, then?)
But if your application requires storing, querying and traversing large interconnected datasets in real time for a 24×7, always-on, mission-critical application, then you need a database architecture specifically designed for handling graph data at scale.
The bottom line: The importance of native vs. non-native graph technology depends on the particular needs of your application, but for enterprises hoping to leverage the connections in their data, native graph database technology is critical for success.
This is the end of our Graph Databases for Beginners blog series. We hope you enjoyed it, and you can catch up on any of the posts you missed by exploring the links below. You’re officially no longer a beginner!
Get your copy of the O’Reilly Graph Databases book and explore the endless possibilities of graph technology.
Catch up with the rest of the Graph Databases for Beginners series:
- Why Graph Technology Is the Future
- Why Connected Data Matters
- The Basics of Data Modeling
- Data Modeling Pitfalls to Avoid
- Why a Database Query Language Matters (More Than You Think)
- Imperative vs. Declarative Query Languages: What’s the Difference?
- Graph Theory & Predictive Modeling
- Graph Search Algorithm Basics
- Why We Need NoSQL Databases
- ACID vs. BASE Explained
- A (Brief) Tour of Aggregate Stores
- Other Graph Technologies
About the Author
Joy Chao, Community Graphista
Joy Chao is a graphista in the Neo4j community.
She’s currently the Director of Marketing for the Los Angeles Gladiators. In the past, she’s served on a staff wellness program with Pepperdine Human Resources and interned with the Microenterprise Program, an entrepreneurial program for the formerly homeless.
Joy loves learning and new experiences. Her personal projects include achieving a handstand and beginning the art of quilling.