Goals This article explores the differences between relational and graph databases and data models. In addition, it explains how to integrate graph databases with relational databases and how to import data from a relational store. Prerequisites You should have looked… Read more →
This article explores the differences between relational and graph databases and data models. In addition, it explains how to integrate graph databases with relational databases and how to import data from a relational store.
You should have looked into the property-graph model and optionally read data modeling with graphs.
Relational databases have been the power-horse of software applications since the 80s, and continue so to this day. They store highly structured data in tables with predetermined columns of certain types and many rows of the same type of information, and, thanks in part to the rigidity of their organization, require developers and applications to strictly structure the data used in their applications.
In relational databases, references to other rows and tables are indicated by referring to their (primary-)key attributes via foreign-key columns. This is enforceable with constraints, but only when the reference is never optional. Joins are computed at query time by matching primary- and foreign-keys of the many (potentially indexed) rows of the to-be-joined tables. These operations are compute- and memory-intensive and have an exponential cost.
If you use many-to-many relationships, you have to introduce a JOIN table (or junction table) that holds foreign keys of both participating tables which further increases join operation costs. Those costly join operations are usually addressed by denormalising data to reduce the number of joins necessary.
Although not every use-case is a good fit for this type of stringent data model, in the past, the lack of viable alternatives and the great support for relational databases has made it difficult for alternative models to break into the mainstream.
Meet graph databases.
From Relational to Graph Databases
Relationships are first-class citizens of the graph data model, unlike other database management systems, which require us to infer connections between entities using special properties such as foreign keys, or out-of-band processing like map-reduce. By assembling the simple abstractions of nodes and relationships into connected structures, graph databases enable us to build sophisticated models that map closely to our problem domain.
In some regards, graph databases are like the next generation of relational databases, but with first class support for “relationships”, or those implicit connections indicated via foreign-keys in the traditional relational databases.
Each node (entity or attribute) in the graph database model directly and physically contains a list of relationship-records that represent its relationships to other nodes. These relationship records are organized by type and direction and may hold additional attributes. Whenever you run the equivalent of a JOIN operation, the database just uses this list and has direct access to the connected nodes, eliminating the need for a expensive search / match computation.
This ability of pre-materializing relationships into database structures allows Neo4j to provide performances of several orders of magnitude, especially for join heavy queries, the minutes to milliseconds advantage that many users leverage.
The resulting data models are much simpler and at the same time more expressive than those produced using traditional relational or other NoSQL databases.
Graph databases support a very flexible and fine-grained data model that allows you to model and manage rich domains in an easy and intuitive way.
You more or less keep the data as it is in the real world: small, normalized, yet richly connected entities. This allows you to query and view your data from any imaginable point of interest, supporting many different use-cases.
The fine-grained model also means that there is no fixed boundary around aggregates, so the scope of update operations is provided by the application during the read or write operation. The well-known and tested concept of transactions groups a set of updates of nodes and relationships into an atomic, consistent, isolated, and durable (ACID) operation. Graph databases like Neo4j fully support the transactional concepts including write-ahead logs and recovery after abnormal termination. So you never lose your data that has been comitted to the database.
If you’re used to modeling with relational databases, remember the ease and beauty of a well done, normalized entity-relationship diagram: a simple, easy to understand model you can quickly whiteboard with your colleagues and domain experts. A graph is exactly that, a clear model of the domain, focused on the use-cases you want to efficiently support.
Let’s take a model of the organizational domain and show how it would be modeled in a relational database vs. the graph database:
Data Model Transformation
Here are some tips that help you with the transformation:
- Each entity table is represented by a label on nodes
- Each row in a entity table is a node
- Columns on those tables become node properties.
- Remove technical primary keys, keep business primary keys
- Add unique constraints for business primary keys, add indexes for frequent lookup attributes
- Replace foreign keys with relationships to the other table, remove them afterwards
- Remove data with default values, no need to store those
- Data in tables that is denormalized and duplicated might have to be pulled out into separate nodes to get a cleaner model.
- Indexed column names, might indicate an array property (like email1, email2, email3)
- Join tables are transformed into relationships, columns on those tables become relationship properties
It is important to have an understanding of the graph model before you start to import data, then it just becomes the task of hydrating that model.
Working with Neo4j
Querying relational databases is easy with SQL; a declarative query language that allows both for easy ad-hoc querying in a database tool as well as specifying use-case related queries in your code. Even object-relational mappers use SQL under the hood to talk to the database.
Do graph databases have something similar? Cypher, Neo4j’s declarative graph query language, is built on the basic concepts and clauses of SQL but has a lot of additional graph-specific functionality to make it simple to work with your rich graph model without being too verbose. It allows you to query and update the graph structures, with concise statements. Cypher is centered around the graph patterns that are core to your use-cases and represents them visually as part of its query syntax.
If you have ever tried to write a SQL statement with a large number of joins, you know that you quickly lose sight of what the query actually does, due to all the technical noise.
In Cypher the syntax stays clean and focused on domain concepts as the structural connections to find or create are expressed visually. The other clauses besides the pattern matching should be very familiar for everyone who has used SQL before.
In the organizational domain depicted in the model above – what would a SQL statement that lists the employees in the “IT Department” look like, and how does that statement compare to a Cypher statement?
Connecting to Neo4j
If you’ve installed and started Neo4j as a server on your system, you can interact with the database with the built-in Neo4j browser application (like sql-plus on steroids).
If you want to access Neo4j programmatically, you would do so with the integrated HTTP API, which allow you to:
- POST one or more Cypher statements with parameters per request to the server
- Keep transactions open over multiple requests
- Choose different result formats
A sample HTTP request that executes Cypher to create a Person would look like this. You can run it directly from the Neo4j browser, here shown with the plain JSON response.
Of course, you don’t want to connect to Neo4j manually, but with a driver or connector library designed for your stack or programing language. Thanks to the Neo4j community, there are drivers for Neo4j for almost all popular programing languages, most of which mimic existing database driver idioms and approaches.
For instance, the Neo4j JDBC driver would be used like this to query the database for Johns departments:
Importing Data from a Relational Database
When you have a good enough understanding of the shape of your graph model, i.e. what data will be represented as nodes or relationships and how the labels, relationship-types, and attributes are named, you’re ready to go.
The easiest way to import data from your relational database is to create a CSV dump of individual entity-tables and join-tables.
Then you can take the CSV file(s) and use Cypher’s
LOAD CSV power tool to:
- Ingest the data, accessing columns by header name or offset
Convert values from strings to different formats and structures (
- Skip rows to be ignored
MATCHexisting nodes based on attribute lookups
MERGEnodes and relationships with labels and attributes from the row data
SETnew labels and properties or
You can import multiple CSV files from one or more data sources to enrich your core domain model with other information that might add interesting insights and capabilities.
Other, dedicated import tools, help you importing larger volumes (10M+ rows) of data efficiently, as described in the csv-import guide.