Graph Databases for Beginners: Why a Database Query Language Matters (More Than You Think)


Languages (the natural, human kind) shape how you view the world.

From color to time to gender relations, there’s no escaping how language limits (or expands) your worldview. Words are the categories and labels that we use to process and understand reality – and then to communicate that understanding to others.

So when it comes to analyzing and describing data (a subset of reality), language matters.

Just like their natural counterparts, technical languages shape how you understand and process your data. If a given programming language or graph query language doesn’t have a label, category or approach for a given data problem, you’ll think about the challenge differently, and that in turn changes how your application processes it.

Finding the best database for your application or development stack is about more than just features, scalability and performance. While all of those are essential, there’s another backend element that too many architects overlook: the database query language itself.



In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve tackled why graph technology is the future, why connected data matters, the basics of data modeling and how to avoid the most common (and fatal) data modeling mistakes.

This week, we’ll discuss why a database query language matters – even (especially?) if you’re not a developer.

Why We Need Database Query Languages


Up to this point in our beginner’s series, all of our database models have been in the form of diagrams like the one below.

A graph technology data model of a social network


Graph diagrams like this one are perfect for describing a graph database outside of a technology context. However, when it comes to actually using a database, every developer, architect and business stakeholder needs a concrete mechanism for creating, manipulating and querying data. That is, we need a query language.

To use a natural language example, this is the difference between drawing a map (i.e., the process of data modeling) and everything you do once you’re on the road: asking for turn-by-turn directions, communicating those directions to the driver, pointing out that purple cow on the side of the road, and telling the driver to slam on the brakes before he or she hits the aforementioned purple cow (i.e., the capabilities of a query language).

Most relational databases (RDBMS) use a variant of SQL (Structured Query Language), making SQL the de facto database query language amongst most data professionals. But for the most part, SQL – the query language used by developers and data architects – is too arcane and esoteric to be understood by business teams.

This means a lot of development time is spent translating business requirements into SQL, and if a particular query isn’t possible, that problem has to be translated back to the business decision makers in a way they’ll understand. The result: a lot of wasted time.

But there is a way to eliminate this back and forth translation (and it doesn’t involve teaching your entire business team to be fluent in SQL): Use a language everyone understands.

Just as graph technology has made the data modeling process more understandable for the uninitiated, so has a graph query language made it easier than ever for the common person to understand and create their own queries.

Why Linguistic Efficiency (& Effectiveness) Matters


If you’re not super technical, you might be wondering why the choice of a database query language matters at all. After all, if query languages are anything like natural human languages, then shouldn’t they all be able to ultimately communicate the same point with just a few differences in phrasing?

The answer is both yes and no.

Query Language Efficiency

Let’s first look at small-scale language efficiency with a few natural language examples.

In English, you might say, “I used to enjoy after-dinner conversation” while reminiscing about your childhood. In Spanish, this same phrase is written as, “Disfrutaba sobremesa.” Both languages express the same idea, but one is far more efficient at communicating it.

Similarly, in English you might want to express, “I love my younger sister as well as my grandmother on my father’s side” (14 words, 70 characters). But in Mandarin Chinese, you could just say, “我爱我的妹妹和奶奶” (six words, nine characters).

When it comes to a database query language, the linguistics of efficiency are similar. A single query in SQL can be many lines longer than the same query in a graph database query language like Cypher. Don’t just take my word for it: Make sure you click that link above and explore the example – it’s just too long to wholly repeat here. (Seriously, I will wait.)

Another aspect of language efficiency to consider: Lengthy queries often take more time to write (and sometimes to run), and their complexity makes human coding mistakes more likely. Shorter queries, by contrast, are easier for your team of developers to understand and maintain.

For example, imagine if a new developer had to pick through a long, complicated SQL query and try to figure out the intent of the original developer – trouble would certainly ensue.

But what level of efficiency gains are we talking about between SQL queries and graph queries? How much more efficient is one than the other? The answer: enough to make a significant difference to your business.

The efficiency of graph technology queries means they run in real time, and in an economy that runs at the speed of a single tweet, that’s a bottom-line difference you can’t afford to ignore.

Query Language Effectiveness

Disclaimer: I totally stole this from Ravi Pappu’s talk at GraphTour DC. (Unfortunately, we weren’t given permission to post the video recording.)

In Eurasia a good while back, humanity had two primary ways of counting: using an abacus or using the Hindu-Arabic numeral system (1, 2, 3, 4, 5, and so on). For counting and basic arithmetic, both methods were about equally efficient.

But there’s a reason that we aren’t still using abaci today: Arabic numerals could do much more than just count things.

From algebra to accounting, the Arabic numeral system was far more effective because it could be used to accomplish a far broader set of functions. It was like another language, allowing everyone to process and understand reality in a fundamentally different way.

Abaci were efficient at one particular task (counting), but you couldn’t do algebra with them (or, at least, it would be really time consuming if you tried). The abacus isn’t in the dustbin of history because it wasn’t good at its job (it was), but because it only did one job when the world needed more.

The Intimate Relationship between Data Modeling and Querying


Before diving into the mechanics of a graph database query language below, it’s worth noting that a query language isn’t just about asking (a.k.a. querying) the database for a particular set of results; it’s also about modeling that data in the first place.

We know from previous posts that data modeling for a graph database is as easy as connecting circles and lines on a whiteboard. What you sketch on the whiteboard – including the relationships – is what you store in the database.

On its own, this ease of modeling has many business benefits, the most obvious of which is that you can understand what the hell your database developers are actually creating. But there’s more to it: An intuitive model shaped with the right query language ensures there’s no mismatch between how you built the data and how you analyze it.

A query language closely reflects its underlying data model. That’s why SQL is all about tables and JOINs, while Cypher is about matching patterns of relationships between entities. Just as the graph model is more natural to work with, so is Cypher: it borrows from the pictorial representation of circles connected by arrows, which even a child can understand.

In a relational database, the data modeling process is so far abstracted from actual day-to-day SQL queries that there’s a major disparity between analysis and implementation. In other words, the process of building a relational database model isn’t fit for asking (and answering) questions efficiently from that same model.
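To make the "tables and JOINs" point concrete, here is a rough sketch of a mutual-friends lookup (Jim knows someone, who knows someone else that Jim also knows) written in SQL and run through Python's built-in sqlite3 module. The `person` and `knows` tables, their columns and the sample rows are assumptions made up for this illustration, not a schema from this series.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema:
# one table for people, one join table for KNOWS relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE knows (from_id INTEGER NOT NULL REFERENCES person(id),
                        to_id   INTEGER NOT NULL REFERENCES person(id));
    INSERT INTO person VALUES (1, 'Jim'), (2, 'Johan'), (3, 'Emil');
    -- Jim knows Johan, Jim knows Emil, Johan knows Emil
    INSERT INTO knows VALUES (1, 2), (1, 3), (2, 3);
""")

# Every hop in the triangle costs a pass through the join table
# plus a join back to the person table.
TRIANGLE_SQL = """
    SELECT b.name, c.name
    FROM person AS a
    JOIN knows  AS ab ON ab.from_id = a.id
    JOIN person AS b  ON b.id = ab.to_id
    JOIN knows  AS bc ON bc.from_id = b.id
    JOIN person AS c  ON c.id = bc.to_id
    JOIN knows  AS ac ON ac.from_id = a.id AND ac.to_id = c.id
    WHERE a.name = 'Jim'
"""

rows = conn.execute(TRIANGLE_SQL).fetchall()
print(rows)
```

Each relationship in the whiteboard drawing turns into its own JOIN here, which is the gap between model and query that this section describes.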

And a model mismatch means mental mismatch means wasted time and energy.

Graph database models, on the other hand, not only communicate how your data is related, but they also help you clearly communicate the kinds of questions you want to ask of your data model. Graph models and graph queries are just two sides of the same coin.

The right database query language helps us traverse both sides.

An Introduction to Cypher, the Graph Database Query Language


It’s time to dive into specifics. Most relational databases use a dialect of SQL as their query language, and while the graph database world has a few query languages to choose from, a growing number of vendors and technologies have adopted Cypher as their graph database query language (including Neo4j).

This introduction isn’t a reference document for Cypher but merely a high-level overview.

Cypher is designed to be easily read and understood by developers, database professionals and business stakeholders alike. It’s easy to use because it matches the way we intuitively describe graphs (i.e., the way we intuitively describe data) using whiteboard-like diagrams.

The basic notion of Cypher is that it allows you to ask the database to find data that matches a specific pattern. Colloquially, we might ask the database to “find things like this,” and the way we describe what “things like this” look like is to draw them using ASCII art.

Consider the simple pattern in the figure below.

A graph technology data model of a social network


This graph diagram describes three mutual friends.

If we want to express the pattern of this basic graph in Cypher, we would write:

(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(johan)-[:KNOWS]->(emil) 

This Cypher statement describes a path which forms a triangle that connects a node we call jim to the two nodes we call johan and emil, and which also connects the johan node to the emil node. As you can see, Cypher naturally follows the way we draw graphs on the whiteboard.

Now, while this Cypher pattern describes a simple graph structure, it doesn’t yet refer to any particular data in the database. To bind the pattern to specific nodes and relationships in an existing dataset, we first need to specify some property values and node labels that help locate the relevant elements in the dataset.

Here’s our more fleshed-out pattern:

(emil:Person {name:'Emil'})
     <-[:KNOWS]-(jim:Person {name:'Jim'})
     -[:KNOWS]->(johan:Person {name:'Johan'})
     -[:KNOWS]->(emil)

Here we’ve bound each node to its identifier using its name property and Person label. The emil identifier, for example, is bound to a node in the dataset with a label Person and a name property whose value is Emil. Anchoring parts of the pattern to real data in this way is normal Cypher practice.
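As a loose illustration of what binding an identifier by label and property means, the following Python sketch looks up nodes in a small in-memory list. The node layout, the extra Company node and the `find_nodes` helper are all hypothetical. A real graph database would typically use indexes rather than a linear scan.

```python
# A tiny in-memory stand-in for a graph: each node has labels and properties.
# The data and the find_nodes helper are hypothetical, for illustration only.
nodes = [
    {"id": 1, "labels": {"Person"}, "props": {"name": "Emil"}},
    {"id": 2, "labels": {"Person"}, "props": {"name": "Jim"}},
    {"id": 3, "labels": {"Person"}, "props": {"name": "Johan"}},
    {"id": 4, "labels": {"Company"}, "props": {"name": "Emil"}},  # same name, different label
]


def find_nodes(nodes, label, **props):
    """Return nodes that carry `label` and whose properties equal `props`."""
    return [
        n for n in nodes
        if label in n["labels"]
        and all(n["props"].get(key) == value for key, value in props.items())
    ]


# Roughly what (emil:Person {name:'Emil'}) asks for.
emil = find_nodes(nodes, "Person", name="Emil")
print([n["id"] for n in emil])
```

Only the node that has both the Person label and the matching name property is selected. The Company node with the same name is ignored.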

The Beginner’s Guide to Cypher Clauses


(Disclaimer: This section is still for beginners, but it’s definitely developer-oriented. If you’re just curious about database query languages in general, skip to the “Other Graph Query Languages” section below for a nice wrap-up.)

Like most query languages, Cypher is composed of clauses.

The simplest queries consist of a MATCH clause followed by a RETURN clause. Here’s an example of a Cypher query that uses these two clauses to find the mutual friends of a user named Jim:

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c), 
      (a)-[:KNOWS]->(c) 
RETURN b, c

Let’s look at each clause in further detail:

MATCH

The MATCH clause is at the heart of most Cypher queries.

Using ASCII characters to represent nodes and relationships, we draw the data we’re interested in. We draw nodes with parentheses, just like in these examples from the query above:

(a:Person {name:'Jim'})
(b)
(c)
(a)

We draw relationships using pairs of dashes with greater-than or less-than signs (--> and <--), where the < and > signs indicate relationship direction. Between the dashes, relationship names are enclosed by square brackets and prefixed by a colon, like in this example from the query above:

-[:KNOWS]->

Node labels are also prefixed by a colon. As you see in the first node of the query, Person is the applicable label.

(a:Person … )

Node (and relationship) property key-value pairs are then specified within curly braces, like in this example:

( … {name:'Jim'})

In our original example query, we’re looking for a node labeled Person with a name property whose value is Jim. The return value from this lookup is bound to the identifier a. This identifier allows us to refer to the node that represents Jim throughout the rest of the query.

It’s worth noting that this pattern (a)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c) could, in theory, occur many times throughout our graph data, especially in a large user set.

To confine the query, we need to anchor some part of it to one or more places in the graph. In specifying that we’re looking for a node labeled Person whose name property value is Jim, we’ve bound the pattern to a specific node in the graph — the node representing Jim.

Cypher then matches the remainder of the pattern to the graph immediately surrounding this anchor point based on the provided information on relationships and neighboring nodes. As it does so, it discovers nodes to bind to the other identifiers. While a will always be anchored to Jim, b and c will be bound to a sequence of nodes as the query executes.
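The anchor-then-expand idea can be sketched in a few lines of Python. This is only a conceptual model under simple assumptions (a hypothetical list of KNOWS pairs and a hand-written `mutual_friends` function), not a description of how any real query engine plans or executes a query.

```python
from collections import defaultdict

# Hypothetical KNOWS relationships as (from, to) pairs; direction matters.
KNOWS = [
    ("Jim", "Johan"),
    ("Jim", "Emil"),
    ("Johan", "Emil"),
    ("Emil", "Ian"),  # extra edge that should not complete a triangle
]

# Adjacency index: person -> set of people they know.
knows = defaultdict(set)
for src, dst in KNOWS:
    knows[src].add(dst)


def mutual_friends(anchor):
    """Find (b, c) pairs such that anchor->b, b->c and anchor->c all exist."""
    results = []
    for b in sorted(knows[anchor]):      # (a)-[:KNOWS]->(b)
        for c in sorted(knows[b]):       # (b)-[:KNOWS]->(c)
            if c in knows[anchor]:       # (a)-[:KNOWS]->(c)
                results.append((b, c))
    return results


print(mutual_friends("Jim"))
```

The anchor stays fixed while b and c take on one candidate after another. Only the combinations that satisfy every relationship in the pattern end up in the results.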

RETURN

This clause specifies which expressions, relationships and properties in the matched data should be returned to the client. In our example query, we’re interested in returning the nodes bound to the b and c identifiers.

Other Cypher Clauses

Other clauses you use in a Cypher query include:

WHERE
     Provides criteria for filtering pattern matching results.

CREATE and CREATE UNIQUE
     Create nodes and relationships. (CREATE UNIQUE is deprecated in newer Cypher versions; MERGE covers the same need.)

MERGE
     Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.

DELETE and REMOVE
     DELETE removes nodes and relationships; REMOVE removes properties and labels.

SET
     Sets property values and labels.

ORDER BY
     Sorts results as part of a RETURN.

SKIP and LIMIT
     Skip results at the top and limit the number of results.

FOREACH
     Performs an updating action for each element in a list.

UNION
     Merges results from two or more queries.

WITH
     Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.

If these clauses look familiar (especially if you’re a SQL developer), that’s great! Cypher is intended to be easy to learn for SQL veterans while also being easy for beginners. At the same time, Cypher is different enough to emphasize that we’re dealing with graphs, not relational sets.

In addition, Cypher borrows the idea of pattern matching from SPARQL, and some of the collection semantics have been borrowed from languages such as Haskell and Python.
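For readers who think in Python, several of the clauses listed above behave much like stages in a small data pipeline, with each stage handing its rows to the next (which is also the spirit of WITH). The sketch below is only an analogy: the rows, the `age` property and the chosen thresholds are made up for illustration.

```python
from itertools import islice

# Hypothetical result rows, as if produced by an earlier MATCH.
rows = [
    {"name": "Jim", "age": 40},
    {"name": "Johan", "age": 35},
    {"name": "Emil", "age": 45},
    {"name": "Ian", "age": 25},
]

# WHERE: keep only rows that satisfy a predicate.
filtered = (row for row in rows if row["age"] > 30)

# ORDER BY: sort the surviving rows.
ordered = sorted(filtered, key=lambda row: row["name"])

# SKIP 1 LIMIT 2: drop the first row, then keep at most two.
page = list(islice(ordered, 1, 1 + 2))

print([row["name"] for row in page])
```

Filtering happens before sorting, and paging is applied last, which mirrors the order in which these clauses usually appear in a query.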

(Click here for the most up-to-date Cypher Refcard to take a deeper dive into the Cypher query language.)

Other Graph Query Languages


Cypher isn’t the only graph database query language (though it’s certainly the dominant one); other graph technologies have their own means of querying data as well. Some support the RDF query language SPARQL (linked above), or the imperative, path-based query language Gremlin.

At the time of this writing, there is also an industry-wide effort to standardize around a single, vendor-neutral graph query language known as GQL.

Conclusion


Not everyone gets hands-on with their database query language on the day-to-day level; however, your down-in-the-weeds development team needs a practical way of modeling and querying data, especially if they’re tackling a graph-based problem.

If your team comes from an SQL background, a query language like Cypher will be easy to learn and even easier to execute. And when it comes to your mission-critical application, you’ll be glad that the language underpinning it all is built for speed and efficiency across connected data.

At this point, it’s worth reflecting. Take a closer look at your data and ask yourself: How would I solve my data challenges differently if my entire approach – vocabulary, syntax, semantics, conceptual model – was distinctly matched to the nature of the challenge?

Don’t be afraid to explore those implications.

