Demining the “Join Bomb” with Graph Queries


For the past couple of months, and even more so since the beer post, people have been asking me a question that I have been struggling to answer myself for quite some time: What is so nice about the graphs? What can you do with a graph database that you could not, or only at great pains, do in a traditional relational database system?

Conceptually, everyone understands that this is because of the inherent query power in a graph traversal – but how to make this tangible? How to show this to people in a real and straightforward way?

And then Facebook Graph Search came along, along with it’s many crazy search examples – and it sort of hit me: we need to illustrate this with *queries*. Queries that you would not – or only with a substantial amount of effort – be able to do in traditional database system and that are trivial in a graph.

This is what I will be trying to do in this blog post, using an imaginary dataset that was inspired by the Telecommunications industry. The dataset is very simple: a number of “general” data elements (countries, languages, cities), a number of “customer” data elements (person, company) and a number of more telecom-related data elements (phones, conference call service providers and operators – I actually have the full list of all mobile operators in the countries in the dataset coming from here and here).

So to start of with: what would this data set look like in a relational model?

GcZFfg53RxlHWP1taN4h2KlrKEOhk34DCiFSF9T96NXeCgRD2iv6aMBdt3cX_2WNsESfVM3Y7uqC6VkS-u9C6kVq9kgh5hCyZ98uiZNU3_lV9cENRMa2


What is immediately clear is that there is *so* much overhead in the model. In order to query anything meaningful from this normalised RDBMS, you *need* to implement these things called “join tables”. And really: These things stink.

This is just an example of what a poor match the relational model is to the real world – and the complexity it introduces when you start using it.

Compare this to the elegance of the graph model:

80JcMEKM5033a4YSewBgMMFS8UPj4OvCWMN4m5ZN-eoIGPpcVO39xN11BRf3EEmWro0kQ35YBhSazJIBQfniEjNDkVAGpmr-zNVlgKLnrqzGxbAXF8-L


It is such a good match to reality – it is just great. But the beauty is not just in the model – it’s in what you do with the model, in the queries.

So let’s how we could ask some very interesting, Facebook-style queries of this model:
    • Find person in London who speaks more than one language and who owns an iPhone 5
    • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator
    • Find a city where someone from Neo Technology lives who speaks English and has Three as his operator in the city that he lives in
    • Find a person not living in Germany, who roams to more than 2 countries and who emails people who live in London
    • Find a person who roams to more than two countries and who lives in the U.S.A. and uses a Conference Call Service there
These are all very realistic queries that could serve real business purposes (pattern recognition, recommendation engines, fraud detection, new product suggestions, etc.), and that would be terribly ugly to implement in a traditional relational database system, and surprisingly elegant on a graph.

To do that, we’ll use our favourite graph query language, Cypher, to describe our patterns and get the data out.

So let’s explore a couple of examples with some real-world queries.

Graph Queries on a simple telecommunications model from Neo Technology on Vimeo.

The first thing to realise here is the relevance of an important concept in the world of databases, and more specifically so in the world of graph databases: the use of indexes.

In a traditional database, indexes are expensive but indispensable tools to quickly find the appropriate records in a table using a “key”. And when joining two tables, the indexes on both tables would need to be scanned completely and recursively to find *all* the data elements fitting the query criteria.

This is why “joins” are so expensive computationally – and this is also why graph queries are so incredibly fast for join-intensive requests. The thing is, that in a graph database, you *only* use the index on the data *once* at the start of the query – to find your starting points of the “traversals”.

Once you have the starting points, you can just “walk the network” and find the next data element by hopping along the edges/relationships and NOT using any indexes. This is what we call “index-free adjacency” – and it is a fundamental concept in understanding graph traversals.

In the example below, you can see that we are using three index lookups (depicted in green, and I even added a nice little parachute symbol to illustrate what we are doing here) to “parachute” or land into the network and start walking the graph from there.

PArc_CPq5DfqcCdgRwix358vTtn3HS_DhZr7xqwzAjkNSnPCMC26dQbYYWrWlm-D-Kzk8iDRYtI8R6dd28CVZ4JV2ZzuzcS3O4cPtOd95siGb4kjsaIB


The query above is to look for a city where someone from Neo Technology lives that speaks English and has Three as his operator in the city that he lives in.

// These are the three parachutes, landing by doing an index lookup for nodes using the node_auto_index of Neo4j.

START 
neo=node:node_auto_index(name="Neo Technology"),
english=node:node_auto_index(name="English"),
three=node:node_auto_index(name="3")

// Here we describe the pattern that we are looking for. From the three starting points, we are looking for a city that has very specific, directed relationships that need to match this pattern.

MATCH
(person)-[:LIVES_IN]->(city)-[:LOCATED_IN]->(country),
(person)-[:HAS_AS_HOME_OPERATOR]->(three)-[:OPERATES_IN]->(country),
(person)-[:SPEAKS]->(english),
(person)-[:WORKS_FOR]->(neo)

// We return the city’s name and the person’s name as a result set from this query.

RETURN city.name, person.name

// and order them by the name of the city

ORDER BY city.name;

And here’s another example:

ctoGw0jSovwEBMZ98CBFgZoFqCwaNQSKqsLTpcvGRvaUN3aY3BnbXyP2Ce7gulyVt-biVS7KZIzUY8tHhIer_ydWgMT_QBB0pjG9G4O2vC83NP6M8nkK


Here we are looking for two people in the same countries but on different home operators that call, mail or text each other.

// Here we use just one index lookup to find a “country” and then we start looking for the pattern.

START
country=node:node_auto_index(name="Country")

// The pattern in this case is quite a complex one, we quite a few hops on the different relationship types.

MATCH
(samecountry)-[:IS_A]->(country),
(person)-[:LIVES_IN]-()-[:LOCATED_IN]-(samecountry),
(otherperson)-[:LIVES_IN]-()-[:LOCATED_IN]-(samecountry),
(person)-[:HAS_AS_HOME_OPERATOR]->(operator),
(otherperson)-[:HAS_AS_HOME_OPERATOR]->(otheroperator)

// Here we limit the results to a specific condition that has to be applied.

WHERE
(otherperson)-[:CALLS|TEXTS|EMAILS]-(person)
AND operator <> otheroperator

// And here we return the distinct set of name of the person and the countries’ name.

RETURN DISTINCT person.name, samecountry.name;

I hope you can see that these kinds of queries, which directly address the nodes and relationships rather than going through endless join tables, are a much cleaner way to pull this kind of data from the database.

The nice thing about this way of querying is that, in principle, its performance is extremely scalable and constant: We will not suffer the typical performance degradation that relational databases suffer when doing lots of joins over very long tables.

The reason for this is simple: because we only use the indexes to find the starting points, and because the other “joins” will be done implicitly by following the relationships from node to node, we can actually know that performance will remain constant as the dataset grows. Finding the starting point may slow down (a bit) as the index grows, but exploring the network will typically not – as we know that not everything will be connected to everything, and the things unconnected to the starting nodes will simply “not exist” from the perspective of the running query.

Obviously there a lot more things to say about graph queries, but with these simple examples above, I hope to have given you a nice introduction as to where exactly the power of graph traversals is – and it’s in these complex, join-intensive queries.

Yours sincerely,

Rik Van Bruggen


Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today.