First, let’s get the apologies out of the way. This article is longer than I first thought it would be. Much longer, in fact. Originally I had intended to write a “How to” article, outlining the way I linked SQL Server to a graph database, Neo4j. Then I sat back and thought further. Most dyed-in-the-wool SQL Server aficionados will consider the very idea of using a graph database, or any NOSQL technology for that matter, to be heresy. “SQL Server can do anything,” they will say. “Why do I need this non-relational stuff?” “Why should I bother reading this article?”
Yes, SQL Server is incredibly capable at processing data, in huge quantities, and at prodigious speeds (when well-tuned). So I came to the conclusion that as well as the “How”, I needed to address the “What” and the “Why”. What problem is so intractable that SQL Server is not the best solution, and what alternative technologies might be available? Why did I choose a graph database to solve my problem? Only then could I get to the heart of the issue and describe how I built the solution.
What is the problem?
SQL Server is not necessarily the most efficient way of solving every problem. In particular, I have always found difficulties in handling hierarchical data. Saying there are problems with the handling of relationships between data items seems paradoxical when dealing with a relational database, but just think carefully about the situation. The relations in a relational database are handled by joins between tables. These can be managed in two ways: either in a permanent, declarative fashion by the use of foreign keys, or in a temporary fashion by ad-hoc joins in our queries. Either way, joins are very efficient in most cases.
Hierarchical data, on the other hand, is often modelled with self-joins. Consider the typical example of an employee table which has a unique identifier (employee_ID) and a foreign key field holding the employee_ID of the person’s manager, who is themselves an employee. This can be used to create a recursive CTE to return an organisation’s management structure, as sketched below. But this is a very simple tree structure: an employee is directly linked to only one manager, and the flow of responsibility is in a single direction. Since SQL Server 2008, Microsoft has also provided the hierarchyid data type to model such structures.
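As an illustration, a minimal recursive CTE over such a table might look like the sketch below. The table and column names (dbo.Employees, employee_ID, employee_name, manager_ID) are assumptions made for the example, not a schema from the application described later.

-- Illustrative only: assumes an Employees table where manager_ID holds the
-- employee_ID of the employee's manager (NULL for the top of the tree).
WITH OrgChart (employee_ID, employee_name, Level) AS
(
    -- Anchor: employees with no manager
    SELECT e.employee_ID, e.employee_name, 1
    FROM dbo.Employees e
    WHERE e.manager_ID IS NULL
    UNION ALL
    -- Recursive step: employees reporting to someone already in the result set
    SELECT e.employee_ID, e.employee_name, o.Level + 1
    FROM dbo.Employees e
        INNER JOIN OrgChart o ON e.manager_ID = o.employee_ID
)
SELECT employee_ID, employee_name, Level
FROM OrgChart
ORDER BY Level, employee_name;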
Hierarchical data is often not that simple. For example, an employee might be managerially responsible to one supervisor, but professionally responsible to another. The hierarchyid data type cannot handle multiple parents, so it is not appropriate for such data. Concepts might also have bi-directional links. Consider a social networking example: John might “follow” Paul, but Paul may or may not follow John. Again, the hierarchyid data type cannot handle these multi-directional relationships. Such social networking situations are not simple trees as in the employee-manager scenario, but a more generalised structure – a graph. Graphs can be used to model complex interactions where concepts are linked by various types of relationships, with multiple parents and multiple children.
Graphs can be implemented in a relational database using foreign keys, as in the employee-manager example, but often need link tables to model the complexities. And whereas employee-manager trees tend to be of limited depth, social networking graphs can extend to an almost unlimited depth and can even be circular in nature. The execution plans for the queries needed to traverse these structures become increasingly complex, and as the depth of traversal of the graph or tree increases, the queries become ever more inefficient.
My own business domain is that of health care, and for many years I was the head of a software development team developing an electronic health record for cancer patients. True to the spirit of using coded data wherever possible, diagnoses were stored using READ Codes V2. This was a UK oriented coding system containing a thesaurus of over 85,000 concepts plus a further 40,000 for drugs and medical devices.
The codes themselves were 5-byte alphanumeric sequences, and the classification of the terminology was inherent in the structure of the codes. Thus code B1... was the code for the concept “Malignant neoplasm of digestive organs and peritoneum”, and child concepts representing more specific areas of the digestive tract were coded by adding additional characters. So B10.. represented “Malignant neoplasm of oesophagus”, B11.. represented “Malignant neoplasm of stomach” and so on. Similarly, B100. was “Malignant neoplasm of cervical oesophagus” and B101. was “Malignant neoplasm of thoracic oesophagus”. This meant that to find out how many patients were suffering from malignant disease of some part of the oesophagus, it was sufficient to include the following in the query:
SELECT COUNT(*)
FROM DISEASES
WHERE READCODE LIKE 'B10%'
This was simple, straightforward, and more to the point, very fast.
Unfortunately there is a new coding scheme which is being adopted as a National Standard by the NHS, and, as Murphy’s Law dictates, it is not that simple any more. The new coding scheme, SNOMED CT, has many advantages, not the least of which is having a much larger thesaurus of concepts (over 700,000). It also has one other feature which has both pluses and minuses: its codes are, to all intents and purposes, random or surrogate keys. ConceptIDs are large integers, normally represented as varchar(18) in SQL Server, and relationships between them are defined using a Relationships table which, among other fields, has ConceptID1, ConceptID2, and RelationshipType. This is very flexible, allowing concepts to be represented in multiple hierarchical positions, to have multiple parents, and to have many types of relationships apart from the simple concept_1 IS_A concept_2 type.
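To make the queries that follow easier to picture, here is a heavily simplified sketch of what such tables might look like in SQL Server. The real SNOMED CT release files carry many more columns (module, effective time, active flags and so on), so treat this purely as an illustration of the shape of the data rather than the actual schema.

-- Illustrative, heavily simplified sketch of the two core tables.
CREATE TABLE dbo.Concepts
(
    ConceptID          varchar(18)   NOT NULL PRIMARY KEY,
    FullySpecifiedName nvarchar(255) NOT NULL
);

CREATE TABLE dbo.Relationships
(
    RelationshipID   varchar(18) NOT NULL PRIMARY KEY,
    ConceptID1       varchar(18) NOT NULL REFERENCES dbo.Concepts (ConceptID), -- source, e.g. the child in an IS_A
    RelationshipType varchar(18) NOT NULL REFERENCES dbo.Concepts (ConceptID), -- itself a concept, e.g. '116680003' = IS_A
    ConceptID2       varchar(18) NOT NULL REFERENCES dbo.Concepts (ConceptID)  -- target, e.g. the parent in an IS_A
);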
In fact there are over 3,000,000 relationships defined, of 81 different relationship types, including IS_A, WAS_A, SAME_AS, Severity, Laterality, Finding_context etc., at a depth of up to 14 levels from the root concept. Although this allows a much greater accuracy in coding medical situations, you can no longer do a simple count with a predicate such as WHERE CONCEPTID LIKE '371984007%' (the concept ID for malignant neoplasm of oesophagus): the codes of child concepts bear no structural resemblance to the code of their parent, so such a query produces meaningless results.
It is possible to get the information using recursive CTEs. For example, a query to return all the child concepts of ‘371984007’ (the concept id for Malignant neoplasm of oesophagus) is as follows:
WITH children (ConceptID, FullySpecifiedName, Level) AS
(
    SELECT c1.ConceptID, c1.FullySpecifiedName, 1
    FROM dbo.Concepts c
        INNER JOIN Relationships r ON r.ConceptID2 = c.ConceptID
            AND r.RelationshipType = '116680003' -- conceptID for 'IS_A' relationship
        INNER JOIN dbo.Concepts c1 ON c1.ConceptID = r.ConceptID1
    WHERE c.ConceptID = '371984007' -- conceptID for malignant neoplasm of oesophagus
    UNION ALL
    SELECT c1.ConceptID, c1.FullySpecifiedName, c.Level + 1
    FROM children c
        INNER JOIN Relationships r ON r.ConceptID2 = c.ConceptID
            AND r.RelationshipType = '116680003' -- conceptID for 'IS_A' relationship
        INNER JOIN dbo.Concepts c1 ON c1.ConceptID = r.ConceptID1
)
SELECT DISTINCT * FROM children
This query yields thirteen concepts over three levels. However, once one starts probing relationships over many more levels, or when one needs ancestor codes as well as descendant codes, the SQL becomes increasingly inefficient, and we need to consider other solutions.
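For the ancestor direction, the same pattern applies with the join simply reversed. A sketch, using the same simplified tables as the query above, is shown below; it suffers from exactly the same growth in cost as the traversal deepens.

-- Sketch: ancestors (parents, grandparents, ...) of a given concept.
-- Same tables as above; only the direction of the join changes.
WITH ancestors (ConceptID, FullySpecifiedName, Level) AS
(
    SELECT c2.ConceptID, c2.FullySpecifiedName, 1
    FROM dbo.Concepts c
        INNER JOIN Relationships r ON r.ConceptID1 = c.ConceptID
            AND r.RelationshipType = '116680003' -- 'IS_A'
        INNER JOIN dbo.Concepts c2 ON c2.ConceptID = r.ConceptID2
    WHERE c.ConceptID = '371984007' -- malignant neoplasm of oesophagus
    UNION ALL
    SELECT c2.ConceptID, c2.FullySpecifiedName, a.Level + 1
    FROM ancestors a
        INNER JOIN Relationships r ON r.ConceptID1 = a.ConceptID
            AND r.RelationshipType = '116680003'
        INNER JOIN dbo.Concepts c2 ON c2.ConceptID = r.ConceptID2
)
SELECT DISTINCT * FROM ancestors;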
Why choose a graph database?
It is at this point that we can look at NOSQL solutions. (As an aside, I like to think of NOSQL as meaning “Not Only SQL” rather than “No SQL”.) NOSQL technologies come in many flavours: document databases, key-value stores, and graph databases, to name but a few. It is the latter that I shall consider here. Graph databases have been designed from the ground up to handle the type of data I refer to above – graph data! Rather than reproduce already published material, I recommend reading “Graph Databases”, by Robinson, Webber and Eifrem (O’Reilly, 2013, available as a free download from https://graphdatabases.com/), which discusses in detail what graph databases are, what applications they are best suited for, and how they work. I will extract a few salient points though. Graph databases represent entities or concepts as nodes, and the ways in which those entities relate to the world and to each other as relationships. Their storage engines draw on algorithms developed in the mathematics of graph theory to solve graph problems more efficiently than a relational storage engine can. Unlike relational databases, relationships in graph databases are real entities and do not have to be inferred from foreign keys.
The graph database I have used is Neo4j from Neo Technology (https://www.neotechnology.com). Unlike many NOSQL solutions, Neo4j is suitable for enterprise deployment as it features full ACID transactions, and it scales to many billions of nodes and relationships through its high availability options. Neo4j implements the property graph model, which has the following characteristics:
- It contains nodes and relationships
- Nodes contain properties (key-value pairs)
- Relationships are named and directed, and always have a start and end node
- Relationships can also contain properties
For an excellent introduction to Neo4j (and graph databases in general), see the Neo4j Manual (https://docs.neo4j.org/chunked/2.0.0-M06/index.html).
There is a body of published material discussing the relative merits of SQL and NOSQL database engines. One comparison which is relevant here is contained in the book “Neo4j in Action”, by Partner, Vukotic and Watt (ISBN: 9781617290763, Manning Early Access Program, 2013, https://www.manning.com/partner/). This book considers a simple social networking database with two tables, t_user and t_user_friend. The latter has two fields, user_1 and user_2, which are both foreign-keyed to the t_user table. The t_user table is populated with 1,000 users, each with an average of 50 friends, so the t_user_friend table has 50,000 records. The authors then query this table for friends of friends up to a depth of 5, and found query times of about 10 seconds at depth 4 and over 90,000 seconds at depth 5.
In my own experiments on SQL Server 2012, the query times were not as bad as this, although there was still a significant deterioration in performance at depth 5. However, I did modify the query slightly. The book used a SELECT DISTINCT … whereas I just used a SELECT INTO a temp table and then a SELECT DISTINCT out of the temp table. This showed such a marked improvement over the original that I am still surprised that the query optimiser can’t do something similar! The book goes on to demonstrate sub-second query times for similar data in Neo4j, and I also found this to be the case.
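As an illustration of that modification, here is a minimal sketch of the depth-3 version, written against the book’s table and column names (t_user_friend with user_1 and user_2). The exact query used in the book may differ, so treat this purely as the shape of the approach rather than a reproduction of it.

-- Friends-of-friends-of-friends for one user, de-duplicating after the joins
-- rather than using SELECT DISTINCT directly on the multi-way join.
DECLARE @user_id int = 1; -- example starting user

SELECT uf3.user_2 AS friend_id
INTO #fof
FROM t_user_friend uf1
    INNER JOIN t_user_friend uf2 ON uf2.user_1 = uf1.user_2
    INNER JOIN t_user_friend uf3 ON uf3.user_1 = uf2.user_2
WHERE uf1.user_1 = @user_id;

SELECT DISTINCT friend_id
FROM #fof;

DROP TABLE #fof;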
This data clearly demonstrates an advantage in using a graph database such as Neo4j to handle complex relationships between data items. Neo4j has its own query language, Cypher, which broadly follows a SQL-like syntax. The following Cypher query will return the child concepts of SNOMED CT concept ID '371984007', the conceptID for malignant neoplasm of oesophagus used in the examples above:
match p=(child:Concept)-[:IS_A*]->(parent:Concept)
where parent.conceptID='371984007'
return distinct
child.conceptID as conceptId,
child.fullySpecifiedName as fullySpecifiedName,
length(p) as level
Cypher uses ASCII art in its pattern matching, where () represents a node and [] represents a relationship. So (child:Concept) is a node with the label Concept, given the local identifier child, and -[:IS_A*]-> represents a directed chain of IS_A relationships of any depth between the node identified as child and the node identified as parent. The identifier p holds a reference to the entire matched path. The where clause is similar to SQL and tells the Neo4j engine to start matching paths from a parent node with the indicated value for the conceptID property. The return clause is similar to the SQL SELECT, and instructs the engine to return the listed properties of each child node; length(p) is the number of relationship steps in the matched path.
The nodes and relationships returned from the Cypher query are illustrated graphically in figure 1, which is a screen clipping from the Neo4j browser. Note that in this figure, the numbers on the nodes are integer node IDs assigned by Neo4j when the node is created. These can be retrieved in Cypher queries using the syntax id(node_Identifier).
However, this in itself is not the whole answer to the problem. The Cypher query against Neo4j will give me the list of disease codes I am interested in. But the rest of my data is not stored in Neo4j, nor would I want it to be. In accepting that the graph database is a better solution for handling my medical terminology hierarchies, I am not saying it is better for everything else! And this will often be the case.
For example, in an ecommerce application, a graph database may well be an efficient way of identifying recommended products (you know, “people who bought what you have just added to your shopping cart also purchased this, that and the other”). But you probably don’t want to use the graph database for all the inventory, shopping cart, user base, and credit card details. The relational model has a proven track record there at least.