Getting Started with Data Analysis using Neo4j [Community Post]


[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Why Are We Doing This?

Data analysis is the phenomenon of dissecting, structuring and understanding data. In a nutshell, we want to find meaning from our data. In this tutorial, we aim to analyze a dataset from issuu.com. The goal is to find answers to a variety of simple and complex questions.

There are a plethora of tools, techniques and methods available to pursue data analysis. We will use Neo4j – a graph database – to represent and visualize the data. It uses a query language called Cypher which allows us to build queries and find all the answers that we seek. By the end of this tutorial, we will be able to import a JSON dataset to Neo4j and comfortably perform queries on our data.

Recommended Reading

Readers are encouraged to read and possess a basic understanding about Neo4j and graph databases prior to starting this tutorial.

I previously published a beginner’s guide to Neo4j – Getting Started with Neo4j – which introduces the concepts of graph databases and elaborates on how to perform basic CRUD (Create, Read, Update, Delete) operations using Neo4j.

Getting Set Up

The first thing we need to do is to download and install Neo4j. We will use Neo4j Desktop which provides a user-friendly UI to visualise the graph and run queries.

The next step is to acquire and understand the dataset!

1. The Dataset


The dataset that we will analyze comes from Issuu – an online repository for magazines, catalogs, newspapers and other publications. They published the Issuu Research Dataset with a treasure of data about documents and visitors. The dataset is completely anonymised and provides an insight into the usage of the website.

The data is available in the JSON format. It can be downloaded and accessed from this GitHub repository. There are two flavors of this file:

  1. A small version – issuu_sample.json (This version of the dataset has 4 entries).
  2. A large version – issuu_cw2.json (This version of the dataset has 10,000 entries).
Both these datasets have been slightly modified for the purpose of this tutorial. To summarize the modification, all JSON entries are now stored in an array and referenced by a key called items.

The dataset is vast and the detailed specification is available here. However, we are interested in the following attributes:

Attribute Purpose
env_doc_id Uniquely identify each document
visitor_uuid Uniquely identify each visitor
visitor_country Two-letter code to identify visitor’s country
event_type Type of action accomplished by visitor on the document

2. Understanding the Graph


Now that we have selected our dataset and cherry-picked the necessary attributes, the next step is to formulate the data as a graph. To create the graph in Neo4j, we need to identify the following elements:
    • Nodes
    • Relationships
    • Properties
From our dataset, we can identify two nodes with the following properties:

Node Properties
Document doc_uuid
Visitor visitor_uuid, country
Note: uuid stands for “Universally Unique IDentifier.”

Tip: A node can also be thought of as a class in object-oriented programming.

What about the relationships? We can create one relationship between the document and visitor nodes.

Relationship Properties
Visitor viewed document type
The relationship viewed is generic in nature. The type property indicates the specific type of event that was accomplished.

As an example, if we consider the visitor Thomas and the document A Diary of Jane, then the relationship can be illustrated as: Thomas viewed A Diary of Jane. However, the type of viewership could be any one of the following:
    • impression
    • click
    • read
    • download
    • share
    • pageread
    • pagereadtime
    • continuation_load
For the purpose of this exercise, we will use a single property on the relationship called as type. The relationship can now be illustrated as: visitor Thomas viewed (and specifically downloaded) document A Diary of Jane.

3. Creating Constraints & Indexes


A constraint is a mechanism to control and ensure data integrity in Neo4j. Constraints can be created on either nodes or relationships. There are basically three types of constraints in Neo4j:

  1. Unique node property constraints – to ensure that the graph contains only a single node with a specific label and property value.
  2. Node property existence constraints – to ensure that a certain property of a specific label exists within all nodes in the graph.
  3. Relationship property existence constraints – to ensure that a certain property exists within all relationships of a specific structure.
  4. Node Keys – to ensure that, for a given label and set of properties, there exists all those properties for that label and that the combination of property values is unique. Essentially, a combination of existence and uniqueness.
We will create unique node property constraints for our graph as follows. On the Document label for the property doc_uuid:

CREATE CONSTRAINT ON (d:Document) ASSERT d.doc_uuid IS UNIQUE

On the visitor label for the property visitor_uuid:

CREATE CONSTRAINT ON (v:Visitor) ASSERT v.visitor_uuid IS UNIQUE

The main goal of this exercise is to query the graph to derive insights about the dataset. One way to improve retrieval efficiency is by using indexes. The idea behind an index here is the same as in relational and NoSQL databases.

In Neo4j, an index can be created on a single property of a label. These are known as single-property indexes. An index can also be created on multiple properties of a label, and these are known as composite indexes.

It is important to understand that, by creating a unique node property constraint on a property, Neo4j will also create a single-property index on that property. Hence, in our situation, the indexes will be created on the doc_uuid property for the document label and on the visitor_uuid property for the visitor label.

Now that we have created the constraints, we can view existing indexes in Neo4j using the query:

CALL db.indexes

It should return:

╒══════════════════════════════════╤══════╤══════════════════════╕
│description                    │state │type                │
╞══════════════════════════════════╪══════╪══════════════════════╡
│INDEX ON :Document(doc_uuid)   │online│node_unique_property│
├──────────────────────────────────┼──────┼──────────────────────┤
│INDEX ON :Visitor(visitor_uuid)│online│node_unique_property│
└──────────────────────────────────┴──────┴──────────────────────┘

Tip: More information about constraints and their effect on indexes is elaborated in the Neo4j documentation on constraints.

4. Importing the JSON Dataset


Let’s now get our hands dirty!

Assuming that Neo4j is started (with an appropriate Database Location selected), we should first see an empty graph. This means that there are no nodes or relationships. Our goal is to populate the graph with the data from the JSON dataset by defining its skeleton (nodes and relationships).

To accomplish this, we will use the concept of user-defined procedures in Neo4j. These procedures are specific functionalities which can be re-used to manipulate the graph. We will specifically use a procedure from the APOC library which is a collection of 200+ commonly used procedures.

On Neo4j Desktop, APOC can be installed with the click of a single button.

To do this, open your project and click on the Manage button. Next, click on the Plugins tab. Under this tab, you will see the APOC heading accompanied with its version and a description. Click on the Install and Restart button. Once APOC is successfully installed, you should see a label with the message ✓ Installed.

Here is a screenshot of a successful APOC installation:

APOC installation on Neo4j


Tip: More information about user-defined procedures and the APOC library (installation, usage, examples) is elaborated in this Neo4j blog article: APOC: An Introduction to User-Defined Procedures and APOC.

We are interested in a specific procedure – apoc.load.json – which will allow us to load data from a JSON document. This procedure will return a singular map if the result is a JSON object or a stream of maps if the result is an array.

The following code snippet illustrates the use of the apoc.load.json procedure with our large dataset – issuu_cw2.json:

WITH "/path/to/issuu_cw2.json" AS url
CALL apoc.load.json(url) 
YIELD value

Now, value contains our data which we need to utilize to create the nodes and relationships. However, we first need to parse this bulk JSON data.

To do this, we can use the UNWIND keyword in Neo4j. In a nutshell, UNWIND will expand a list into a sequence of rows. (More information about UNWIND is available in the Neo4j documentation on Cypher clauses.) At this stage, we are required to possess an understanding of the structure of the JSON data. As mentioned in step 1, we can access the entries using the key items.

This is illustrated as:

UNWIND value.items AS item

At this stage, we have access to a single row via item. Now, we can use the API to access the values.

In Neo4j, the WITH clause allows using the output of a sub-query with following sub-query parts. (More information about WITH is available in the Neo4j documentation on Cypher clauses).

From our data, we are concerned about documents from the Issuu “Reader” software. We uniquely identify these documents using the env_doc_id attribute. However, it is to be noted that not all documents have the env_doc_id attribute. Thus, we are required to explicitly select those entries which possess the attribute. The following code snippet illustrates this:

WITH item
WHERE NOT item.env_doc_id IS NULL

Notice how we access the value using item.env_doc_id. This style of retrieving the value makes working with JSON data on Neo4j a smooth experience.

Now that we have access to the values of each entry, it is time to create the nodes and relationships. This is accomplished using the MERGE keyword in Neo4j. It is crucial to know the difference between CREATE and MERGE.

For example, if we execute the following statements sequentially (provided that the constraints were not created):

CREATE (d:Document {doc_uuid:1}) RETURN (d) 
CREATE (d:Document {doc_uuid:1}) RETURN (d)

This will result in two nodes being created. A way to control this is by using MERGE, which will create the node only if it does not exist. This can be illustrated as follows:

MERGE (d:Document {doc_uuid:1}) RETURN (d)

This same principle can be applied to relationships as well. The final Cypher query consisting of all the above components will look like this:

WITH "https://raw.githubusercontent.com/.../issuu_cw2.json" AS url
CALL apoc.load.json(url) 
YIELD value
UNWIND value.items AS item
WITH item
WHERE NOT item.env_doc_id IS NULL
MERGE (document:Document 
       {doc_uuid:item.env_doc_id})
MERGE (visitor:Visitor 
       {visitor_uuid:item.visitor_uuid}) 
      ON CREATE SET visitor.visitor_country 
                    = item.visitor_country
MERGE (visitor)-[:VIEWED{type:item.event_type}]->(document)

If we run this Cypher query verbatim on Neo4j, the output should be (similar to):

Added 2293 labels, created 2293 nodes, set 5749 properties, 
created 2170 relationships, statement executed in 15523 ms.

To check whether the graph was populated successfully, we can run the Cypher query…

MATCH (n) RETURN (n) LIMIT 200 

…which will only display the top 200 results. The output can be visualized as follows:

Learn how to use the Neo4j graph database for data analysis with this step-by-step tutorial


5. Let’s Query


Finally, we can derive insights from our data.

We need to ask our graph questions. These questions need to be translated to Cypher queries which will return the appropriate results. Let’s start by answering some basic and advanced questions about the dataset.

Note: For the purposes of this tutorial, we will only display the top 10 results for queries with a large number of rows. This is achieved by using the LIMIT 10 constraint.

Query 1. Find the number of visitors from each country and display them in the descending order of count.

MATCH (v:Visitor) 
RETURN v.visitor_country AS Country, count(v) AS Count 
ORDER BY count(v) DESC 
LIMIT 10

Result:

╒════════╤═════╕
│Country│Count│
╞════════╪═════╡
│US     │312  │
├────────┼─────┤
│BR     │143  │
├────────┼─────┤
│MX     │135  │
├────────┼─────┤
│PE     │47   │
├────────┼─────┤
│CA     │46   │
├────────┼─────┤
│ES     │43   │
├────────┼─────┤
│GB     │36   │
├────────┼─────┤
│AR     │35   │
├────────┼─────┤
│FR     │34   │
├────────┼─────┤
│CO     │32   │
└────────┴─────┘

This query simply performs an internal group by operation where visitor nodes are grouped based on the visitor_country property. The count is computed using the count() aggregate function. We sort the results in the descending order using the ORDER BY <column> DESC clause in Neo4j.

Query 2. For a given document, find the number of visitors from each country.

MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
WHERE d.doc_uuid='140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9'
RETURN v.visitor_country AS Country, count(v.visitor_country) AS Count 
ORDER BY count(v.visitor_country) DESC

Result:

╒════════╤═════╕
│Country│Count│
╞════════╪═════╡
│GY     │15   │
├────────┼─────┤
│CA     │12   │
├────────┼─────┤
│US     │11   │
├────────┼─────┤
│CW     │1    │
├────────┼─────┤
│BB     │1    │
└────────┴─────┘

This query is very similar to the one above. Here, we perform an internal group by operation to group the visitor nodes based on the visitor_country property. However, this query differs from the previous one in the sense that we want to filter the counts for a particular document UUID.

In order to achieve this filtration, we need to utilise the relationship within the graph. Hence, we first MATCH, filter using the WHERE clause and then return the desired values.

Tip: The relationship given here…

MATCH (d:Document)<-[:VIEWED]-(v:Visitor) 

...can also be written as…

MATCH (v:Visitor)-[:VIEWED]->(d:Document)

Query 3. Find the number of occurrences for each type of viewership activity.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN r.type AS Type, count(d.doc_uuid) AS Count
ORDER BY Count ASC

Result:

╒═════════════╤══════╕
│Type        │Count│
╞═════════════╪══════╡
│click       │1    │
├─────────────┼──────┤
│read        │62   │
├─────────────┼──────┤
│pageread    │369  │
├─────────────┼──────┤
│pagereadtime│779  │
├─────────────┼──────┤
│impression  │959  │
└─────────────┴──────┘

This query also performs an internal group by operation on the relationship property type. An interesting aspect of this query is the ORDER BY Count ASC. Previously, we followed the style of using ORDER BY count(d.doc_uuid) ASC. However, once we add a column name such as Count, we can use that in subsequent parts of the query.

Hence, ORDER BY count(d.doc_uuid) ASC can also be written as ORDER BY Count ASC.

Query 4. Find the visitors for each document and display the top three in the descending order of number of visitors.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN DISTINCT d.doc_uuid AS DocUUID, collect(DISTINCT v.visitor_uuid) AS Visitors, count(DISTINCT v.visitor_uuid) AS Count
ORDER BY Count DESC
LIMIT 3

Result:

╒═════════════════════════════════╤═════════════════════════════════╤═════╕
│DocUUID                       │Visitors                      │Count│
╞═════════════════════════════════╪═════════════════════════════════╪═════╡
│140224101516-e5c074c3404177518│[4f4bd7a35b20bd1f, 78e1a8af51d│26   │
│bab9d7a65fb578e               │44194, 4d49271019c7ed96, 6b2d3│     │
│                              │cca6c1f8595, 19f5285fef7c1f00,│     │
│                              │ 3819fc022d225057, f102d9d4fc4│     │
│                              │bacdc, e5d957682bc8273b, abefb│     │
│                              │3fe7784f8d3, 6170372b90397fb3,│     │
│                              │ 797846998c5624ca, 43dd7a8b2fa│     │
│                              │fe059, 3ec465aa8f36302b, d6b90│     │
│                              │f07f29781e0, 7bd813cddec2f1b7,│     │
│                              │ 3db0cb8f357dcc71, e1bfcb29e0f│     │
│                              │3664a, 6d87bcdc5fa5865a, b0ba1│     │
│                              │42cdbf01b11, 0930437b533a0031,│     │
│                              │ e3392e4a18d3370e, ee14da6b126│     │
│                              │3a51e, 502ddaaa898e57c4, 6fd04│     │
│                              │0328d2ad46f, 23f8a503291a948d,│     │
│                              │ 923f25aa749f67f6]            │     │
├─────────────────────────────────┼─────────────────────────────────┼─────┤
│140228202800-6ef39a241f35301a9│[2f21ee71e0c6a2ce, 55ac6c3ce63│25   │
│a42cd0ed21e5fb0               │25228, e8fa4a9e63248deb, b2a24│     │
│                              │f14bb5c9ea3, d2d6e7d1a25ee0b0,│     │
│                              │ 6229cca3564cb1d1, 13ca53a93b1│     │
│                              │594bf, 47d2608ec1f9127b, 4e2a7│     │
│                              │5f30e6b4ce7, 43b59d36985d8223,│     │
│                              │ 355361a351094143, 51fd872df55│     │
│                              │686a5, 2f63e0cca690da91, febc7│     │
│                              │86c33113a8e, 52873ed85700e41f,│     │
│                              │ ca8079a4aaff28cb, 17db86d2605│     │
│                              │43ddd, b3ded380cc8fdd24, b6169│     │
│                              │f1bebbbe3ad, 458999cbf4307f34,│     │
│                              │ 280bd96790ade2d4, 32563acf872│     │
│                              │f5449, fabc9339a406616d, 36a12│     │
│                              │501ee94d15c, 6d3b99b2041af286]│     │
├─────────────────────────────────┼─────────────────────────────────┼─────┤
│140228101942-d4c9bd33cc299cc53│[b1cdbeca3a556b72, a8cf3c4f144│24   │
│d584ca1a4bf15d9               │9cc5d, 1435542d699350d9, 06d46│     │
│                              │5bfb51b0736, 2d41536695cc4814,│     │
│                              │ a96854d21780c1f9, d1c98b02398│     │
│                              │e9677, 78deb8ffdb03d406, 6c661│     │
│                              │964d1d13c61, 0d47795fb1ddba9d,│     │
│                              │ 667283570b5cedfe, 5b2baf03296│     │
│                              │63564, 08c069dc405cad2e, 6823c│     │
│                              │573efad29f6, 9b2cb60327cb7736,│     │
│                              │ 0e8ddc2d2a60e14f, f5986e1cb02│     │
│                              │378e4, fa3810e505f4f792, d5ed3│     │
│                              │cfc4a454fe9, ba76461cdd66d337,│     │
│                              │ ee42ba15ed8618eb, 688eb0dcd6a│     │
│                              │d8c86, 67c698e88b4fbdcc, c97c3│     │
│                              │83d774deae0]                  │     │
└─────────────────────────────────┴─────────────────────────────────┴─────┘

This query utilizes the collect() aggregate function which groups multiple records into a list. An important consideration made here is the use of the DISTINCT operator to ensure that duplicate values are omitted from the output. Finally, we display the top three using the LIMIT 3 constraint.

Query 5. For a given document, find recommendations of other documents like it.

Example 1: document UUID = 130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) 
WHERE d1<>d 
      AND 
      d.doc_uuid='130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8'
RETURN d1 AS Recommendations, count(*) AS Views, 
sum(
CASE r1.type 
  WHEN "impression" THEN 1 
  WHEN "pageread" THEN 1.5 
  WHEN "pagereadtime" THEN 1.5 
  WHEN "read" THEN 2 
  WHEN "click" THEN 0.5 
  ELSE 0 
END
) as Score
ORDER BY Score DESC

Result:

╒═════════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞═════════════════════════════════╪═════╪═════╡
│{doc_uuid: 130810070956-4f21f4│12   │16   │
│22b9c8a4ffd5f62fdadf1dbee8}   │     │     │
└─────────────────────────────────┴─────┴─────┘

Example 2: document UUID = 120831070849-697c56ab376445eaadd13dbb8b6d34d0

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) 
WHERE d1<>d 
      AND 
      d.doc_uuid='120831070849-697c56ab376445eaadd13dbb8b6d34d0'
RETURN d1 AS Recommendations, count(*) AS Views, 
sum(
CASE r1.type 
  WHEN "impression" THEN 1 
  WHEN "pageread" THEN 1.5 
  WHEN "pagereadtime" THEN 1.5 
  WHEN "read" THEN 2 
  WHEN "click" THEN 0.5 
  ELSE 0 
END
) as Score
ORDER BY Score DESC

Result:

╒═════════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞═════════════════════════════════╪═════╪═════╡
│{doc_uuid: 130701025930-558b15│6    │6    │
│0c485fc8928ff65b88a6f4503d}   │     │     │
├─────────────────────────────────┼─────┼─────┤
│{doc_uuid: 120507012613-7006da│6    │6    │
│2bc335425b93d347d2063dc373}   │     │     │
└─────────────────────────────────┴─────┴─────┘

This query aims to find documents similar to a given document by assigning a score based on the type of viewership activity.

Activity Score
impression 1
pageread 1.5
pagereadtime 1.5
read 2
click 0.5
First, we perform a MATCH operation to capture the 1st degree and 2nd degree viewership of a visitor node along with the document nodes and relationships. We ensure that two documents nodes are not the same by using the <> operator and also specify the initial document UUID for which we would like to find related documents.

Next, we simply return the recommended document UUIDs, their overall viewership counts and their score. To calculate the score, we utilize the CASE expression which is then tallied using the sum() aggregate function.

The CASE expression has the following syntax:

CASE 
 WHEN  THEN  //if value matches, then return result
 [WHEN ...] //repeat until all values are handled
 [ELSE ] //else return a default result
END

Finally, we sort the results in the descending order of score using the ORDER BY Score DESC clause!

Summary

In this tutorial, we saw an example of performing data analysis using Neo4j. We examined the Issuu Research Dataset and elaborated on its structure, format and fields.

Next, we formulated the model/schema of our desired graph by choosing appropriate nodes, relationships and properties. Further, we created constraints and indexes in Neo4j to ensure uniqueness and improve the performance of querying.

After this, we discussed how to import the raw JSON dataset, parse it and populate our graph by following the previously determined schema. Lastly, we saw some sample Cypher queries which helped us derive insight from our vast dataset.

What's Next?

The possibilities are endless! If you enjoyed this tutorial, then you can try to derive data analysis insights using another dataset of your choice.

While we chose to construct a rather simple graph, you can make it much more complex and detailed. In addition to this, you can also explore and experiment with the various APOC user defined procedures on the graph.

References


Want to learn more about what you can do with graph databases? Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of connected data.

Download My Free Copy