[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]
Why Are We Doing This?
Data analysis is the phenomenon of dissecting, structuring and understanding data. In a nutshell, we want to find meaning from our data. In this tutorial, we aim to analyze a dataset from issuu.com. The goal is to find answers to a variety of simple and complex questions.There are a plethora of tools, techniques and methods available to pursue data analysis. We will use Neo4j – a graph database – to represent and visualize the data. It uses a query language called Cypher which allows us to build queries and find all the answers that we seek. By the end of this tutorial, we will be able to import a JSON dataset to Neo4j and comfortably perform queries on our data.
Recommended Reading
Readers are encouraged to read and possess a basic understanding about Neo4j and graph databases prior to starting this tutorial.I previously published a beginner’s guide to Neo4j – Getting Started with Neo4j – which introduces the concepts of graph databases and elaborates on how to perform basic CRUD (Create, Read, Update, Delete) operations using Neo4j.
Getting Set Up
The first thing we need to do is to download and install Neo4j. We will use Neo4j Desktop which provides a user-friendly UI to visualise the graph and run queries.The next step is to acquire and understand the dataset!
1. The Dataset
The dataset that we will analyze comes from Issuu – an online repository for magazines, catalogs, newspapers and other publications. They published the Issuu Research Dataset with a treasure of data about documents and visitors. The dataset is completely anonymised and provides an insight into the usage of the website.
The data is available in the JSON format. It can be downloaded and accessed from this GitHub repository. There are two flavors of this file:
- A small version –
issuu_sample.json
(This version of the dataset has4
entries). - A large version –
issuu_cw2.json
(This version of the dataset has10,000
entries).
items
.The dataset is vast and the detailed specification is available here. However, we are interested in the following attributes:
Attribute | Purpose |
---|---|
env_doc_id |
Uniquely identify each document |
visitor_uuid |
Uniquely identify each visitor |
visitor_country |
Two-letter code to identify visitor’s country |
event_type |
Type of action accomplished by visitor on the document |
2. Understanding the Graph
Now that we have selected our dataset and cherry-picked the necessary attributes, the next step is to formulate the data as a graph. To create the graph in Neo4j, we need to identify the following elements:
- Nodes
- Relationships
- Properties
Node | Properties |
---|---|
Document |
doc_uuid |
Visitor |
visitor_uuid , country |
uuid
stands for “Universally Unique IDentifier.”Tip: A node can also be thought of as a class in object-oriented programming.
What about the relationships? We can create one relationship between the
document
and visitor
nodes.Relationship | Properties |
---|---|
Visitor viewed document |
type |
viewed
is generic in nature. The type
property indicates the specific type of event that was accomplished.As an example, if we consider the visitor Thomas and the document A Diary of Jane, then the relationship can be illustrated as: Thomas viewed A Diary of Jane. However, the type of viewership could be any one of the following:
impression
click
read
download
share
pageread
pagereadtime
continuation_load
type
. The relationship can now be illustrated as: visitor Thomas viewed (and specifically downloaded) document A Diary of Jane.3. Creating Constraints & Indexes
A constraint is a mechanism to control and ensure data integrity in Neo4j. Constraints can be created on either nodes or relationships. There are basically three types of constraints in Neo4j:
- Unique node property constraints – to ensure that the graph contains only a single node with a specific label and property value.
- Node property existence constraints – to ensure that a certain property of a specific label exists within all nodes in the graph.
- Relationship property existence constraints – to ensure that a certain property exists within all relationships of a specific structure.
- Node Keys – to ensure that, for a given label and set of properties, there exists all those properties for that label and that the combination of property values is unique. Essentially, a combination of existence and uniqueness.
Document
label for the property doc_uuid
:CREATE CONSTRAINT ON (d:Document) ASSERT d.doc_uuid IS UNIQUE
On the
visitor
label for the property visitor_uuid
:CREATE CONSTRAINT ON (v:Visitor) ASSERT v.visitor_uuid IS UNIQUE
The main goal of this exercise is to query the graph to derive insights about the dataset. One way to improve retrieval efficiency is by using indexes. The idea behind an index here is the same as in relational and NoSQL databases.
In Neo4j, an index can be created on a single property of a label. These are known as single-property indexes. An index can also be created on multiple properties of a label, and these are known as composite indexes.
It is important to understand that, by creating a unique node property constraint on a property, Neo4j will also create a single-property index on that property. Hence, in our situation, the indexes will be created on the
doc_uuid
property for the document
label and on the visitor_uuid
property for the visitor
label.Now that we have created the constraints, we can view existing indexes in Neo4j using the query:
CALL db.indexes
It should return:
╒══════════════════════════════════╤══════╤══════════════════════╕ │description │state │type │ ╞══════════════════════════════════╪══════╪══════════════════════╡ │INDEX ON :Document(doc_uuid) │online│node_unique_property│ ├──────────────────────────────────┼──────┼──────────────────────┤ │INDEX ON :Visitor(visitor_uuid)│online│node_unique_property│ └──────────────────────────────────┴──────┴──────────────────────┘
Tip: More information about constraints and their effect on indexes is elaborated in the Neo4j documentation on constraints.
4. Importing the JSON Dataset
Let’s now get our hands dirty!
Assuming that Neo4j is started (with an appropriate
Database Location
selected), we should first see an empty graph. This means that there are no nodes or relationships. Our goal is to populate the graph with the data from the JSON dataset by defining its skeleton (nodes and relationships).To accomplish this, we will use the concept of user-defined procedures in Neo4j. These procedures are specific functionalities which can be re-used to manipulate the graph. We will specifically use a procedure from the APOC library which is a collection of 200+ commonly used procedures.
On Neo4j Desktop, APOC can be installed with the click of a single button.
To do this, open your project and click on the
Manage
button. Next, click on the Plugins
tab. Under this tab, you will see the APOC heading accompanied with its version and a description. Click on the Install and Restart
button. Once APOC is successfully installed, you should see a label with the message ✓ Installed
.Here is a screenshot of a successful APOC installation:
Tip: More information about user-defined procedures and the APOC library (installation, usage, examples) is elaborated in this Neo4j blog article: APOC: An Introduction to User-Defined Procedures and APOC.
We are interested in a specific procedure –
apoc.load.json
– which will allow us to load data from a JSON document. This procedure will return a singular map if the result is a JSON object or a stream of maps if the result is an array.The following code snippet illustrates the use of the
apoc.load.json
procedure with our large dataset – issuu_cw2.json
:WITH "/path/to/issuu_cw2.json" AS url CALL apoc.load.json(url) YIELD value
Now,
value
contains our data which we need to utilize to create the nodes and relationships. However, we first need to parse this bulk JSON data.To do this, we can use the
UNWIND
keyword in Neo4j. In a nutshell, UNWIND
will expand a list into a sequence of rows. (More information about UNWIND
is available in the Neo4j documentation on Cypher clauses.) At this stage, we are required to possess an understanding of the structure of the JSON data. As mentioned in step 1, we can access the entries using the key items
.This is illustrated as:
UNWIND value.items AS item
At this stage, we have access to a single row via
item
. Now, we can use the API to access the values.In Neo4j, the
WITH
clause allows using the output of a sub-query with following sub-query parts. (More information about WITH
is available in the Neo4j documentation on Cypher clauses).From our data, we are concerned about documents from the Issuu “Reader” software. We uniquely identify these documents using the
env_doc_id
attribute. However, it is to be noted that not all documents have the env_doc_id
attribute. Thus, we are required to explicitly select those entries which possess the attribute. The following code snippet illustrates this:WITH item WHERE NOT item.env_doc_id IS NULL
Notice how we access the value using
item.env_doc_id
. This style of retrieving the value makes working with JSON data on Neo4j a smooth experience.Now that we have access to the values of each entry, it is time to create the nodes and relationships. This is accomplished using the
MERGE
keyword in Neo4j. It is crucial to know the difference between CREATE
and MERGE
.For example, if we execute the following statements sequentially (provided that the constraints were not created):
CREATE (d:Document {doc_uuid:1}) RETURN (d) CREATE (d:Document {doc_uuid:1}) RETURN (d)
This will result in two nodes being created. A way to control this is by using
MERGE
, which will create the node only if it does not exist. This can be illustrated as follows:MERGE (d:Document {doc_uuid:1}) RETURN (d)
This same principle can be applied to relationships as well. The final Cypher query consisting of all the above components will look like this:
WITH "https://raw.githubusercontent.com/.../issuu_cw2.json" AS url CALL apoc.load.json(url) YIELD value UNWIND value.items AS item WITH item WHERE NOT item.env_doc_id IS NULL MERGE (document:Document {doc_uuid:item.env_doc_id}) MERGE (visitor:Visitor {visitor_uuid:item.visitor_uuid}) ON CREATE SET visitor.visitor_country = item.visitor_country MERGE (visitor)-[:VIEWED{type:item.event_type}]->(document)
If we run this Cypher query verbatim on Neo4j, the output should be (similar to):
Added 2293 labels, created 2293 nodes, set 5749 properties, created 2170 relationships, statement executed in 15523 ms.
To check whether the graph was populated successfully, we can run the Cypher query…
MATCH (n) RETURN (n) LIMIT 200
…which will only display the top 200 results. The output can be visualized as follows:
5. Let’s Query
Finally, we can derive insights from our data.
We need to ask our graph questions. These questions need to be translated to Cypher queries which will return the appropriate results. Let’s start by answering some basic and advanced questions about the dataset.
Note: For the purposes of this tutorial, we will only display the top 10 results for queries with a large number of rows. This is achieved by using the
LIMIT 10
constraint.Query 1. Find the number of visitors from each country and display them in the descending order of count.
MATCH (v:Visitor) RETURN v.visitor_country AS Country, count(v) AS Count ORDER BY count(v) DESC LIMIT 10
Result:
╒════════╤═════╕ │Country│Count│ ╞════════╪═════╡ │US │312 │ ├────────┼─────┤ │BR │143 │ ├────────┼─────┤ │MX │135 │ ├────────┼─────┤ │PE │47 │ ├────────┼─────┤ │CA │46 │ ├────────┼─────┤ │ES │43 │ ├────────┼─────┤ │GB │36 │ ├────────┼─────┤ │AR │35 │ ├────────┼─────┤ │FR │34 │ ├────────┼─────┤ │CO │32 │ └────────┴─────┘
This query simply performs an internal group by operation where visitor nodes are grouped based on the
visitor_country
property. The count is computed using the count()
aggregate function. We sort the results in the descending order using the ORDER BY <column> DESC
clause in Neo4j.Query 2. For a given document, find the number of visitors from each country.
MATCH (d:Document)<-[:VIEWED]-(v:Visitor) WHERE d.doc_uuid='140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9' RETURN v.visitor_country AS Country, count(v.visitor_country) AS Count ORDER BY count(v.visitor_country) DESC
Result:
╒════════╤═════╕ │Country│Count│ ╞════════╪═════╡ │GY │15 │ ├────────┼─────┤ │CA │12 │ ├────────┼─────┤ │US │11 │ ├────────┼─────┤ │CW │1 │ ├────────┼─────┤ │BB │1 │ └────────┴─────┘
This query is very similar to the one above. Here, we perform an internal group by operation to group the visitor nodes based on the
visitor_country
property. However, this query differs from the previous one in the sense that we want to filter the counts for a particular document UUID.In order to achieve this filtration, we need to utilise the relationship within the graph. Hence, we first
MATCH
, filter using the WHERE
clause and then return the desired values.Tip: The relationship given here…
MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
...can also be written as…
MATCH (v:Visitor)-[:VIEWED]->(d:Document)
Query 3. Find the number of occurrences for each type of viewership activity.
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor) RETURN r.type AS Type, count(d.doc_uuid) AS Count ORDER BY Count ASC
Result:
╒═════════════╤══════╕ │Type │Count│ ╞═════════════╪══════╡ │click │1 │ ├─────────────┼──────┤ │read │62 │ ├─────────────┼──────┤ │pageread │369 │ ├─────────────┼──────┤ │pagereadtime│779 │ ├─────────────┼──────┤ │impression │959 │ └─────────────┴──────┘
This query also performs an internal group by operation on the relationship property
type
. An interesting aspect of this query is the ORDER BY Count ASC
. Previously, we followed the style of using ORDER BY count(d.doc_uuid) ASC
. However, once we add a column name such as Count
, we can use that in subsequent parts of the query.Hence,
ORDER BY count(d.doc_uuid) ASC
can also be written as ORDER BY Count ASC
.Query 4. Find the visitors for each document and display the top three in the descending order of number of visitors.
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor) RETURN DISTINCT d.doc_uuid AS DocUUID, collect(DISTINCT v.visitor_uuid) AS Visitors, count(DISTINCT v.visitor_uuid) AS Count ORDER BY Count DESC LIMIT 3
Result:
╒═════════════════════════════════╤═════════════════════════════════╤═════╕ │DocUUID │Visitors │Count│ ╞═════════════════════════════════╪═════════════════════════════════╪═════╡ │140224101516-e5c074c3404177518│[4f4bd7a35b20bd1f, 78e1a8af51d│26 │ │bab9d7a65fb578e │44194, 4d49271019c7ed96, 6b2d3│ │ │ │cca6c1f8595, 19f5285fef7c1f00,│ │ │ │ 3819fc022d225057, f102d9d4fc4│ │ │ │bacdc, e5d957682bc8273b, abefb│ │ │ │3fe7784f8d3, 6170372b90397fb3,│ │ │ │ 797846998c5624ca, 43dd7a8b2fa│ │ │ │fe059, 3ec465aa8f36302b, d6b90│ │ │ │f07f29781e0, 7bd813cddec2f1b7,│ │ │ │ 3db0cb8f357dcc71, e1bfcb29e0f│ │ │ │3664a, 6d87bcdc5fa5865a, b0ba1│ │ │ │42cdbf01b11, 0930437b533a0031,│ │ │ │ e3392e4a18d3370e, ee14da6b126│ │ │ │3a51e, 502ddaaa898e57c4, 6fd04│ │ │ │0328d2ad46f, 23f8a503291a948d,│ │ │ │ 923f25aa749f67f6] │ │ ├─────────────────────────────────┼─────────────────────────────────┼─────┤ │140228202800-6ef39a241f35301a9│[2f21ee71e0c6a2ce, 55ac6c3ce63│25 │ │a42cd0ed21e5fb0 │25228, e8fa4a9e63248deb, b2a24│ │ │ │f14bb5c9ea3, d2d6e7d1a25ee0b0,│ │ │ │ 6229cca3564cb1d1, 13ca53a93b1│ │ │ │594bf, 47d2608ec1f9127b, 4e2a7│ │ │ │5f30e6b4ce7, 43b59d36985d8223,│ │ │ │ 355361a351094143, 51fd872df55│ │ │ │686a5, 2f63e0cca690da91, febc7│ │ │ │86c33113a8e, 52873ed85700e41f,│ │ │ │ ca8079a4aaff28cb, 17db86d2605│ │ │ │43ddd, b3ded380cc8fdd24, b6169│ │ │ │f1bebbbe3ad, 458999cbf4307f34,│ │ │ │ 280bd96790ade2d4, 32563acf872│ │ │ │f5449, fabc9339a406616d, 36a12│ │ │ │501ee94d15c, 6d3b99b2041af286]│ │ ├─────────────────────────────────┼─────────────────────────────────┼─────┤ │140228101942-d4c9bd33cc299cc53│[b1cdbeca3a556b72, a8cf3c4f144│24 │ │d584ca1a4bf15d9 │9cc5d, 1435542d699350d9, 06d46│ │ │ │5bfb51b0736, 2d41536695cc4814,│ │ │ │ a96854d21780c1f9, d1c98b02398│ │ │ │e9677, 78deb8ffdb03d406, 6c661│ │ │ │964d1d13c61, 0d47795fb1ddba9d,│ │ │ │ 667283570b5cedfe, 5b2baf03296│ │ │ │63564, 08c069dc405cad2e, 6823c│ │ │ │573efad29f6, 9b2cb60327cb7736,│ │ │ │ 0e8ddc2d2a60e14f, f5986e1cb02│ │ │ │378e4, fa3810e505f4f792, d5ed3│ │ │ │cfc4a454fe9, ba76461cdd66d337,│ │ │ │ ee42ba15ed8618eb, 688eb0dcd6a│ │ │ │d8c86, 67c698e88b4fbdcc, c97c3│ │ │ │83d774deae0] │ │ └─────────────────────────────────┴─────────────────────────────────┴─────┘
This query utilizes the
collect()
aggregate function which groups multiple records into a list. An important consideration made here is the use of the DISTINCT
operator to ensure that duplicate values are omitted from the output. Finally, we display the top three using the LIMIT 3
constraint.Query 5. For a given document, find recommendations of other documents like it.
Example 1: document UUID =
130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) WHERE d1<>d AND d.doc_uuid='130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8' RETURN d1 AS Recommendations, count(*) AS Views, sum( CASE r1.type WHEN "impression" THEN 1 WHEN "pageread" THEN 1.5 WHEN "pagereadtime" THEN 1.5 WHEN "read" THEN 2 WHEN "click" THEN 0.5 ELSE 0 END ) as Score ORDER BY Score DESC
Result:
╒═════════════════════════════════╤═════╤═════╕ │Recommendations │Views│Score│ ╞═════════════════════════════════╪═════╪═════╡ │{doc_uuid: 130810070956-4f21f4│12 │16 │ │22b9c8a4ffd5f62fdadf1dbee8} │ │ │ └─────────────────────────────────┴─────┴─────┘
Example 2: document UUID =
120831070849-697c56ab376445eaadd13dbb8b6d34d0
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) WHERE d1<>d AND d.doc_uuid='120831070849-697c56ab376445eaadd13dbb8b6d34d0' RETURN d1 AS Recommendations, count(*) AS Views, sum( CASE r1.type WHEN "impression" THEN 1 WHEN "pageread" THEN 1.5 WHEN "pagereadtime" THEN 1.5 WHEN "read" THEN 2 WHEN "click" THEN 0.5 ELSE 0 END ) as Score ORDER BY Score DESC
Result:
╒═════════════════════════════════╤═════╤═════╕ │Recommendations │Views│Score│ ╞═════════════════════════════════╪═════╪═════╡ │{doc_uuid: 130701025930-558b15│6 │6 │ │0c485fc8928ff65b88a6f4503d} │ │ │ ├─────────────────────────────────┼─────┼─────┤ │{doc_uuid: 120507012613-7006da│6 │6 │ │2bc335425b93d347d2063dc373} │ │ │ └─────────────────────────────────┴─────┴─────┘
This query aims to find documents similar to a given document by assigning a score based on the type of viewership activity.
Activity | Score |
---|---|
impression |
1 |
pageread |
1.5 |
pagereadtime |
1.5 |
read |
2 |
click |
0.5 |
MATCH
operation to capture the 1st degree and 2nd degree viewership of a visitor node along with the document nodes and relationships. We ensure that two documents nodes are not the same by using the <>
operator and also specify the initial document UUID for which we would like to find related documents.Next, we simply return the recommended document UUIDs, their overall viewership counts and their score. To calculate the score, we utilize the
CASE
expression which is then tallied using the sum()
aggregate function.The
CASE
expression has the following syntax:CASEWHEN THEN //if value matches, then return result [WHEN ...] //repeat until all values are handled [ELSE ] //else return a default result END
Finally, we sort the results in the descending order of score using the
ORDER BY Score DESC
clause!Summary
In this tutorial, we saw an example of performing data analysis using Neo4j. We examined the Issuu Research Dataset and elaborated on its structure, format and fields.Next, we formulated the model/schema of our desired graph by choosing appropriate nodes, relationships and properties. Further, we created constraints and indexes in Neo4j to ensure uniqueness and improve the performance of querying.
After this, we discussed how to import the raw JSON dataset, parse it and populate our graph by following the previously determined schema. Lastly, we saw some sample Cypher queries which helped us derive insight from our vast dataset.
What's Next?
The possibilities are endless! If you enjoyed this tutorial, then you can try to derive data analysis insights using another dataset of your choice.While we chose to construct a rather simple graph, you can make it much more complex and detailed. In addition to this, you can also explore and experiment with the various APOC user defined procedures on the graph.
References
Download My Free Copy