Chapter 2. The Yelp example

This chapter introduces the Yelp Open Dataset that is used throughout to exemplify how the Neo4j Graph Algorithms work.

2.1. The Yelp Open Dataset

Yelp.com has been running the Yelp Dataset challenge since 2013, a competition that encourages people to explore and research Yelp’s open dataset. As of Round 10 of the challenge, the dataset contained:

  • almost 5 million reviews
  • over 1.1 million users
  • over 150,000 businesses
  • 12 metropolitan areas

Since its launch, the dataset has become very popular, with hundreds of academic papers written about it. It contains well-structured and highly relational data, and is therefore a realistic dataset with which to showcase Neo4j and graph algorithms.

We will illustrate how to use graph algorithms on a social network of friends, and how to create and analyse an inferred graph (for example, by projecting a review co-occurrence graph, or a similarity graph between users based on their reviews). For more information, it is also worth checking out past winners and their work.

2.2. Data

In Round 10 of the challenge, the dataset included:

  • 156,639 businesses
  • 1,005,693 tips from users about businesses
  • 4,736,897 reviews of businesses by users
  • 9,489,337 users total
  • 35,444,850 friend relationships

You can download the dataset in JSON format by filling out a form on Yelp’s website. There are 6 JSON files available (detailed documentation). For the purposes of this example, we will ignore the photos and checkins files as they are not relevant for our analysis.

We will create a knowledge graph from the rest of the files, and will use the APOC plugin to help us with importing and batching data in Neo4j. Depending on your setup, the import might take some time (the user.json file contains data for a social network of about 10 million people). While review.json is even bigger, it consists mostly of the review text, which we skip, so its import will be faster. We do not need the text itself, only the meta-data about each review: for example, who wrote the review and how the business was rated.
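
The idea of keeping only the meta-data can be illustrated outside Neo4j. The following Python sketch assumes that the review file holds one JSON object per line and that the field names match those used in the import statements in this chapter (review_id, user_id, business_id, stars, text). The sample records are made up for illustration.

```python
import json

# Hypothetical sample in the assumed one-object-per-line layout of review.json.
SAMPLE_LINES = [
    '{"review_id": "r1", "user_id": "u1", "business_id": "b1", "stars": 4, "text": "Great tacos."}',
    '{"review_id": "r2", "user_id": "u2", "business_id": "b1", "stars": 2, "text": "Slow service."}',
]


def review_metadata(lines, dropped_keys=("text",)):
    """Yield each review as a dict without the dropped keys."""
    for line in lines:
        record = json.loads(line)
        yield {key: val for key, val in record.items() if key not in dropped_keys}


metadata = list(review_metadata(SAMPLE_LINES))
print(metadata[0])
```

Because the records are processed one line at a time, the full file never has to fit in memory, which is the same reason the import statements in this chapter work in batches.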

2.3. Graph model

Figure: Yelp graph model

Our graph contains nodes labeled User, which can have a FRIEND relationship with other users. Users also write reviews and tips about businesses. All of the meta-data is stored as properties of nodes, except for the categories of the businesses, which are represented by separate nodes labeled Category.

The graph model always depends on the application we have in mind for it. Our application is to analyse (inferred) networks with graph algorithms. If we were to use our graph as a recommendation engine, we might construct a different graph model.

For further information on using Neo4j as a recommendation engine, check out this great guide or this educational video.

2.4. Import

Define graph schema (constraint/index). 

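// First map: indexes to create. Second map: unique constraints to create.
// Category is indexed on id, the property used by the MERGE in the business import.
CALL apoc.schema.assert(
{Category:['id']},
{Business:['id'],User:['id'],Review:['id']});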

Load businesses. 

CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/business.json') YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'],[])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{id:category})
MERGE (b)-[:IN_CATEGORY]->(c)
",{batchSize: 10000, iterateList: true});

Load tips. 

CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/tip.json') YIELD value RETURN value
","
MATCH (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (u)-[:TIP{date:value.date,likes:value.likes}]->(b)
",{batchSize: 20000, iterateList: true});

Load reviews. 

CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/review.json')
YIELD value RETURN value
","
MERGE (b:Business{id:value.business_id})
MERGE (u:User{id:value.user_id})
MERGE (r:Review{id:value.review_id})
MERGE (u)-[:WROTE]->(r)
MERGE (r)-[:REVIEWS]->(b)
SET r += apoc.map.clean(value, ['business_id','user_id','review_id','text'],[0])
",{batchSize: 10000, iterateList: true});

Load users. 

CALL apoc.periodic.iterate("
CALL apoc.load.json('file:///dataset/user.json')
YIELD value RETURN value
","
MERGE (u:User{id:value.user_id})
SET u += apoc.map.clean(value, ['friends','user_id'],[0])
WITH u,value.friends as friends
UNWIND friends as friend
MERGE (u1:User{id:friend})
MERGE (u)-[:FRIEND]-(u1)
",{batchSize: 100, iterateList: true});

2.5. Networks

2.5.1. Social network

A social network is a theoretical construct, useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies. An axiom of the social network approach to understanding social interaction is that social phenomena should be primarily conceived and investigated through the properties of relationships between and within nodes, instead of the properties of these nodes themselves. Precisely because many different types of relations, singular or in combination, form these network configurations, network analytics are useful to a broad range of research enterprises.

Social network analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, meme spread, friendship and acquaintance networks, collaboration graphs, kinship, and disease transmission.

Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, demography, communication studies, economics, geography, history, information science, organizational studies, political science, social psychology, development studies, sociolinguistics, and computer science.

Yelp’s friendship network is an undirected graph with unweighted friend relationships between users. Over 500,000 users have no friends; they are ignored in this analysis.

2.5.1.1. Global graph statistics:

Nodes : 8981389

Relationships : 35444850

Weakly connected components : 18512

Nodes in largest WCC : 8938630

Edges in largest WCC : 35420520

Triangle count :

Average clustering coefficient :

Graph diameter (longest shortest path):
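
To make the weakly connected component figures concrete, here is a minimal Python sketch that finds components with a union-find structure. It is only an illustration on a made-up edge list, not the procedure used to produce the numbers above. Nodes that appear in no edge never enter the structure, which mirrors how users without friends are left out of this analysis.

```python
from collections import Counter


def weakly_connected_components(edges):
    """Map every node that appears in an edge to a component representative."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    return {node: find(node) for node in list(parent)}


# Hypothetical friendships forming two components: {a, b, c} and {d, e}.
friendships = [("a", "b"), ("b", "c"), ("d", "e")]
components = weakly_connected_components(friendships)
sizes = Counter(components.values())
print(len(sizes), max(sizes.values()))  # 2 3
```

The number of distinct representatives corresponds to the component count, and the largest value in sizes corresponds to the number of nodes in the largest component.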

2.5.1.2. Local graph statistics:

Use APOC to calculate local statistics. 

MATCH (u:User)
RETURN avg(apoc.node.degree(u,'FRIEND')) as average_friends,
       stdev(apoc.node.degree(u,'FRIEND')) as stdev_friends,
       max(apoc.node.degree(u,'FRIEND')) as max_friends,
       min(apoc.node.degree(u,'FRIEND')) as min_friends

Average number of friends : 7.47

Standard deviation of friends : 46.96

Minimum count of friends : 1

Maximum count of friends : 14995
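
The same degree statistics can be sketched in plain Python on a small, made-up friendship list. The sketch assumes each friendship is stored once and counts towards the degree of both users. statistics.stdev computes the sample standard deviation, which is also what Cypher's stdev aggregation returns.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical undirected friendships, each stored once.
friendships = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

# An undirected relationship adds one to the degree of both endpoints.
degree = Counter()
for u, v in friendships:
    degree[u] += 1
    degree[v] += 1

values = list(degree.values())
summary = {
    "average_friends": mean(values),
    "stdev_friends": stdev(values),
    "max_friends": max(values),
    "min_friends": min(values),
}
print(summary)
```

Users who appear in no friendship are absent from degree, so the minimum in this sketch can never be 0.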

Prior work:

2.5.2. Projecting a review co-occurrence graph

We can try to find which businesses are often reviewed by the same users, by inferring a co-occurrence network between them.

Co-occurrence networks are the collective interconnection of nodes, based on their paired presence within a specified domain. Our network is generated by connecting pairs of businesses using a set of criteria defining co-occurrence.

The co-occurrence criterion for this network is that a pair of businesses must have more than 5 common reviewers. We save the count of common reviewers as a property of the relationship, which will be used as a weight in community detection analysis. The inferred graph is undirected, as changing the direction of the relationships does not imply any semantic difference. We will limit our network to businesses in Las Vegas that have more than 10 reviews, and project a co-occurrence relationship between them:

Project a review co-occurrence between businesses. 

// Outer statement: Las Vegas businesses with more than 10 reviews.
// Inner statement: count the distinct users who reviewed both b1 and b2,
// and keep pairs with more than 5 common reviewers.
CALL apoc.periodic.iterate('
MATCH (b1:Business)
WHERE size((b1)<-[:REVIEWS]-()) > 10 AND b1.city="Las Vegas"
RETURN b1
','
MATCH (b1)<-[:REVIEWS]-(r1)
MATCH (r1)<-[:WROTE]-(u)
MATCH (u)-[:WROTE]->(r2)
MATCH (r2)-[:REVIEWS]->(b2)
WHERE id(b1) < id(b2) AND b2.city="Las Vegas"
WITH b1, b2, count(DISTINCT u) AS weight WHERE weight > 5
MERGE (b1)-[cr:CO_OCCURENT_REVIEWS]-(b2)
ON CREATE SET cr.weight = weight
',{batchSize: 1});
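
The projection logic can also be sketched in plain Python on a handful of made-up (user, business) review pairs. This is an illustration of the idea rather than a replacement for the Cypher statement: it counts each shared reviewer once per pair of businesses, and the threshold is a parameter so that the tiny sample produces a result.

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence(reviews, min_common):
    """Return {(b1, b2): n} for business pairs with more than min_common shared reviewers."""
    businesses_by_user = defaultdict(set)
    for user, business in reviews:
        businesses_by_user[user].add(business)  # a set ignores repeat reviews

    counts = defaultdict(int)
    for businesses in businesses_by_user.values():
        # Sorting keeps each pair in one canonical order, like id(b1) < id(b2).
        for pair in combinations(sorted(businesses), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > min_common}


# Hypothetical reviews; u1 reviewed b1 twice.
reviews = [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b1"),
    ("u2", "b1"), ("u2", "b2"),
    ("u3", "b1"), ("u3", "b3"),
]
weights = co_occurrence(reviews, min_common=1)
print(weights)  # {('b1', 'b2'): 2}
```

The resulting counts play the role of the weight property on the projected relationship.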

2.5.3. Projecting a review similarity graph

We can try to find similar groups of users by projecting a review similarity network between them. The idea is to start with users that have more than 10 reviews, and find all pairs of users who have reviewed more than 10 common businesses. We do this to filter out users with not enough data. We could do something similar to filter out users who have reviewed every business (probably a bot, or someone very bored!).

Once we find pairs of users, we calculate the cosine similarity of their reviews, and create a relationship only if the cosine similarity is greater than 0, which is sometimes called hard similarity. We do this so that we do not end up with a complete graph, where every pair of users is connected. Most community detection algorithms perform poorly on a complete graph. The cosine similarity between a pair of users is saved as a property of the relationship and can be used as a weight in graph algorithms. The projected graph is modeled as undirected, as the direction of the relationships has no semantic value.

Projecting a review similarity graph is often used in recommendations; similar users are calculated based on review ratings, so we can recommend to a user what similar users liked.

Create a review similarity graph. 

// Outer statement: users with more than 10 reviews.
// Inner statement: pair each with other users who share more than 10 reviewed
// businesses, and store the cosine similarity of their star ratings.
CALL apoc.periodic.iterate(
"MATCH (p1:User) WHERE size((p1)-[:WROTE]->()) > 10 RETURN p1",
"
MATCH (p1)-[:WROTE]->(r1)-->()<--(r2)<-[:WROTE]-(p2)
WHERE id(p1) < id(p2) AND size((p2)-[:WROTE]->()) > 10
WITH p1,p2,count(*) as coop, collect(r1.stars) as s1, collect(r2.stars) as s2 where coop > 10
WITH p1,p2, apoc.algo.cosineSimilarity(s1,s2) as cosineSimilarity WHERE cosineSimilarity > 0
MERGE (p1)-[s:SIMILAR_REVIEWS]-(p2) SET s.weight = cosineSimilarity"
, {batchSize:100, parallel:false,iterateList:true});
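
For readers unfamiliar with cosine similarity, the following Python sketch shows the calculation on made-up star ratings. It assumes the two lists are aligned, so that position i in each list refers to the same business, as the paired collect calls in the statement above are meant to ensure. The similarity is the dot product of the two rating vectors divided by the product of their lengths.

```python
from math import sqrt


def cosine_similarity(xs, ys):
    """Cosine of the angle between two equally long rating vectors."""
    if len(xs) != len(ys):
        raise ValueError("rating vectors must have the same length")
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0


# Hypothetical star ratings given by three users to the same three businesses.
alice = [5, 4, 1]
bob = [5, 4, 1]
carol = [1, 2, 5]

same_taste = cosine_similarity(alice, bob)
different_taste = cosine_similarity(alice, carol)
print(round(same_taste, 3), round(different_taste, 3))
```

Identical rating vectors give a similarity of 1, while users who rate the same businesses very differently score lower.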

Prior work: