Caching Partial Traversals in Neo4j


Repost from Max De Marzi, 23 March, 2014

Sometimes you’ll find yourself looking at a traversal and thinking… “I’m going to be doing this one thing over and over again.” That sounds kind of wasteful and years of recycling have taught us not to be wasteful. Let’s take a look at an example from our past. Look back at the Neo Love application, the one with the picture of Marilyn Monroe and Groucho Marx. Let’s see what a Neo4j 2.0 version of that query would look like:

MATCH (me:Person {name: {name}})-[:LIVES_IN]->city<-[:LIVES_IN]-person
WHERE me.orientation = person.orientation
  AND (me.orientation = "straight" XOR (me.gender = person.gender))
  AND me-[:WANTS]->()<-[:HAS]-person
  AND me-[:HAS]->()<-[:WANTS]-person
WITH DISTINCT city.name AS city_name, person, me
MATCH me-[:WANTS]->attributes<-[:HAS]-person-[:WANTS]->requirements<-[:HAS]-me
RETURN city_name, person.name AS person_name,
       COLLECT(DISTINCT attributes.name) AS my_interests,
       COLLECT(DISTINCT requirements.name) AS their_interests,
       COUNT(DISTINCT attributes) AS matching_wants,
       COUNT(DISTINCT requirements) AS matching_has
ORDER BY matching_wants / (1.0 / matching_has) DESC
LIMIT 10

Now think about what would happen if 1000 users all asked for this query at the same time. If many of them lived in the same city, we’d traverse the graph looking for the people of that city multiple times. For each one of those people, we’d traverse looking for the things they have and the things they want over and over to compare to our target user. That seems pretty wasteful. At this point Cypher has no notion of what’s going on outside the current query… so there is not much we can do about it here, but what if we turned this into Java?

First we’d start off by declaring some of the things we want to collect along the way and starting our transaction. In 2.0 even reads happen inside a transaction:

public Response getLoves(@PathParam("name") String name, @Context GraphDatabaseService db) throws IOException {
        List<HashMap<String, Object>> results = new ArrayList<>();
        HashSet<Node> people = new HashSet<>();
        HashMap<Node, ArrayList<String>> peopleWant = new HashMap<>();
        HashMap<Node, ArrayList<String>> peopleHave = new HashMap<>();
        try ( Transaction tx = db.beginTx() )

We want to find the Person with the “name” property being passed in. I am cheating a little and using IteratorUtil to quickly grab the first (and only) node from the Label look-up.

final Node user = IteratorUtil.
                    singleOrNull(db.findNodesByLabelAndProperty(Labels.Person,
                                                                "name",
                                                                name));

Assuming our user is not null, we can start collecting their information. We want their gender and orientation as well as the “things” they want in a potential mate and have to offer a potential mate.

if(user != null) {
      String myGender = (String) user.getProperty("gender");
      String myOrientation = (String) user.getProperty("orientation");
      List<String> myWants = new ArrayList<>();
      List<String> myHas = new ArrayList<>();
      for(Relationship wants : user.getRelationships(RelationshipTypes.WANTS, Direction.OUTGOING)){
          myWants.add((String) wants.getEndNode().getProperty("name"));
      }
      for(Relationship has : user.getRelationships(RelationshipTypes.HAS, Direction.OUTGOING)){
          myHas.add((String) has.getEndNode().getProperty("name"));
      }

For each place they live in (think about college students who go home on weekends, people who have long commutes, travel, etc.), we’re going to try to pair them up with people in the same location, making sure to skip the user themselves.

for(Relationship lives_in : user.getRelationships(RelationshipTypes.LIVES_IN, Direction.OUTGOING)){
    Node location = lives_in.getEndNode();
    for(Relationship lives_in_too : location.getRelationships(RelationshipTypes.LIVES_IN, Direction.INCOMING)){
        Node person = lives_in_too.getStartNode();
        if(person.equals(user)){
            continue;
        }

For each person in that city with a compatible orientation and gender, we’ll collect the things that person wants in a mate and the things they have to offer a mate, but only those that match up with our user.

if(myOrientation.equals(person.getProperty("orientation"))
   && (myOrientation.equals("straight") ^ myGender.equals(person.getProperty("gender")))){
    people.add(person);
    ArrayList<String> theirMatchingWants = new ArrayList<>();
    ArrayList<String> theirMatchingHas = new ArrayList<>();
    for(Relationship wants : person.getRelationships(RelationshipTypes.WANTS, Direction.OUTGOING)){
        String theirWant = (String) wants.getEndNode().getProperty("name");
        if(myHas.contains(theirWant)){
            theirMatchingWants.add(theirWant);
        }
    }
    for(Relationship has : person.getRelationships(RelationshipTypes.HAS, Direction.OUTGOING)){
        String theirHas = (String) has.getEndNode().getProperty("name");
        if(myWants.contains(theirHas)){
            theirMatchingHas.add(theirHas);
        }
    }
    peopleHave.put(person, theirMatchingHas);
    peopleWant.put(person, theirMatchingWants);
}

If they have at least one matching “want” and one matching “has”, we’ll add them to our results list.

String locationName = (String) location.getProperty("name");
for (Node person : people){
    if(peopleHave.get(person).size() > 0 && peopleWant.get(person).size() > 0){
        HashMap<String, Object> result = new HashMap<>();
        result.put("location", locationName);
        result.put("name", person.getProperty("name"));
        result.put("my_interests", peopleHave.get(person));
        result.put("their_interests", peopleWant.get(person));
        result.put("matching_wants", peopleHave.get(person).size());
        result.put("matching_has", peopleWant.get(person).size());
        results.add(result);
    }
}

We’ll sort the results with a custom Comparator:

Collections.sort(results, resultComparator);

That comparator will do our sorting for us:

private static final Comparator<HashMap<String, Object>> resultComparator = new Comparator<HashMap<String, Object>>() {
       @Override
       public int compare(HashMap<String, Object> o1, HashMap<String, Object> o2) {
           double o1Value = ((int) o1.get("matching_wants") / (1.0 / (int) o1.get("matching_has")));
           double o2Value = ((int) o2.get("matching_wants") / (1.0 / (int) o2.get("matching_has")));
           return Double.compare(o2Value, o1Value);
       }
   };

Finally we’ll grab just the top 10 results and return the JSON response.

return Response.ok().entity(objectMapper.
                             writeValueAsString(
                               results.subList(0,Math.min(10, results.size())))).build();

You can see all of the code in this gist.

So how would we go about caching some of this? Well, we are going to traverse:

MATCH (person:Person)-[:WANTS]->(something:Thing)

and

MATCH (person:Person)-[:HAS]->(something:Thing)

Over and over again. This is something that on an individual user basis won’t change very much. When a new user registers you will see a flurry of activity here, but once they have their profile set up, changes will be rare. So let’s cache these with Google Guava.


We’ll create a LoadingCache with a maximum size of 1M entries (one per user), which calls the loadWants method whenever a key isn’t already in the cache.

There are many different options for cache invalidation; read the documentation to find the one that fits your use case best. I’m choosing to expire the keys after 1 day, as that makes sense for this particular case. I can update individual entries when a WANTS relationship is added or deleted, but I can also just not worry about it and let time take care of things.


private static final LoadingCache<Node, ArrayList<String>> peopleWants = CacheBuilder.newBuilder()
        .expireAfterWrite(1, TimeUnit.DAYS)
        .maximumSize(1000000)
        .build(
                new CacheLoader<Node, ArrayList<String>>() {
                    public ArrayList<String> load(Node person) {
                        return loadWants(person);
                    }
                });
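
If you do want to keep an entry in sync when a WANTS relationship is added or deleted, you can evict or overwrite just that key at the point where the write happens. Here is a minimal sketch; the evictWants helper and where you call it from are my assumptions, not part of the original code:

// Hypothetical hook: call this wherever a WANTS relationship is created or deleted,
// so the cached list for that person is rebuilt on the next read.
private static void evictWants(Node person) {
    peopleWants.invalidate(person);
    // or, to avoid a cache miss on the next request, recompute it eagerly:
    // peopleWants.put(person, loadWants(person));
}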

Since I’m just interested in the names of the Things people want and have, I’ll save the names into an ArrayList instead of saving the whole nodes.

private static final ArrayList<String> loadWants(Node person){
    ArrayList<String> personWants = new ArrayList<>();
    for (Relationship wants : person.getRelationships(RelationshipTypes.WANTS, Direction.OUTGOING)){
        personWants.add((String) wants.getEndNode().getProperty("name"));
    }
    return personWants;
}
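
We do the same thing for peopleHas; roughly, as a sketch mirroring the declarations above (the loadHas helper is just loadWants following HAS relationships instead):

private static final LoadingCache<Node, ArrayList<String>> peopleHas = CacheBuilder.newBuilder()
        .expireAfterWrite(1, TimeUnit.DAYS)
        .maximumSize(1000000)
        .build(
                new CacheLoader<Node, ArrayList<String>>() {
                    public ArrayList<String> load(Node person) {
                        return loadHas(person); // same as loadWants, but over HAS relationships
                    }
                });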

With both caches in place, we can now reference them in our code in a few places:

for(Relationship wants : user.getRelationships(RelationshipTypes.WANTS, Direction.OUTGOING)){
    myWants.add((String) wants.getEndNode().getProperty("name"));
}

becomes:

List<String> myWants = peopleWants.getUnchecked(user);

and

for(Relationship wants : person.getRelationships(RelationshipTypes.WANTS, Direction.OUTGOING)){
    String theirWant = (String) wants.getEndNode().getProperty("name");
    if(myHas.contains(theirWant)){
        theirMatchingWants.add(theirWant);
    }
}

becomes:

for(String wants : peopleWants.getUnchecked(person)){
    if(myHas.contains(wants)){
        theirMatchingWants.add(wants);
    }
}

I will add another cache for:

MATCH city<-[:LIVES_IN]-person

However, in this case I want to keep this list constantly updated, so my LoadingCache will automatically refresh the values every 10 minutes. What’s nice about refresh is that while the new value is being calculated, readers keep getting the old value, so nobody has to take the hit waiting for this query.

.refreshAfterWrite(10, TimeUnit.MINUTES)
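
Put together, that city cache might look something like the sketch below; the cityResidents and loadResidents names (and the maximum size) are mine, so check the actual source for the real implementation:

private static final LoadingCache<Node, ArrayList<Node>> cityResidents = CacheBuilder.newBuilder()
        .refreshAfterWrite(10, TimeUnit.MINUTES)
        .maximumSize(10000)
        .build(
                new CacheLoader<Node, ArrayList<Node>>() {
                    public ArrayList<Node> load(Node city) {
                        return loadResidents(city);
                    }
                });

// Collect everyone living in this city by following incoming LIVES_IN relationships.
private static final ArrayList<Node> loadResidents(Node city) {
    ArrayList<Node> residents = new ArrayList<>();
    for (Relationship lives_in : city.getRelationships(RelationshipTypes.LIVES_IN, Direction.INCOMING)) {
        residents.add(lives_in.getStartNode());
    }
    return residents;
}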

The full source code for this proof of concept is available on github.

Partial Traversal Caches come in handy when building recommendation engines using Collaborative Filtering, since we can cache more complex traversals. For example, if we start with the movies a user likes:

MATCH user-[:LIKES]->movies

We can cache all the other movies liked by the people who also like the same movies the user does:

CACHE MATCH movies<-[r1:LIKES]-person-[r2:LIKES]->other_movies
WHERE r1.rating > 3 AND r2.rating > 3

Now, that CACHE MATCH command doesn’t exist yet. I can see all kinds of ways this could go:

CACHE MATCH UPTO 100000 ITEMS movies<-[r1:LIK...
CACHE MATCH UPTO 100000 ITEMS FOR 2 HOURS movies<-[r1:LIK...
CACHE MATCH UPTO 100000 ITEMS FOR 10 MINUTES WITH REFRESH movies<-[r1:LIK...
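
Until something like that exists, you can approximate it today inside an unmanaged extension with the same Guava pattern used above. A rough sketch; the otherMoviesLiked name, the RelationshipTypes.LIKES type, and the assumption that rating is stored as an integer are all mine:

private static final LoadingCache<Node, HashSet<Node>> otherMoviesLiked = CacheBuilder.newBuilder()
        .maximumSize(100000)
        .expireAfterWrite(2, TimeUnit.HOURS)
        .build(
                new CacheLoader<Node, HashSet<Node>>() {
                    public HashSet<Node> load(Node movie) {
                        HashSet<Node> others = new HashSet<>();
                        // people who liked this movie with a rating above 3...
                        for (Relationship r1 : movie.getRelationships(RelationshipTypes.LIKES, Direction.INCOMING)) {
                            if ((int) r1.getProperty("rating", 0) <= 3) { continue; }
                            Node person = r1.getStartNode();
                            // ...and the other movies they also rated above 3
                            for (Relationship r2 : person.getRelationships(RelationshipTypes.LIKES, Direction.OUTGOING)) {
                                if ((int) r2.getProperty("rating", 0) > 3) {
                                    others.add(r2.getEndNode());
                                }
                            }
                        }
                        others.remove(movie);
                        return others;
                    }
                });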

So here is your chance to build it: create a Pull Request and come work for us!



