In this week’s 5-minute interview, Benjamin Squire, Senior Data Scientist at Meredith Corporation, explains how his company found Neo4j and came to rely on graph databases to solve complex problems.
How are you using Neo4j?
When you visit a website, the site drops a cookie that is used for informing the company about which web pages you visit on the website. This enables us to glean insights into what your interests are. Previously, that cookie was the only thing that we had to understand a person’s behavior, but the lifetime of a cookie often is very short due to people deleting their cookies, to firewalls or to the browser preventing cookies from dropping, et cetera.
The identity graph we built in Neo4j enables us to use multiple streams of data. Previously we’d have one stream of data that would say this person’s behavior looks like this, and we’d have another stream that somewhat matched up, but we weren’t able to easily join them based on the connecting factors. By using the graph, we’ve tied together those data streams by looking at commonalities between the two streams. You can think of it like a JOIN in SQL, which is basically identifying a key that is similar in both tables.
Because of the graph’s connectivity, we’re actually able to look at all of the different things that connect these data streams. That allows us to recognize the same user better over time than we could before by just looking at a cookie. We basically have increased our understanding of a customer by 20 or 30 percent, just looking at how the data connects over time, rather than just looking at individual cookies.
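To make the JOIN analogy concrete, here is a minimal Python sketch of tying two streams together through an identifier they share. The stream names, cookie IDs and `login:` identifiers are hypothetical and invented for illustration; this is not Meredith's data model or pipeline.

```python
from collections import defaultdict

# Hypothetical records: each stream pairs a cookie with an identifier seen alongside it.
web_stream = [
    ("cookie_a", "login:u1"),
    ("cookie_b", "login:u2"),
]
email_stream = [
    ("cookie_c", "login:u1"),
    ("cookie_d", "login:u3"),
]


def link_streams(*streams):
    """Group cookies by the identifier they share, much like a JOIN on that key."""
    cookies_by_identifier = defaultdict(set)
    for stream in streams:
        for cookie, identifier in stream:
            cookies_by_identifier[identifier].add(cookie)
    # Keep only identifiers that actually tie two or more cookies together.
    return {
        identifier: cookies
        for identifier, cookies in cookies_by_identifier.items()
        if len(cookies) > 1
    }


links = link_streams(web_stream, email_stream)
print(links)
```

In this toy data only `login:u1` appears in both streams, so it is the one key that links a cookie from each. A graph stores every such shared identifier as a connection, rather than picking a single join key up front.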
How were you solving this problem before Neo4j?
With the traditional database approach, we had to treat streams of data separately. We had no way to really link them together due to conflicts of timestamps and some cookies not matching.
Bringing those datasets together would have required too many JOINs and too much computation. We ended up using a graph database that allowed us not to worry about the timestamp and where the data was coming from. Instead we could actually link them together and create a confluence of all those datasets.
What made you choose Neo4j?
Neo4j stood out for the ease of bringing in the data in the beginning. Once we were exploring it with the pattern-matching of Neo4j’s query language Cypher, we were able to identify the data that we were looking for, which is basically understanding how cookies connect over time and how we could use that understanding to create profiles of people visiting our websites.
Rather than focusing on individual streams and cookies by themselves, we actually were creating groups of cookies or profiles that represented what our customers are reading about and interested in a lot better than we had the ability to before using just traditional databases.
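As a rough illustration of what moving from individual cookies to profiles can look like, the Python sketch below rolls page views up to the profile each cookie belongs to. The cookie-to-profile mapping, the topic names and the function `interests_by_profile` are all assumptions made for the example, not details from the project.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from cookie to the profile (group of cookies) it belongs to.
profile_of_cookie = {
    "cookie_a": "profile_1",
    "cookie_c": "profile_1",
    "cookie_b": "profile_2",
}

# Hypothetical page views: (cookie, topic of the page that was read).
page_views = [
    ("cookie_a", "recipes"),
    ("cookie_c", "recipes"),
    ("cookie_c", "gardening"),
    ("cookie_b", "travel"),
    ("cookie_z", "travel"),  # a cookie that was never linked to a profile
]


def interests_by_profile(page_views, profile_of_cookie):
    """Count topics per profile; an unlinked cookie is treated as its own profile."""
    interests = defaultdict(Counter)
    for cookie, topic in page_views:
        profile = profile_of_cookie.get(cookie, cookie)
        interests[profile][topic] += 1
    return interests


interests = interests_by_profile(page_views, profile_of_cookie)
for profile, topics in sorted(interests.items()):
    print(profile, dict(topics))
```

Viewed one cookie at a time, `cookie_a` shows a single recipe view. Viewed as a profile, the same person has two recipe views and one gardening view, which is a fuller picture of what they read.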
The Graph Data Science suite enabled us to identify and enumerate these patterns. Not only were we able to identify the data we were interested in, but we were also able to export that knowledge and act upon it.
What have been some of the most surprising results you’ve seen?
The most surprising result was really seeing how connected the data was. I used to think that we knew this data really well when we looked at it individually from each different data stream, but when you combine them all together and you actually look at the datasets as a whole, it makes you realize that it’s like trying to solve a Rubik’s Cube puzzle by only looking at one side of it.
With Neo4j, we’re actually able to combine all of these different datasets. It’s like seeing the Rubik’s Cube in three dimensions and it made it a lot easier to comprehend and understand how to act upon it.
What are you able to do now that you weren’t able to do previously?
Graph algorithms enabled us to really scale what we were after. When we started out with just looking at the patterns of data that we were interested in, it worked but it wasn’t efficient in that we had to basically do the same computation millions and millions of times, which wasn’t scalable to the billions of cookies that we had.
So the power of the algorithms is that we were able to apply this computation of Weakly Connected Components (Union Find) across the entire dataset. That was challenging because of the data size, fitting everything into memory and making sure the hardware matched what we needed it to do. But it let us avoid repeating the same small Cypher computations millions of times, and just do it globally at once using the in-memory Neo4j algorithms.
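For readers unfamiliar with the technique, the following is a minimal Python sketch of Weakly Connected Components computed with a union-find structure. It only illustrates the idea of one global pass over the edges; it is not Neo4j's in-memory implementation, and the node names and edges are hypothetical.

```python
def connected_components(edges):
    """Return the weakly connected components of an edge list using union-find."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[node] != root:
            next_node = parent[node]
            parent[node] = root
            node = next_node
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # One pass over all edges; edge direction is ignored ("weakly" connected).
    for a, b in edges:
        union(a, b)

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


# Hypothetical edges between cookies and the identifiers seen with them.
edges = [
    ("cookie_a", "login:u1"),
    ("cookie_c", "login:u1"),
    ("cookie_c", "device:d9"),
    ("cookie_e", "device:d9"),
    ("cookie_b", "login:u2"),
]

components = connected_components(edges)
for component in components:
    print(sorted(component))
```

Each edge is processed once, and every node ends up labeled with the component it belongs to. That is the contrast with starting a separate pattern match from each cookie and rediscovering the same group many times.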
What is your favorite part about working with Neo4j?
The best part of working with Neo4j is the community. When I was developing this project, we had a lot of challenges with getting data in and trying to scale to the level that we have now. And it was really through the community.neo4j.com website that I got a lot of answers, which was really helpful. I received answers from staff and other Neo4j enthusiasts like myself.
What is next for your project?
We are now pushing our project into the commercial space and delivering it for our advertisers. But besides that, I really want to push the analytics of Neo4j and look at some of the other algorithms that are available right now and gain more insights. We’re really focused on just making the data processing more efficient and getting the runtime down for processing this much data.
Any advice for someone just getting started with Neo4j?
I think that you have to focus on what you’re after. When we started out this project, we kind of threw all the data we could in there. And as we scaled it, we realized that we really needed to pick and choose our battles. I think that’s one of the challenges with graphs. It’s such a new way to look at your data as I explained with the Rubik’s Cube analogy. When you look at it from that third dimension and you start seeing all the different sides of it, you start to think, “Oh, I can solve all of these problems at once.”
But when you start scaling to 20 billion nodes, you need to focus on exactly what you’re trying to solve. You will need to have a different graph for each type of project to really focus in and scale to the size that you’re after. Someone new to Neo4j should use the Neo4j Community a lot to ask questions. People in the community are responsive and can help you out a lot.