[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]
They say that good artists copy, but great artists steal, right? At Artfinder, the global online marketplace for original art, we’ve just launched ‘My Artfinder’, a mix tape of personal recommendations for users, just like Spotify’s Discover Weekly. But not to be outdone, our recommendations are updated daily, thanks to the speed of Neo4j!
When we first started thinking about implementing artwork recommendations for our users, two methodologies came to the forefront quite early on: machine learning and collaborative filtering.
The Challenge with Machine Learning and Art
We currently have ~180,000 artworks listed on the platform, with hundreds of new works added every day. Our artists sign on to the platform and once there, they can upload and classify their artwork freely.
One thing that quickly became apparent with the machine learning option was the necessity for reliable, concrete artwork classifications in order to feed any recommendation engine we would come to develop.
Now, I’m not an art connoisseur by any standard, but one thing that’s clear is that everything about art is highly subjective, so one of the most surefire ways to get correct classifications would be to classify a few hundred thousand artworks manually. This would be the only way to ensure all classifications were equal.
Further to this, different people may choose to classify an artwork differently, raising doubts over the validity of the classifications unless we had a single person do all 200,000. Not really an ideal use of time.
Collaborative Filtering and Neo4j
So, with this in mind, we began looking at the other standout option for real-time recommendations: collaborative filtering. We were looking for something exceptionally powerful, yet flexible enough to meet our particular demands, and were delighted by Neo4j’s ease of use as well as its amazing online resources.
Getting an initial proof-of-concept up and running was a breeze. We already had a fairly large dataset when it came to what our users personally like/dislike, so implementation was straightforward.
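To make the idea concrete, here is a minimal sketch of user-based collaborative filtering over like data, written in plain Python rather than Cypher. The sample users, artwork ids and the recommend function are all hypothetical and only illustrate the technique; this is not our production implementation.

```python
from collections import Counter

# Hypothetical sample data: user id -> set of liked artwork ids.
likes = {
    "alice": {"a1", "a2", "a3"},
    "bob": {"a2", "a3", "a4"},
    "carol": {"a3", "a5"},
    "dave": {"a9"},
}


def recommend(likes, user, limit=3):
    """Rank artworks liked by users who share at least one like with `user`."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared taste, so this user contributes nothing
        # Artworks the other user likes that `user` has not liked yet,
        # weighted by how many likes the two users have in common.
        for artwork in theirs - mine:
            scores[artwork] += overlap
    # Highest score first; ties broken by artwork id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [artwork for artwork, _ in ranked[:limit]]


print(recommend(likes, "alice"))
```

In a graph database the same idea becomes a traversal: from a user, across their likes, to other users, and on to what those users like.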
I’ve spent a lot of time working with emerging technologies, and one thing that immediately struck me about Neo4j was the simplicity of the Cypher query language and how easy it is to construct and read queries. I found Cypher queries far easier to understand than anything I’ve come across in the RDBMS/SQL world.
Cypher’s simple, yet powerful, syntax made working with the graph much easier. Iterating on, testing and profiling queries later in the development cycle was also easier thanks to the built-in web front-end (i.e., the Neo4j Browser). We could analyse query performance, get visual results quickly to see where any bottlenecks were, and try out different versions, all of which helped us get a usable system up and running quickly.
What We Learned at Artfinder using Neo4j
Here are a few things we learned:
1. We used the EXPLAIN and PROFILE keywords to get a good understanding of how Neo4j interprets our queries. In a lot of cases, PROFILE was able to show us exactly where our queries were falling down (e.g., a forgotten index resulting in a full DB scan). Having immediate visibility of these sticking points meant we could solve them that much faster, and it saved us plenty of time and head-scratching.
2. We were sparing in our use of node labels when performing lookups.
At first, it can seem counter-intuitive to be less specific in your queries, but in a lot of cases, specifying a node label on a related node will result in an unneeded Filter operation against the returned set. Here is an example of finding all of the artwork that one of our users likes:
MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork:Artwork) RETURN artwork
DB hits = 3060
Whereas leaving off the Artwork label from the related node results in:
MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork) RETURN artwork
DB hits = 2300
Needless to say, this isn’t always the case, but understanding this concept went a long way towards improving the speed of our queries.
3. When we modelled our data, we were very specific with relationships between our nodes. This allowed us to be much more targeted with our queries. For example, consider a less specific model:
(:User)-[:LIKES]->(:Artist)
(:User)-[:LIKES]->(:Artwork)
The above re-uses the LIKES relationship type to connect a user to artworks and artists alike. If you have 10 artists and 20 artworks, doing a scan for :LIKES may touch all 30 nodes (or more). So it’s much better to be specific with relationships:
(:User)-[:LIKES_ARTIST]->(:Artist)
(:User)-[:LIKES_ARTWORK]->(:Artwork)
This relationship specificity limited the number of nodes returned when querying for a very specific type of relationship.
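The effect can be illustrated with a toy model in Python. This is only a sketch: the build_index helper and the sample edges are hypothetical, and grouping relationships by type in a dictionary is a simplification for illustration, not a description of how Neo4j stores data internally.

```python
from collections import defaultdict


def build_index(edges):
    """Group each node's outgoing relationships by relationship type."""
    index = defaultdict(lambda: defaultdict(list))
    for source, rel_type, target in edges:
        index[source][rel_type].append(target)
    return index


# Hypothetical user "u1" who likes 10 artists and 20 artworks.
artists = [f"artist{i}" for i in range(10)]
artworks = [f"artwork{i}" for i in range(20)]

# Generic model: one LIKES type for everything.
generic = build_index(
    [("u1", "LIKES", a) for a in artists] + [("u1", "LIKES", a) for a in artworks]
)
# Specific model: one relationship type per kind of target.
specific = build_index(
    [("u1", "LIKES_ARTIST", a) for a in artists]
    + [("u1", "LIKES_ARTWORK", a) for a in artworks]
)

# Finding liked artworks in the generic model means visiting every LIKES
# target and then filtering out the artists afterwards.
generic_candidates = generic["u1"]["LIKES"]
liked_generic = [n for n in generic_candidates if n.startswith("artwork")]

# In the specific model the relationship type alone narrows the search.
specific_candidates = specific["u1"]["LIKES_ARTWORK"]

print(len(generic_candidates), len(specific_candidates))
```

Both models return the same 20 artworks, but the generic one has to look at all 30 liked nodes to find them.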
4. Reduce Cardinality of Results
Often you will chain results from one MATCH statement into another, using the results from the first as the starting point for the next traversal. In instances where the first statement may return the same nodes multiple times, you can shorten your query time by reducing duplicates. An example of this would be finding all User nodes that have a LIKES relationship to the same items as you and then going on to find out what kind of music those users listen to.
MATCH (:User {id: 1})-[:LIKES]->(something)<-[:LIKES]-(user), (user)-[:LISTENS_TO_GENRE]-(genre) RETURN genre.name
In the first portion of this query, you may find that two of the users like multiple things you also like. This would mean that we run two extra traversals in the next portion of the query.
To avoid this, we use DISTINCT to filter out duplicates and reduce the number of further lookups we need to do.
MATCH (:User {id: 1})-[:LIKES]->(something)<-[:LIKES]-(user) WITH DISTINCT user MATCH (user)-[:LISTENS_TO_GENRE]-(genre) RETURN DISTINCT genre.name
By removing any duplicates, we significantly reduce the number of traversals and therefore the query time. This has a really big impact as the graph gets bigger with time. Special thanks to Mark Needham for this tip.
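The saving is easy to see in a small Python sketch that counts lookups with and without de-duplication. The sample data and the similar_users and genres_for helpers are hypothetical, and the lookup counter stands in for the second traversal rather than measuring real database hits.

```python
# Hypothetical sample data: who likes what, and who listens to which genres.
likes = {
    "me": {"a1", "a2", "a3"},
    "u2": {"a1", "a2"},
    "u3": {"a2", "a3"},
    "u4": {"a7"},
}
genres = {"u2": {"jazz"}, "u3": {"jazz", "rock"}, "u4": {"pop"}}


def similar_users(likes, me):
    """Return one row per (shared item, user) pair, like the first MATCH."""
    rows = []
    for item in sorted(likes[me]):
        for user in sorted(likes):
            if user != me and item in likes[user]:
                rows.append(user)
    return rows


def genres_for(rows, genres):
    """Look up genres for every row and count how many lookups were made."""
    lookups = 0
    names = set()
    for user in rows:
        lookups += 1
        names |= genres.get(user, set())
    return names, lookups


rows = similar_users(likes, "me")

# Without de-duplication, every duplicate row triggers another lookup.
names_all, lookups_all = genres_for(rows, genres)

# Equivalent of WITH DISTINCT user: keep each user once, preserving order.
unique_rows = list(dict.fromkeys(rows))
names_distinct, lookups_distinct = genres_for(unique_rows, genres)

print(lookups_all, lookups_distinct, sorted(names_distinct))
```

The set of genres is identical either way; only the amount of work needed to produce it changes.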
Conclusion
So, what have we achieved?
I’d go as far as to say that as the first site in the art space to deliver a completely personalised home page, ‘My Artfinder’ is a new way of shopping for art. In a traditionally curator-led, advisory market, personalised recommendations based on individual users’ tastes are a huge leap forwards.
While we're definitely not done tweaking, refactoring and optimizing our implementation, the speed with which we could go from concept to production still amazes me.
An invaluable tool in our quest to lead the way in personalisation within the art space, Neo4j has helped our team to develop, deploy and maintain an in-production graph database system that provides thousands of users with relevant, real-time recommendations on a daily basis.