Fun with Music, Neo4j and Talend

Many of you know that I am a big fan of Belgian beers. But of course I have a number of other hobbies and passions, one of those being music. I have played music, created music (although that seems like a very long time ago) and still listen to new music almost every single day. So when, sometime in 2006, I heard about this really cool music site called Last.fm, I was one of the early adopters to try it out. A good 7 years and 50k+ scrobbles later, I have quite a bit of data about my musical habits.

On top of that, I have a couple of friends that have been using Last.fm as well. So this got me thinking: what if I was somehow able to get that data into neo4j, and start “walking the graph”? I am sure that would give me some interesting new musical insights… It almost feels like a “recommendation graph for music”… Let’s see where this brings us.

Basically, the approach I took had four simple high-level steps:

  1. get the data from Last.fm
  2. model that data into a neo4j graph
  3. pump the data into the neo4j database using an import tool
  4. query the data for hours on end 😉
So let’s get right to it.

Step 1: exporting the data from Last.fm

Turns out there are some cool tools out there to get the scrobble data out of Last.fm. I used the LastToLibre export script, which is very easy and simple: run it, give it a user name, and the public scrobbles will be available shortly after in a text file with the date, track name, artist name, album name, and then the MusicBrainz identifiers for the track (trackmbid), artist (artistmbid) and album (albummbid). I did this for myself and two friends, and got a sizeable dataset.
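To get a feel for what such an export looks like, here is a minimal Python sketch that parses one. The tab-separated layout and the exact field order are assumptions based on the description above, not the script’s documented format:

```python
import csv

# Assumed field order, based on the description above:
# date, track name, artist name, album name, then the MusicBrainz ids
FIELDS = ["date", "track", "artist", "album",
          "trackmbid", "artistmbid", "albummbid"]

def parse_scrobbles(lines):
    """Parse an iterable of export lines, assuming tab-separated fields."""
    scrobbles = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        row += [""] * (len(FIELDS) - len(row))  # pad missing mbids
        scrobbles.append(dict(zip(FIELDS, row)))
    return scrobbles
```

Feeding it the lines of an export file gives one dict per scrobble, which is a convenient shape to turn into nodes and relationships later on.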

Step 2: create a model out of this.

From this dataset, I then had to create a neo4j graph model. As you know, there are multiple ways you could model the data – it really depends on the query patterns – but after some consideration I went with this model below:

Ok, so I got the data, and got the model – how do I now get it into neo4j?

Step 3: import the data

This is where it got interesting. The spreadsheet import mechanism worked OK – but it really wasn’t great. It took more than an hour to load the dataset – so I had to look for alternatives. Thanks to my French friend and colleague Cédric, I bumped into the Talend ETL (Extract – Transform – Load) tools, and found out that there was a proper neo4j connector developed by Zenika, a French integrator that really seems to know their stuff.

So on I went. I installed Talend Open Studio for Big Data, which is Open Source Software just like Neo4j, and started playing around. I found the tools quite intuitive – although I must admit that Cédric was a great help in showing me around. All I had to do was create a Talend job, which consisted of a couple of steps:
  • Import the nodes: 2 subjobs, one for nodes with just a name and type, and another for the nodes that also have a “MusicBrainz Identifier” (artists, tracks, albums).
  • Import the relationships: 7 subjobs, one for every relationship type (see model). The important thing here is an additional step that makes the relationships “unique”, so that no relationship gets created twice.
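That “make the relationships unique” step boils down to de-duplicating (start, type, end) triples before loading them. A small Python sketch of the idea – the field names here are illustrative, not Talend’s:

```python
def unique_relationships(rels):
    """Keep only the first occurrence of each (start, type, end) triple,
    so that no relationship gets created twice during the import."""
    seen = set()
    out = []
    for rel in rels:
        key = (rel["start"], rel["type"], rel["end"])
        if key not in seen:
            seen.add(key)
            out.append(rel)
    return out
```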

It was very interesting to see that the import process only took about a minute with Talend, versus more than an hour with my Excel/neo4j-shell method. And if I were to use the “batch import” mechanism (which does not commit transactions), it would probably be even faster. Here’s a little overview video that shows you how to do it step by step – it literally only takes 10 minutes.

Step 4: run some neo4j Cypher queries

After the Talend job was done, I started to experiment with some Cypher queries: for example, figuring out which artists my friends had been listening to on the same day. A query along these lines does the trick – treat the START/MATCH part as a sketch, since the exact relationship and property names depend on the model above:

    START me=node:node_auto_index(name="Rik"),
          friend=node:node_auto_index(name="Friend")
    MATCH me-[:LOGS]->mylog-[:ON]->date<-[:ON]-friendlog<-[:LOGS]-friend,
          friendlog-[:FEATURES]->artist
    RETURN distinct, artist.title;

Seems like these queries are not that trivial, and there probably still is quite a bit of optimisation to be done – but that’s way above my capabilities. And obviously there are many more ideas for interesting queries – the music domain is very graphy in nature, and allows for more hours of graph fun. But that will be for a later time.
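If you would rather fire these queries from code than from the shell, Neo4j 1.9 also exposes Cypher over REST. Here is a hedged sketch using only the Python standard library; the URL assumes a default local install, and the query string is just an illustration:

```python
import json
from urllib import request

# Assumed default location of the Cypher REST endpoint on a local 1.9 install
NEO4J_CYPHER_URL = "http://localhost:7474/db/data/cypher"

def cypher_payload(query, params=None):
    """Build the JSON body the Cypher REST endpoint expects."""
    return json.dumps({"query": query, "params": params or {}})

def run_cypher(query, params=None):
    """POST a Cypher query and return the decoded JSON response."""
    req = request.Request(
        NEO4J_CYPHER_URL,
        data=cypher_payload(query, params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `run_cypher('START n=node(*) RETURN count(n)')` against a running instance would return the result columns and rows as JSON.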

Hope this is useful! Enjoy the summer!

Rik,
very cool! Just one question: where did you get the LastFM CSV files from?

bww00 says:

Great post!
The example worked like a charm.
Do you have examples of running Cypher queries through Talend?

Regards
Bryan

Chris Nott says:

Great article Rik. Thanks for sharing.

Just for info, I tried this against a Neo4j 2.0.0-M03 instance and (perhaps unsurprisingly!) I could not get it to work. From a quick poke around it appears that you have to be using 1.9.x.

@bww00: I don’t have a real working example lying around, but you just need to use the “tNeo4jRow” component (instead of tNeo4jOutput or the likes). I got it to work quite easily, but performance was slower. I believe it has something to do with the caching/parsing of the Cypher queries: if you use tNeo4jRow, every Cypher query is separate and includes more overhead…

@Chris: you are right – it only works on 1.9.x as I understand it…

Great post. Fast and easy to start querying around, and a nice question to ask via a Cypher query. It was very engaging to look for an optimization: it drops the time by 1/10 if you can filter the dates in the query.
If I may share my try:

    START
      sta=node:node_auto_index(name="STA"),
      sno=node:node_auto_index(name="SNO")
    MATCH
      sta-[:LOGS]->

Where does the destination component ‘tNeo4jOutput’ store data? Can I see the path where it stores the data loaded from the source?

Rik,

Great post!

For the tNeo4jRow component, is the request execution plan normally kept and reused for each row?

angelina says:

Thank you so much for this tutorial. I’m a beginner with neo4j and I don’t know how to create the neo4j graph model. I want to analyse CDR files which contain details of the calls and details about the caller and the receiver.
Should I create two node files? A person file and a call file?
Any help?

tonn says:

Hi, thank you very much for this post, it’s really helpful.
I’m new to graph databases and I have to create a model to load information about calls made and received.
In my xls file I have: the duration of the call, time of the call, date of the call, call type (in or out), country, city, the calling number, the called number, gender of the caller, age of the caller and finally a comment (no answer, user busy…).
So I thought about this graph model: 5 nodes (caller, call, receiver, country and city)
and 4 relationships (made_call, received_call, located_in and has_city).

Do you think that’s the right way to model the data?
Thank you very much

Nils Teller says:

Great! But some links aren’t working.
Can you give us the new links to the dataset and the python code?

Thanks in advance
