Developer

Flexible Neo4j Batch Import with Groovy

Head of Product Innovation & Developer Strategy, Neo4j

October 10, 2014

4 min read

Written by Michael Hunger, originally posted on his blog.

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of million lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

not create legacy indexes
not index properties at all that you just need for connecting data
create schema indexes
skip certain columns
rename properties from the column names
create your own labels based on the data in the row
convert column values into Neo4j types (e.g. split strings or parse JSON)

Batch Inserter API

Here is where you, even as non-Java developer can write a few lines of code and make it happen.
The Neo4j Batch-Inserter APIs are fast and simple, you get very far with just the few main methods

inserter.createNode(properties, labels) → node-id
inserter.createRelationship(fromId, toId, type, properties) → rel-id
inserter.createDeferredSchemaIndex(label).on(property).create()

Demo Data

For our demo we want to import articles, authors and citations from the MS citation database. // todo

The file looks like this:

author	title	date
Max	Matches	2012-01-01
Mark	Clojure	2013-05-21
Michael	Forests	2014-02-03

author

title

date

Max

Matches

2012-01-01

Mark

Clojure

2013-05-21

Michael

Forests

2014-02-03

Setup with Groovy

To keep the “Java”ness of this exercise to a minimum, I choose groovy a dynamic JVM language that feels closer to ruby and javascript than Java.
You can run Groovy programs as script much like in other scripting languages.
And more importantly it comes with a cool feature that allows us to run the whole thing without setting up and build configuration or installing dependencies manually.

If you want to, you can also do the same in jRuby, Jython or JavaScript (Rhino on Java7, Nashorn on Java8) or Lisp (Clojure).
Would love to get a variant of the program in those languages.

Make sure you have a Java Development Kit (JDK) and Groovy installed.

We need two dependencies a CSV reader (GroovyCSV), and Neo4j and can just declare them with a @Grab annotation and import the classes into scope. (Thanks to Stefan for the tip.)

@Grab('com.xlson.groovycsv:groovycsv:1.0')
@Grab('org.neo4j:neo4j:2.1.4')
import static com.xlson.groovycsv.CsvParser.parseCsv
import org.neo4j.graphdb.*

Then we create a batch-inserter instance which we have to make sure to shut down at the end, otherwise our store will not be valid.
The CSV reading is a simple one-liner, here is a quick example, more details on the versatile config in the [API docs].

csv = new File("articles.csv").newReader()
for (line in parseCsv(csv)) {
   println "Author: $line.author, Title: $line.title Date $line.date"
}

One trick we want to employ is keeping our authors unique by name, so even if they appear on multiple lines, we only want to create them once and then keep them around for the next time they are referenced.

Note:

To keep the example simple we just use a Map, if you have to save memory you can look into a more efficient datastructure and a double pass.
My recommendation would be a sorted name array where the array-index equals the node-id, so you can use Arrays.binarySearch(authors,name) on it to find the node-id of the author.

We define two enums, one for labels and one for relationship-types.

enum Labels implements Label { Author, Article }
enum Types implements RelationshipType { WROTE }

So when reading our data, we now check if we already know the author, if not, we create theAuthor-node and cache the node-id by name.
Then we create the Article-node and connect both with a WROTE-relationship.

  for (line in parseCsv(csv)) {
     name = line.author
     if (!authors[name]) {
        authors[name] = batch.createNode([name:name],Labels.Author)
     }
     date = format.parse(line.date).time
     article = batch.createNode([title:line.title, date:date],Labels.Article)
     batch.createRelationship(authors[name] ,article, Types.WROTE, NO_PROPS)
     trace()
  }

And that’s it.

I ran the data import with the Kaggle Authorship Competition containing 12M Author→Paper relationships.

The import used a slightly modified script that took care of the 3 files of authors, articles and the mapping.
It loaded the 12m rows, 1.8m authors and 1.2m articles in 175 seconds taking about 1-2 s per 100k rows.

groovy import_kaggle.groovy papers.db ~/Downloads/kaggle-author-paper

Total 11.160.348 rows 1.868.412 Authors and 1.172.020 Papers took 174.122 seconds.

You can find the full script in this GitHub gist.

If you’re interested in meeting me, more treats (full-day training, hackathon) or graph database fun in general,
don’t hesitate to join us on Oct 22 2014 in San Francisco for this years GraphConnect conference for all things Neo4j.

Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today.

Download My Ebook