You might have data in CSV files from which to create nodes and relationships in your Neo4j graph database. It might be a lot of data, many tens of millions of lines: too much for LOAD CSV to handle transactionally.
Your Requirements
- not create legacy indexes
- not index properties at all that you just need for connecting data
- create schema indexes
- skip certain columns
- rename properties from the column names
- create your own labels based on the data in the row
- convert column values into Neo4j types (e.g. split strings or parse JSON)
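The last point, converting column values into Neo4j-storable types, is often just splitting a delimited cell into a string array or parsing a date string into a long timestamp. A minimal Java sketch of both (the cell values here are illustrative, not from the demo data):

```java
import java.text.SimpleDateFormat;
import java.util.Arrays;

public class Convert {
    public static void main(String[] args) throws Exception {
        // A pipe-delimited cell becomes a String[] property value
        String[] tags = "neo4j|groovy|import".split("\\|");
        // A date string becomes a long timestamp, which Neo4j can store as a property
        long date = new SimpleDateFormat("yyyy-MM-dd").parse("2012-01-01").getTime();
        System.out.println(Arrays.toString(tags));
        System.out.println(date > 0);
    }
}
```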
Batch Inserter API
inserter.createNode(properties, labels) → node-id
inserter.createRelationship(fromId, toId, type, properties) → rel-id
inserter.createDeferredSchemaIndex(label).on(property).create()
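To make the call pattern concrete without spinning up a Neo4j store, here is a sketch with a toy stand-in that merely hands out sequential ids and records calls. The real BatchInserter of course persists to the store files on disk; the names and the log are purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BatchSketch {
    // Toy stand-in for the Batch Inserter: hands out ids and records calls.
    static long nextId = 0;
    static List<String> log = new ArrayList<>();

    static long createNode(Map<String, Object> props, String label) {
        log.add("node " + nextId + " " + label + " " + props);
        return nextId++;
    }

    static void createRelationship(long from, long to, String type) {
        log.add("rel " + from + "-[" + type + "]->" + to);
    }

    public static void main(String[] args) {
        long author = createNode(Map.of("name", "Max"), "Author");
        long article = createNode(Map.of("title", "Matches"), "Article");
        createRelationship(author, article, "WROTE");
        System.out.println(log.size());          // three recorded calls
        System.out.println(author + " " + article);
    }
}
```

The important takeaway is that node-creation returns a long id immediately, which you pass to later calls yourself; there are no transactions and no lookups.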
Demo Data
| author | title | date |
|---|---|---|
| Max | Matches | 2012-01-01 |
| Mark | Clojure | 2013-05-21 |
| Michael | Forests | 2014-02-03 |
Setup with Groovy
In a Groovy script we can pull in the libraries we need with the @Grab annotation and import the classes into scope. (Thanks to Stefan for the tip.)

@Grab('com.xlson.groovycsv:groovycsv:1.0')
@Grab('org.neo4j:neo4j:2.1.4')
import static com.xlson.groovycsv.CsvParser.parseCsv
import org.neo4j.graphdb.*
Then we create a batch-inserter instance, which we have to make sure to shut down at the end, otherwise our store will not be valid. Reading the CSV is a simple one-liner; here is a quick example, with more details on the versatile configuration in the [API docs].
csv = new File("articles.csv").newReader()
for (line in parseCsv(csv)) {
   println "Author: $line.author, Title: $line.title, Date: $line.date"
}
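groovycsv does the header-aware parsing for us. Just to show what that amounts to, here is a plain-Java equivalent for a simple CSV without quoted fields (an assumption that holds for our demo data, but not for CSV in general):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimpleCsv {
    // Parses a simple CSV (no quoted fields) into maps keyed by the header row.
    static List<Map<String, String>> parse(List<String> lines) {
        String[] header = lines.get(0).split(",");
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) row.put(header[i], cells[i]);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = parse(List.of(
            "author,title,date",
            "Max,Matches,2012-01-01"));
        System.out.println(rows.get(0).get("author") + " wrote " + rows.get(0).get("title"));
    }
}
```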
One trick we want to employ is keeping our authors unique by name: even if an author appears on many lines, we only want to create the node once and keep it around for the next time it is referenced. A simple map from name to node-id does the job. (If you knew all author names upfront, a sorted array and Arrays.binarySearch(authors, name) would also find the author's entry.) For labels and relationship-types we declare enums that implement Neo4j's Label and RelationshipType interfaces:

enum Labels implements Label { Author, Article }
enum Types implements RelationshipType { WROTE }
So when reading our data, we now check if we already know the author; if not, we create the Author-node and cache its node-id by name. Then we create the Article-node and connect both with a WROTE-relationship.
authors = [:]                                          // name -> node-id cache
format  = new java.text.SimpleDateFormat("yyyy-MM-dd") // for the date column

for (line in parseCsv(csv)) {
   name = line.author
   if (!authors[name]) {
      authors[name] = batch.createNode([name:name], Labels.Author)
   }
   date = format.parse(line.date).time
   article = batch.createNode([title:line.title, date:date], Labels.Article)
   batch.createRelationship(authors[name], article, Types.WROTE, NO_PROPS)
   trace()  // progress output
}
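The author cache is just memoization. The same pattern can be sketched in plain Java with a HashMap and computeIfAbsent, where a counter stands in for batch.createNode, purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class AuthorCache {
    public static void main(String[] args) {
        Map<String, Long> authors = new HashMap<>();
        AtomicLong nextId = new AtomicLong(); // stands in for batch.createNode(...)
        String[] rows = {"Max", "Mark", "Max", "Michael", "Max"};
        for (String name : rows) {
            // create the "node" only on first sight, reuse its id afterwards
            authors.computeIfAbsent(name, n -> nextId.getAndIncrement());
        }
        System.out.println(authors.size()); // distinct authors created once each
    }
}
```

Five rows yield only three author entries; every later occurrence reuses the cached id instead of creating a duplicate node.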
And that’s it. Running the full import against the Kaggle author-paper dataset:

groovy import_kaggle.groovy papers.db ~/Downloads/kaggle-author-paper

In total, 11.160.348 rows (1.868.412 authors and 1.172.020 papers) took 174.122 seconds.