Neo4j and Apache Hadoop

Hadoop is well established as large scale data processing platform. It can easily pre-process huge datasets and information streams to extract and project the high quality data vectors that enrich your graph model with relevant new information. Here we want to demonstrate some approaches that used Hadoop jobs to prepare data for ingestion into Neo4j.
You should have a sound understanding of both Hadoop and Neo4j, each data model, infrastructure requirements, data processing paradigm and APIs to leverage them effectively together.


General Observations

Hadoop is used everywhere to process large amounts of data. Graph Databases on the other hand are all combining highly connected, high quality data from a variety of sources. Using Hadoop to efficiently pre-process, filter and aggregate raw information to be suitable for Neo4j imports is a reasonable approach.

Real world, log-, sensor-, transaction- and event data is noisy. Most of the data frames don’t add new information but are repetetive. For enriching a good graph model with variant information you want to filter out the noise and project the raw data into a format that can be easily ingested in your graph.

In the past, there were some approaches that used Hadoop to quickly generate Neo4j datastores directly. While this approach is performant, it is also tightly coupled to the store-format of a certain Neo4j version as it has to duplicate the functionality of writing to split-up store-files. With the parallel neo4j-import tool and APIs provided in Neo4j, such a solution is no longer needed. The import facilities scale across a large number of CPUs to maximize import performance.