Load / Import Apache Parquet
Library Requirements
The Apache Parquet procedures depend on a client library that is not included in the APOC Extended library.
These dependencies are included in apoc-hadoop-dependencies-5.21.0-all.jar, which can be downloaded from the releases page.
Once downloaded, the file should be placed in the plugins directory and the Neo4j server restarted.
Available Procedures
The table below describes the available procedures:
Name | Description |
---|---|
apoc.load.parquet | Loads parquet from the provided Parquet file or binary |
apoc.import.parquet | Imports parquet from the provided Parquet file or binary |
As with similar procedures, apoc.load.parquet only retrieves the Parquet result, while apoc.import.parquet also creates nodes and relationships in the database.
These procedures are intended to be used together with the apoc.export.parquet.* procedures.
Configuration parameters
The procedures support the following config parameters:
name | type | default | description |
---|---|---|---|
batchSize | long | 20000 | the transaction batch size |
mapping | Map | {} | to map complex types. See the Mapping config section below. |
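As an illustrative sketch (the file name and batch size here are assumptions, not taken from the reference output), the batchSize config can be passed as the second argument to commit the import in smaller transactions:

```cypher
// import in transactions of 5000 rows instead of the default 20000
CALL apoc.import.parquet('test.parquet', {batchSize: 5000})
```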
Usage
Given the following sample graph:
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix);
If we create a test.parquet file via the procedure:

CALL apoc.export.parquet.all('test.parquet')

we can load the result by using:

CALL apoc.load.parquet('test.parquet')
value |
---|
{id: 0, tagline: "Welcome to the Real World", title: "The Matrix", released: 1999, labels: ["Movie"]} |
{id: 1, born: 1964, name: "Keanu Reeves", labels: ["Person"]} |
{id: 2, born: 1967, name: "Carrie-Anne Moss", labels: ["Person"]} |
{id: 3, born: 1961, name: "Laurence Fishburne", labels: ["Person"]} |
{id: 4, born: 1960, name: "Hugo Weaving", labels: ["Person"]} |
{id: 5, born: 1967, name: "Lilly Wachowski", labels: ["Person"]} |
{id: 6, born: 1965, name: "Lana Wachowski", labels: ["Person"]} |
{id: 7, born: 1952, name: "Joel Silver", labels: ["Person"]} |
{type: "ACTED_IN", roles: ["Neo"], target_id: 0, __source_id: 1} |
{type: "ACTED_IN", roles: ["Trinity"], target_id: 0, __source_id: 2} |
{type: "ACTED_IN", roles: ["Morpheus"], target_id: 0, __source_id: 3} |
{type: "ACTED_IN", roles: ["Agent Smith"], target_id: 0, __source_id: 4} |
{type: "DIRECTED", target_id: 0, __source_id: 5} |
{type: "DIRECTED", target_id: 0, __source_id: 6} |
{type: "PRODUCED", target_id: 0, __source_id: 7} |
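Since apoc.load.parquet only returns rows, any further processing is left to Cypher. As a sketch based on the output above (the filter and returned keys are assumptions), the Person rows could be extracted like this:

```cypher
// keep only the node rows labelled Person and project two properties
CALL apoc.load.parquet('test.parquet')
YIELD value
WITH value WHERE 'Person' IN value.labels
RETURN value.name AS name, value.born AS born
```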
Otherwise, we can re-import the test.parquet nodes and relationships by using:

CALL apoc.import.parquet('test.parquet')
file | source | format | nodes | relationships | properties | time | rows | batchSize | batches | data |
---|---|---|---|---|---|---|---|---|---|---|
"file:///import/testQuery.parquet" | "file" | "parquet" | 8 | 7 | 0 | 0 | 0 | 0 | 0 | null |
The above procedures can also load/import from a Parquet byte array produced by e.g. the apoc.export.parquet.all.stream procedure.
For example, the following queries will produce the same results as the above ones:
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value
WITH value AS bytes
// load the byte array
CALL apoc.load.parquet(bytes)
YIELD value
RETURN value
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value
WITH value AS bytes
// import the byte array
CALL apoc.import.parquet(bytes)
YIELD source
RETURN source
Mapping config
To import complex types not supported by Parquet, such as Point, Duration, lists of Duration, and so on, we can use the mapping config to convert them to the desired data type.
For example, if we have a node (:MyLabel {durationProp: duration('P5M1.5D')}) and we export it to a Parquet file/binary, we can import it by specifying a map whose keys are the property keys and whose values are the property types.
In this example, using the load procedure:
CALL apoc.load.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})
Or with the import procedure:
CALL apoc.import.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})
The mapping value types can be one of the following:
- Point
- LocalDateTime
- LocalTime
- DateTime
- Time
- Date
- Duration
- Char
- Byte
- Double
- Float
- Short
- Int
- Long
- Node
- Relationship
- BaseType followed by Array, to map a list of values, where BaseType can be one of the previous types, for example DurationArray
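For instance, a minimal sketch of the Array suffix (the property name durationListProp is hypothetical, and fileOrBinary stands for a file name or byte array as above):

```cypher
// a list of durations exported to Parquet loses its type,
// so it is restored on load via the DurationArray mapping
CALL apoc.load.parquet(fileOrBinary, {mapping: {durationListProp: 'DurationArray'}})
```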