Load / Import Apache Parquet

Library Requirements

The Apache Parquet procedures have dependencies on a client library that is not included in the APOC Extended library.

These dependencies are included in apoc-hadoop-dependencies-5.20.0-all.jar, which can be downloaded from the releases page.

Once that file is downloaded, it should be placed in the plugins directory and the Neo4j Server restarted.

Available Procedures

The table below describes the available procedures:

Name                   Description
apoc.load.parquet      Loads parquet from the provided Parquet file or binary
apoc.import.parquet    Imports parquet from the provided Parquet file or binary

As with the other load and import procedures, apoc.load.parquet only returns the content of the Parquet file, while apoc.import.parquet creates nodes and relationships in the database.

These procedures are intended to be used together with the apoc.export.parquet.* procedures.

Configuration parameters

The procedures support the following config parameters:

Table 1. Config parameters
name       type  default  description
batchSize  long  20000    the transaction batch size
mapping    Map   {}       to map complex files. See Mapping config section below
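
As an illustration, the config map is passed as the second argument of the procedure. The sketch below assumes the test.parquet file used in the Usage section; the value 5000 is arbitrary and only shows how the default is overridden.

```cypher
// Import with a smaller transaction batch size than the default (20000)
CALL apoc.import.parquet('test.parquet', {batchSize: 5000})
```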

Usage

Given the following sample graph:

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix);

If we create a test.parquet file by calling CALL apoc.export.parquet.all('test.parquet'), we can load the result by using:

CALL apoc.load.parquet('test.parquet')
Table 2. Results
value

{__id: 0, tagline: "Welcome to the Real World", title: "The Matrix", released: 1999, __labels: ["Movie"]}

{__id: 1, born: 1964, name: "Keanu Reeves", __labels: ["Person"]}

{__id: 2, born: 1967, name: "Carrie-Anne Moss", __labels: ["Person"]}

{__id: 3, born: 1961, name: "Laurence Fishburne", __labels: ["Person"]}

{__id: 4, born: 1960, name: "Hugo Weaving", __labels: ["Person"]}

{__id: 5, born: 1967, name: "Lilly Wachowski", __labels: ["Person"]}

{__id: 6, born: 1965, name: "Lana Wachowski", __labels: ["Person"]}

{__id: 7, born: 1952, name: "Joel Silver", __labels: ["Person"]}

{__type: "ACTED_IN", roles: ["Neo"], __target_id: 0, __source_id: 1}

{__type: "ACTED_IN", roles: ["Trinity"], __target_id: 0, __source_id: 2}

{__type: "ACTED_IN", roles: ["Morpheus"], __target_id: 0, __source_id: 3}

{__type: "ACTED_IN", roles: ["Agent Smith"], __target_id: 0, __source_id: 4}

{__type: "DIRECTED", __target_id: 0, __source_id: 5}

{__type: "DIRECTED", __target_id: 0, __source_id: 6}

{__type: "PRODUCED", __target_id: 0, __source_id: 7}
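
Because each row is returned as a map in the value column, the result can be filtered and projected with regular Cypher. The following is a minimal sketch, assuming the test.parquet file exported above, that keeps only the rows carrying a born property (the Person nodes of the sample graph):

```cypher
CALL apoc.load.parquet('test.parquet')
YIELD value
// rows without a born key evaluate to null here and are filtered out
WITH value WHERE value.born IS NOT NULL
RETURN value.name AS name, value.born AS born
ORDER BY born
```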

Alternatively, we can re-import the test.parquet nodes and relationships by using:

CALL apoc.import.parquet('test.parquet')
Table 3. Results
file                                source  format     nodes  relationships  properties  time  rows  batchSize  batches  data
"file:///import/testQuery.parquet"  "file"  "parquet"  8      7              0           0     0     0          0        null

The above procedures can also load or import from a Parquet byte array produced by, for example, the apoc.export.parquet.all.stream procedure. The following queries produce the same results as the ones above:

Load procedure
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value WITH value AS bytes
// load the byte array
CALL apoc.load.parquet(bytes)
YIELD value RETURN value

Import procedure
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value WITH value AS bytes
// import the byte array
CALL apoc.import.parquet(bytes)
YIELD source RETURN source

Mapping config

To import complex types not supported by Parquet, such as Point, Duration, or List of Duration, we can use the mapping config to convert values to the desired data type. For example, if we have a node (:MyLabel {durationProp: duration('P5M1.5D')}) and export it to a Parquet file or binary, we can import it by passing a map whose keys are the property keys and whose values are the property types.

In this example, with the load procedure:

CALL apoc.load.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})

Or with the import procedure:

CALL apoc.import.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})

The mapping value types can be one of the following:

  • Point

  • LocalDateTime

  • LocalTime

  • DateTime

  • Time

  • Date

  • Duration

  • Char

  • Byte

  • Double

  • Float

  • Short

  • Int

  • Long

  • Node

  • Relationship

  • BaseType followed by Array, to map a list of values, where BaseType can be one of the previous types, for example DurationArray
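
For instance, several properties can be mapped in a single call, including a list-valued one. This is a sketch only: pointProp and durationListProp are hypothetical property keys, and fileOrBinary is the same placeholder as in the examples above, standing for the file name or the byte array.

```cypher
// pointProp and durationListProp are hypothetical property keys
CALL apoc.load.parquet(fileOrBinary, {
  mapping: {
    pointProp: 'Point',
    durationListProp: 'DurationArray'
  }
})
YIELD value
RETURN value
```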