Load / Import Apache Parquet
Library Requirements
The Apache Parquet procedures depend on a client library that is not included in the APOC Extended library.
These dependencies are included in apoc-hadoop-dependencies-5.21.0-all.jar, which can be downloaded from the releases page.
Once downloaded, the file should be placed in the plugins directory and the Neo4j server restarted.
Available Procedures
The table below describes the available procedures:
Name | Description |
---|---|
apoc.load.parquet | Loads parquet from the provided Parquet file or binary |
apoc.import.parquet | Imports parquet from the provided Parquet file or binary |
As with similar procedures, apoc.load.parquet only retrieves the Parquet result, while apoc.import.parquet also creates nodes and relationships in the database.
These procedures are intended to be used together with the apoc.export.parquet.* procedures.
Configuration parameters
The procedures support the following config parameters:
name | type | default | description |
---|---|---|---|
batchSize | long | 20000 | the transaction batch size |
mapping | Map | {} | to map complex types. See the Mapping config section below. |
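As an illustrative sketch (the file name and batch size here are assumptions, not taken from the reference output), the batchSize config can be passed as the second argument to commit the import in smaller transactions:

```cypher
// import in transactions of 5000 rows instead of the default 20000
CALL apoc.import.parquet('test.parquet', {batchSize: 5000})
```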
Usage
Given the following sample graph:
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix);
If we create a test.parquet file via the procedure:

CALL apoc.export.parquet.all('test.parquet')

we can load the result by using:

CALL apoc.load.parquet('test.parquet')
value |
---|
{id: 0, tagline: "Welcome to the Real World", title: "The Matrix", released: 1999, labels: ["Movie"]} |
{id: 1, born: 1964, name: "Keanu Reeves", labels: ["Person"]} |
{id: 2, born: 1967, name: "Carrie-Anne Moss", labels: ["Person"]} |
{id: 3, born: 1961, name: "Laurence Fishburne", labels: ["Person"]} |
{id: 4, born: 1960, name: "Hugo Weaving", labels: ["Person"]} |
{id: 5, born: 1967, name: "Lilly Wachowski", labels: ["Person"]} |
{id: 6, born: 1965, name: "Lana Wachowski", labels: ["Person"]} |
{id: 7, born: 1952, name: "Joel Silver", labels: ["Person"]} |
{type: "ACTED_IN", roles: ["Neo"], target_id: 0, __source_id: 1} |
{type: "ACTED_IN", roles: ["Trinity"], target_id: 0, __source_id: 2} |
{type: "ACTED_IN", roles: ["Morpheus"], target_id: 0, __source_id: 3} |
{type: "ACTED_IN", roles: ["Agent Smith"], target_id: 0, __source_id: 4} |
{type: "DIRECTED", target_id: 0, __source_id: 5} |
{type: "DIRECTED", target_id: 0, __source_id: 6} |
{type: "PRODUCED", target_id: 0, __source_id: 7} |
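Since apoc.load.parquet only returns rows, any further processing is left to Cypher. As a sketch based on the output above (the filter and returned keys are assumptions), the Person rows could be extracted like this:

```cypher
// keep only the node rows labelled Person and project two properties
CALL apoc.load.parquet('test.parquet')
YIELD value
WITH value WHERE 'Person' IN value.labels
RETURN value.name AS name, value.born AS born
```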
Otherwise, we can re-import the test.parquet nodes and relationships by using:

CALL apoc.import.parquet('test.parquet')
file | source | format | nodes | relationships | properties | time | rows | batchSize | batches | data |
---|---|---|---|---|---|---|---|---|---|---|
"file:///import/testQuery.parquet" | "file" | "parquet" | 8 | 7 | 0 | 0 | 0 | 0 | 0 | null |
The above procedures can also load/import from a Parquet byte array produced by e.g. the apoc.export.parquet.all.stream procedure.
For example, the following queries will produce the same results as the above ones:
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value
WITH value AS bytes
// load the byte array
CALL apoc.load.parquet(bytes)
YIELD value
RETURN value
// create a byte array
CALL apoc.export.parquet.all.stream()
YIELD value
WITH value AS bytes
// import the byte array
CALL apoc.import.parquet(bytes)
YIELD source
RETURN source
Mapping config
To import complex types not supported by Parquet, such as Point, Duration, lists of Duration, and so on, we can use the mapping config to convert them to the desired data type.
For example, if we have a node (:MyLabel {durationProp: duration('P5M1.5D')}) and we export it to a Parquet file/binary, we can import it by specifying a map whose keys are the property keys and whose values are the property types.
In this example, using the load procedure:
CALL apoc.load.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})
Or with the import procedure:
CALL apoc.import.parquet(fileOrBinary, {mapping: {durationProp: 'Duration'}})
The mapping value types can be one of the following:
- Point
- LocalDateTime
- LocalTime
- DateTime
- Time
- Date
- Duration
- Char
- Byte
- Double
- Float
- Short
- Int
- Long
- Node
- Relationship
- BaseType followed by Array, to map a list of values, where BaseType can be one of the previous types, for example DurationArray
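For instance, a minimal sketch of the Array suffix (the property name durationListProp is hypothetical, and fileOrBinary stands for a file name or byte array as above):

```cypher
// a list of durations exported to Parquet loses its type,
// so it is restored on load via the DurationArray mapping
CALL apoc.load.parquet(fileOrBinary, {mapping: {durationListProp: 'DurationArray'}})
```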