Load Data from Web-APIs

Supported protocols are file, http, https, s3, gs, hdfs with redirect allowed.

If no procedure is provided, this procedure will try to check whether the URL is actually a file.

As apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed and possibly constrained to a specific directory by reading the two configuration parameters dbms.security.allow_csv_import_from_file_urls and dbms.directories.import respectively. If you want to remove these constraints please set apoc.import.file.use_neo4j_config=false

Procedure Description

Procedure	Description
`CALL apoc.load.json('http://example.com/map.json', [path], [config]) YIELD value as person`	load JSON from URL
`CALL apoc.load.xml('http://example.com/test.xml', ['xPath'], [config]) YIELD value as doc`	load XML from URL
`CALL apoc.load.csv('url',{sep:";"}) YIELD lineNo, list, strings, map, stringMap`	load CSV fom URL
`CALL apoc.load.xls('url','Sheet'/'Sheet!A2:B5',{config}) YIELD lineNo, list, map`	load XLS fom URL

CALL apoc.load.json('http://example.com/map.json', [path], [config]) YIELD value as person

load JSON from URL

CALL apoc.load.xml('http://example.com/test.xml', ['xPath'], [config]) YIELD value as doc

load XML from URL

CALL apoc.load.csv('url',{sep:";"}) YIELD lineNo, list, strings, map, stringMap

load CSV fom URL

CALL apoc.load.xls('url','Sheet'/'Sheet!A2:B5',{config}) YIELD lineNo, list, map

load XLS fom URL

Adding failOnError:false (by default true) to the config map when using any of the procedures in the above table will make them not fail in case of an error. The procedure will instead return zero rows. For example:

CALL apoc.load.json('http://example.com/test.json', null, {failOnError:false})

Load from Compressed File (zip/tar/tar.gz/tgz)

When loading a file that has been compressed, the compression algorithm has to be provided in the configuration options. For example, in the following case, if xmlCompressed was a .gzip extension file, the configuration options {compression: 'GZIP'} need to be supplied to the procedure call to load the root of the document / into a Cypher map in memory:

CALL apoc.load.xml(xmlCompressed, '/', {compression: 'GZIP'})

For other valid compression configuration values, refer to the documentation of apoc.load.xml.

By default, the size of a decompressed file is limited to 200 times its compressed size. That number can be changed by adjusting the configuration option apoc.max.decompression.ratio in the apoc.conf (it cannot be 0 as that would make decompression impossible). If a negative number is given, there is no limit to how big a decompressed size can be. This exposes the database to potential zip bomb attacks.

Trying to load an uncompressed file that exceeds the relative ratio with respect to the original compressed file will generate the following message:

The file dimension exceeded maximum size in bytes, 250000,
which is 250 times the width of the original file.
The InputStream has been blocked because the file could be a compression bomb attack.

Load Single File From Compressed File (zip/tar/tar.gz/tgz)

When loading data from compressed files, we need to put the ! character before the file name or path in the compressed file. For example:

Loading a compressed CSV file

CALL apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")

Loading a compressed JSON file

CALL apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.4/core/src/test/resources/testload.tgz?raw=true!person.json");

Using S3, GCS or HDFS protocols

To use any of these protocols, additional extra dependency jars need to be downloaded and copied into the plugins directory <NEO4J_HOME>/plugins, respectively:

Protocol

Needed extra dependency

apoc-aws-dependencies-4.4.0.38.jar

GCS

apoc-gcs-dependencies-4.4.0.38.jar

HDFS

apoc-hadoop-dependencies-4.4.0.38.jar

After copying the jars into the plugins directory, the database will need to be restarted.

Using S3 protocol

The S3 URL must be in the following format:

s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional) or
s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional) or
s3://endpoint:port/bucket/key if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables

Using Google Cloud Storage

Google Cloud Storage urls have the following shape:

gs://<bucket_name>/<file_path>

The authorization type can be specified by an additional authenticationType query parameter:

NONE: for public buckets (this is the default behavior if the parameter is not specified)
GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
PRIVATE_KEY: for using private keys generated for service accounts (requires setting GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a private key JSON file as described by the official Google documentation.)

Example:

gs://bucket/test-file.csv?authenticationType=GCP_ENVIRONMENT