Loading Data from Web-APIs

Supported protocols are file, http, https, s3, gs and hdfs, with redirects allowed.

If no protocol is provided, the procedures will try to check whether the URL is actually a file.

When apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed, and whether it is constrained to a specific directory, by reading the configuration parameters dbms.security.allow_csv_import_from_file_urls and server.directories.import respectively. To remove these constraints, set apoc.import.file.use_neo4j_config=false.

CALL apoc.load.xml('http://example.com/test.xml', ['xPath'], [config]) YIELD value as doc CREATE (p:Person) SET p.name = doc.name

Loads XML from a URL (e.g. a web API) and imports it as a single nested map with attributes and _type, _text and _children fields.

CALL apoc.load.csv('url',{sep:";"}) YIELD lineNo, list, strings, map, stringMap

Loads CSV from a URL as a stream of values.
config contains any of: {skip:1,limit:5,header:false,sep:'TAB',ignore:['aColumn'],arraySep:';',results:['map','list','strings','stringMap'],nullValues:[''],mapping:{years:{type:'int',arraySep:'-',array:false,name:'age',ignore:false,nullValues:['n.A.']}}}

CALL apoc.load.xls('url','Sheet'/'Sheet!A2:B5',{config}) YIELD lineNo, list, map

Loads XLS from a URL as a stream of values.
config contains any of: {skip:1,limit:5,header:false,ignore:['aColumn'],arraySep:';',nullValues:[''],mapping:{years:{type:'int',arraySep:'-',array:false,name:'age',ignore:false,nullValues:['n.A.']}}}

Load Single File From Compressed File (zip/tar/tar.gz/tgz)

When loading data from a compressed file, we need to put the ! character between the path of the compressed file and the name or path of the file inside it. For example:

Loading a compressed CSV file

apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")
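The ! convention can be pictured with a small Python sketch. This is only an illustration of how such a path splits into an archive part and an inner-file part; it is not APOC's implementation, and the helper names (split_archive_url, read_csv_from_zip, demo) are hypothetical. It handles zip archives only.

```python
import csv
import io
import os
import tempfile
import zipfile


def split_archive_url(url: str) -> tuple[str, str]:
    """Split '<archive>!<path inside archive>' at the first '!'."""
    archive, sep, inner = url.partition("!")
    if not sep or not inner:
        raise ValueError("expected '<archive>!<path inside archive>'")
    return archive, inner


def read_csv_from_zip(url: str) -> list[dict[str, str]]:
    """Read one CSV entry from a zip file addressed with the '!' separator."""
    archive, inner = split_archive_url(url)
    with zipfile.ZipFile(archive) as zf:
        with zf.open(inner) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def demo() -> list[dict[str, str]]:
    """Build a throwaway zip in a temp directory and read a CSV back out."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "file.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("pathToCsvFileInZip/fileName.csv", "name,age\nAlice,30\n")
        return read_csv_from_zip(archive + "!pathToCsvFileInZip/fileName.csv")
```

Calling demo() returns the rows of the inner CSV file as dictionaries keyed by the header line.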

Using S3 protocol

When using the S3 protocol we need to download and copy the following jars into the plugins directory:

aws-java-sdk-core-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core/1.12.136)
aws-java-sdk-s3-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.12.136)
httpclient-4.5.13.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13)
httpcore-4.4.15.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.15)
joda-time-2.10.13.jar (https://mvnrepository.com/artifact/joda-time/joda-time/2.10.13)

Once those files have been copied we’ll need to restart the database.

The S3 URL must be in the following format:

s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional) or
s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional) or
s3://endpoint:port/bucket/key if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables
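As a rough sketch, the three URL shapes above can be assembled with a small helper. The function name s3_url and its parameters are hypothetical and exist only for illustration. No percent-encoding is applied here, so treat the handling of keys that contain special characters as an open question not covered by this sketch.

```python
from typing import Optional


def s3_url(
    endpoint: str,
    bucket: str,
    key: str,
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    session_token: Optional[str] = None,
    style: str = "query",
) -> str:
    """Build an S3 URL in one of the three documented shapes.

    style="userinfo": credentials before the '@'
    style="query":    credentials as query parameters
    style="env":      no credentials in the URL (taken from environment variables)
    """
    path = f"{endpoint}/{bucket}/{key}"
    if style == "env":
        return f"s3://{path}"
    if access_key is None or secret_key is None:
        raise ValueError("access_key and secret_key are required for this style")
    if style == "userinfo":
        creds = f"{access_key}:{secret_key}"
        if session_token is not None:
            creds += f":{session_token}"  # the session token is optional
        return f"s3://{creds}@{path}"
    if style == "query":
        query = f"accessKey={access_key}&secretKey={secret_key}"
        if session_token is not None:
            query += f"&sessionToken={session_token}"  # the session token is optional
        return f"s3://{path}?{query}"
    raise ValueError(f"unknown style: {style}")
```

The endpoint argument is expected to carry the port, for example "127.0.0.1:9000".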

S3 protocol without SSL certificate checking

To run S3 on providers that do not need SSL certificates, such as a MinIO container created in the following way:

docker run -p 9000:9000 -p 9001:9001 \
  -e "MINIO_ROOT_USER=accessTestKey" \
  -e "MINIO_ROOT_PASSWORD=secretTestKey" \
  -e "MINIO_DEFAULT_BUCKETS=test" \
  bitnami/minio:2025.1.20

we need to add this setting to neo4j.conf:

server.jvm.additional=-Dcom.amazonaws.sdk.disableCertChecking=true

With this setting and the above MinIO container, a CSV file named foo.csv can be retrieved from the test bucket as follows:

CALL apoc.load.csv('s3://127.0.0.1:9000/test/foo.csv?accessKey=accessTestKey&secretKey=secretTestKey')

Using hdfs protocol

To use the hdfs protocol we need to download and copy the additional jars that are not included in the APOC Extended library. We can download them from this link, or build them locally by cloning the APOC repository:

git clone http://github.com/neo4j-contrib/neo4j-apoc-procedures
cd neo4j-apoc-procedures/extra-dependencies
./gradlew shadow

and a jar named apoc-hadoop-dependencies-5.26.1.jar will be created in the neo4j-apoc-procedures/extra-dependencies/hadoop/build/lib folder.

Once that file is downloaded/created, it should be placed in the plugins directory and the Neo4j Server restarted.

Using Google Cloud Storage

In order to use Google Cloud Storage, you need to add the following Google Cloud dependencies in the plugins directory:

api-common-1.8.1.jar
failureaccess-1.0.1.jar
gax-1.48.1.jar
gax-httpjson-0.65.1.jar
google-api-client-1.30.2.jar
google-api-services-storage-v1-rev20190624-1.30.1.jar
google-auth-library-credentials-0.17.1.jar
google-auth-library-oauth2-http-0.17.1.jar
google-cloud-core-1.90.0.jar
google-cloud-core-http-1.90.0.jar
google-cloud-storage-1.90.0.jar
google-http-client-1.31.0.jar
google-http-client-appengine-1.31.0.jar
google-http-client-jackson2-1.31.0.jar
google-oauth-client-1.30.1.jar
grpc-context-1.19.0.jar
guava-28.0-android.jar
opencensus-api-0.21.0.jar
opencensus-contrib-http-util-0.21.0.jar
proto-google-common-protos-1.16.0.jar
proto-google-iam-v1-0.12.0.jar
protobuf-java-3.9.1.jar
protobuf-java-util-3.9.1.jar
threetenbp-1.3.3.jar

We’ve prepared an uber-jar that contains the above dependencies in a single file in order to simplify the process. You can download it from here and copy it to your plugins directory.

You can use Google Cloud Storage via the following URL format:

gs://<bucket_name>/<file_path>

You can also specify the authentication type via an additional authenticationType query parameter:

NONE: for public buckets (this is the default behavior if the parameter is not specified)
GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
PRIVATE_KEY: for using private keys generated for service accounts (requires setting GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a private key json file as described here: https://cloud.google.com/docs/authentication#strategies)

Example:

gs://andrea-bucket-1/test-privato.csv?authenticationType=GCP_ENVIRONMENT
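A minimal sketch of how such a URL could be assembled and checked against the three values listed above. The helper name gs_url and the AUTH_TYPES constant are hypothetical, introduced only for this illustration.

```python
from typing import Optional

# The authenticationType values listed in this section.
AUTH_TYPES = ("NONE", "GCP_ENVIRONMENT", "PRIVATE_KEY")


def gs_url(bucket_name: str, file_path: str, authentication_type: Optional[str] = None) -> str:
    """Build gs://<bucket_name>/<file_path>, optionally with authenticationType."""
    url = f"gs://{bucket_name}/{file_path}"
    if authentication_type is None:
        # Parameter omitted: the section describes NONE as the default behavior.
        return url
    if authentication_type not in AUTH_TYPES:
        raise ValueError(f"authenticationType must be one of {AUTH_TYPES}")
    return f"{url}?authenticationType={authentication_type}"
```

For instance, gs_url("andrea-bucket-1", "test-privato.csv", "GCP_ENVIRONMENT") reproduces the example URL above.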

Fail on Error

Setting the config parameter failOnError:false (the default is true) means that, in the case of an error, the procedure will not fail but will return zero rows.