Loading Data from Web-APIs
Supported protocols are file, http, https, s3, gs, and hdfs, with redirects allowed.
If no protocol is provided, the procedure will check whether the URL is actually a file.
As apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed, and whether it is constrained to a specific directory, by reading the configuration parameters dbms.security.allow_csv_import_from_file_urls and server.directories.import, respectively.
If you want to remove these constraints, set apoc.import.file.use_neo4j_config=false.
-
load from XML URL (e.g. web-api) to import XML as single nested map with attributes and [...]
-
load CSV from URL as stream of values
-
load XLS from URL as stream of values
Load Single File From Compressed File (zip/tar/tar.gz/tgz)
When loading data from compressed files, we need to put the ! character before the file name or path inside the compressed file.
For example:
apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")
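As a minimal sketch of how such a path is assembled, the following Python helper joins an archive location and the entry inside it with the ! separator. The function name compressed_entry_url and the constant ARCHIVE_SUFFIXES are hypothetical names introduced for illustration; only the ! convention and the list of archive types come from the text above.

```python
# Archive suffixes named in the section heading (zip/tar/tar.gz/tgz).
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")


def compressed_entry_url(archive: str, entry: str) -> str:
    """Join an archive path or URL and an entry inside it with '!'.

    Hypothetical helper: it only builds the string that would be
    passed to a load procedure such as apoc.load.csv.
    """
    if not archive.lower().endswith(ARCHIVE_SUFFIXES):
        raise ValueError(f"not a supported archive: {archive}")
    if not entry:
        raise ValueError("entry inside the archive must not be empty")
    return f"{archive}!{entry}"


# Reproduces the path used in the example above.
print(compressed_entry_url("pathToCompressedFile/file.zip",
                           "pathToCsvFileInZip/fileName.csv"))
```

The resulting string is what goes inside the quotes of the procedure call; the helper does not open or inspect the archive itself.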
Using S3 protocol
When using the S3 protocol we need to download and copy the following jars into the plugins directory:
-
aws-java-sdk-core-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core/1.12.136)
-
aws-java-sdk-s3-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.12.136)
-
httpclient-4.5.13.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13)
-
httpcore-4.4.15.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.15)
-
joda-time-2.10.13.jar (https://mvnrepository.com/artifact/joda-time/joda-time/2.10.13)
Once those files have been copied, we’ll need to restart the database.
The S3 URL must be in one of the following formats:
-
s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key
(where the sessionToken is optional), or
-
s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken]
(where the sessionToken is optional), or
-
s3://endpoint:port/bucket/key
if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables.
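The three S3 URL formats can be sketched as a small Python builder. The function s3_url and its parameter names are hypothetical and not part of APOC. The percent-encoding of credentials is an assumption: it keeps characters such as / or + from breaking the URL, but whether APOC decodes them should be verified against your version.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def s3_url(endpoint: str, port: int, bucket: str, key: str,
           access_key: Optional[str] = None,
           secret_key: Optional[str] = None,
           session_token: Optional[str] = None,
           in_query: bool = False) -> str:
    """Build an S3 URL in one of the three documented formats (hypothetical helper)."""
    location = f"{endpoint}:{port}/{bucket}/{key}"

    if access_key is None or secret_key is None:
        # Credentials are expected to come from environment variables.
        return f"s3://{location}"

    if in_query:
        # Credentials passed as query parameters.
        params = {"accessKey": access_key, "secretKey": secret_key}
        if session_token is not None:
            params["sessionToken"] = session_token
        # Assumption: percent-encoded values are accepted by the server side.
        return f"s3://{location}?{urlencode(params)}"

    # Credentials passed in the user-info part, separated by ':'.
    parts = [access_key, secret_key]
    if session_token is not None:
        parts.append(session_token)
    userinfo = ":".join(quote(p, safe="") for p in parts)
    return f"s3://{userinfo}@{location}"


print(s3_url("localhost", 9000, "bucket", "file.csv", "AK", "SK"))
print(s3_url("localhost", 9000, "bucket", "file.csv", "AK", "SK", in_query=True))
print(s3_url("localhost", 9000, "bucket", "file.csv"))
```

Embedding credentials in a URL can expose them in query logs, so the environment-variable form is usually the safer choice where it is available.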
Using hdfs protocol
To use the hdfs protocol, we need to download and copy the additional jars that are not included in the APOC Extended library. We can download the jar from this link or build it locally by cloning the apoc repository:
git clone http://github.com/neo4j-contrib/neo4j-apoc-procedures
cd neo4j-apoc-procedures/extra-dependencies
./gradlew shadow
A jar named apoc-hadoop-dependencies-5.26.0.jar will be created in the neo4j-apoc-procedures/extra-dependencies/hadoop/build/lib folder.
Once that file is downloaded or created, it should be placed in the plugins directory and the Neo4j server restarted.
Using Google Cloud Storage
To use Google Cloud Storage, you need to add the following Google Cloud dependencies to the plugins directory:
-
api-common-1.8.1.jar
-
failureaccess-1.0.1.jar
-
gax-1.48.1.jar
-
gax-httpjson-0.65.1.jar
-
google-api-client-1.30.2.jar
-
google-api-services-storage-v1-rev20190624-1.30.1.jar
-
google-auth-library-credentials-0.17.1.jar
-
google-auth-library-oauth2-http-0.17.1.jar
-
google-cloud-core-1.90.0.jar
-
google-cloud-core-http-1.90.0.jar
-
google-cloud-storage-1.90.0.jar
-
google-http-client-1.31.0.jar
-
google-http-client-appengine-1.31.0.jar
-
google-http-client-jackson2-1.31.0.jar
-
google-oauth-client-1.30.1.jar
-
grpc-context-1.19.0.jar
-
guava-28.0-android.jar
-
opencensus-api-0.21.0.jar
-
opencensus-contrib-http-util-0.21.0.jar
-
proto-google-common-protos-1.16.0.jar
-
proto-google-iam-v1-0.12.0.jar
-
protobuf-java-3.9.1.jar
-
protobuf-java-util-3.9.1.jar
-
threetenbp-1.3.3.jar
We’ve prepared an uber-jar that contains the above dependencies in a single file in order to simplify the process. You can download it from here and copy it to your plugins directory.
You can use Google Cloud Storage via the following URL format:
gs://<bucket_name>/<file_path>
Moreover, you can also specify the authorization type via an additional authenticationType query parameter:
-
NONE: for public buckets (this is the default behavior if the parameter is not specified)
-
GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
-
PRIVATE_KEY: for using private keys generated for service accounts (requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a private key json file as described here: https://cloud.google.com/docs/authentication#strategies)
Example:
gs://andrea-bucket-1/test-privato.csv?authenticationType=GCP_ENVIRONMENT
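A URL of this shape can be assembled with a short Python sketch. The helper gs_url and the constant AUTH_TYPES are hypothetical names introduced here; the three accepted values and the authenticationType parameter name are taken from the list above.

```python
from urllib.parse import urlencode

# Values of the authenticationType query parameter listed above.
AUTH_TYPES = ("NONE", "GCP_ENVIRONMENT", "PRIVATE_KEY")


def gs_url(bucket: str, file_path: str, authentication_type: str = "NONE") -> str:
    """Build a gs:// URL with an optional authenticationType parameter (hypothetical helper)."""
    if authentication_type not in AUTH_TYPES:
        raise ValueError(f"unknown authenticationType: {authentication_type}")
    url = f"gs://{bucket}/{file_path.lstrip('/')}"
    if authentication_type == "NONE":
        # NONE is the default, so the parameter can be left out.
        return url
    return f"{url}?{urlencode({'authenticationType': authentication_type})}"


# Reproduces the example URL above.
print(gs_url("andrea-bucket-1", "test-privato.csv", "GCP_ENVIRONMENT"))
```

For PRIVATE_KEY, the GOOGLE_APPLICATION_CREDENTIALS variable has to be set in the environment of the Neo4j process, not of the client that builds the URL.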