Load CSV
If we have large files locally or on http(s)/s3/gcp storage that you want to import into neo4j, we can use this procedure that can provide a URL with an offset that can just read a string from a file (e.g. CSV). We can also set an optional limit as the 3rd parameter, otherwise (with 0) it read until the end.
apoc.load.stringPartial(urlOrBinary :: ANY?, offset :: LONG, limit :: LONG = 0, config = {} :: MAP?) :: (value :: STRING?)
For reading from files you’ll have to enable the config option:
apoc.import.file.enabled=true
The procedure support the following config parameters:
name | type | default | description |
---|---|---|---|
headers |
Map<String, Object> |
Empty map |
Additional headers to be added or replaced to the default |
archiveLimit |
int |
1024*1024*10 (10MB) |
Size limit to locate ZIP entries and buffers |
bufferLimit |
int |
1024*1024*10 (10MB) |
Buffer read limit |
compression |
Enum[NONE, GZIP, BZIP2, DEFLATE, BLOCK_LZ4, FRAMED_SNAPPY] |
NONE |
Set the compression algorithm used, in case of a byte array reading. |
Usage examples
We can read a portion of a string from a local file URL, a remote URL (i.e. http(s)/gcp/S3/Azure/Hdfs), a local/remote file placed in an archive, or a byte array. It’s useful compared to other load procedures since the access into the file is handled more efficiently, not an openStream and a read to location.
That is:
-
in case of a local file under-the-hood an RandomAccessFile will be created.
-
in case of http(s) URL we will put an additional HTTP header
Range: bytes=<offset>
, while in case of limit set it will beRange: bytes=<offset>-<httpLimit>
, where httpLimit is equal tooffset + limit - 1
. -
in case of S3 location, an GetObjectRequest.range() will be used.
-
in the other cases, we will execute an InputStream.skip()
If we have the following CSV file
name,age,beverage Selma,9,Soda Rana,12,Tea;Milk Selina,19,Cola
We can execute:
CALL apoc.load.stringPartial("path/to/localfile/test.csv", 17, 15)
value |
---|
Rana,11 Selina, |
Or also, without limit set:
CALL apoc.load.stringPartial("path/to/localfile/test.csv", 17)
value |
---|
Rana,11 Selina,18 |
We can read in the same way and with a similar result from a remote URL, for example:
CALL apoc.load.stringPartial("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/refs/heads/dev/extended/src/test/resources/test.csv", 17)
We can also read from an archive file, using the syntax <pathToArchive>!<fileToRead>
, for example:
CALL apoc.load.stringPartial("https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip!Data8277.csv", 17)
Or also from a byte array, optionally setting the compression type (default 'NONE', that is not compressed), for example using the apoc.util.compress
(placed in APOC Core):
WITH apoc.util.compress('testFooBar', {compression: 'DEFLATE'}) AS compressed
CALL apoc.load.stringPartial(compressed, 5, 17, {compression: 'DEFLATE'}) YIELD value RETURN value
value |
---|
testFooBar |