Loading and streaming back data with Apache Arrow
Follow along with a notebook in Google Colab |
The Enterprise Edition of GDS installed on AuraDS includes an Arrow Flight server, configured and running by default. The Arrow Flight server speeds up data-intensive processes such as:
-
Creating a graph directly from in-memory data.
-
Streaming node and relationship properties.
-
Streaming the relationship topology of a graph.
There are two ways to use the Arrow Flight server with GDS:
-
By using the GDS Python client, which includes an Arrow Flight client.
-
By implementing a custom Arrow Flight client as explained in the GDS manual.
In the following examples we use the GDS client as it is the most convenient option. All the loading and streaming methods can be used without Arrow, but are more efficient if Arrow is available.
Setup
%pip install 'graphdatascience>=1.7'
from graphdatascience import GraphDataScience
# Replace with the actual connection URI and credentials
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""
# When initialized, the client tries to use Arrow if it is available on the server.
# This behaviour is controlled by the `arrow` parameter, which is set to `True` by default.
gds = GraphDataScience(AURA_CONNECTION_URI, auth=(AURA_USERNAME, AURA_PASSWORD), aura_ds=True)
# Necessary if Arrow is enabled (as is by default on Aura)
gds.set_database("neo4j")
You can call the gds.debug.arrow()
method to verify that Arrow is enabled and running:
gds.debug.arrow()
Loading data
You can load data directly into a graph using the gds.graph.construct
client method.
The data must be a Pandas DataFrame
, so we need to install and import the pandas
library.
%pip install pandas
import pandas as pd
We can then create a graph as in the following example.
The format of each DataFrame
with the required columns is specified in the GDS manual.
nodes = pd.DataFrame(
{
"nodeId": [0, 1, 2],
"labels": ["Article", "Article", "Article"],
"pages": [3, 7, 12],
}
)
relationships = pd.DataFrame(
{
"sourceNodeId": [0, 1],
"targetNodeId": [1, 2],
"relationshipType": ["CITES", "CITES"],
"times": [2, 1]
}
)
article_graph = gds.graph.construct(
"article-graph",
nodes,
relationships
)
Now we can check that the graph has been created:
gds.graph.list()
Streaming node and relationship properties
After creating the graph, you can read the node and relationship properties as streams.
# Read all the values for the node property `pages`
gds.graph.nodeProperties.stream(article_graph, "pages")
# Read all the values for the relationship property `times`
gds.graph.relationshipProperties.stream(article_graph, "times")
Performance
To see the difference in performance when Arrow is available, we can measure the time needed to load a dataset into a graph.
In this example we use a built-in OGBN dataset, so we need to install the ogb
extra.
%pip install 'graphdatascience[ogb]>=1.7'
# Load and immediately drop the dataset to download and cache the data
ogbn_arxiv = gds.graph.ogbn.load("ogbn-arxiv")
ogbn_arxiv.drop()
We can then time the loading process. On an 8 GB AuraDS instance, this should take less than 30 s.
%%timeit -n 1 -r 1
# This call uses the cached dataset, so only the actual loading is timed
ogbn_arxiv = gds.graph.ogbn.load("ogbn-arxiv")
With Arrow disabled by adding arrow=False
to the GraphDataScience
constructor, the same loading process would take more than 1 minute.
Therefore, with this dataset, Arrow provides at least a 2x speedup.