Graph construct: Import from Pandas
This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.
The notebook shows the usage of the gds.graph.construct method (available only in GDS 2.1+) to build a graph directly in memory.
If you are using AuraDS, it is currently not possible to write the projected graph back to Neo4j.
1. Setup
We need an environment where Neo4j and GDS are available, for example AuraDS (which comes with GDS preinstalled) or Neo4j Desktop.
Once the credentials to this environment are available, we can install the graphdatascience package and import the client class.
%pip install graphdatascience
import os
from graphdatascience import GraphDataScience
When using a local Neo4j setup, the default connection URI is bolt://localhost:7687:
# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
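The credential fallback above can be isolated in a small helper, which makes it easier to see when the client would connect without authentication. This is only a minimal sketch: auth_from_env is a hypothetical name, not part of the graphdatascience package, and it takes the environment as a plain mapping so it can be tried without a running database.

```python
import os
from typing import Mapping, Optional, Tuple


def auth_from_env(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (user, password) only if both variables are set and non-empty.

    Hypothetical helper for illustration; it mirrors the check used above.
    """
    user = env.get("NEO4J_USER")
    password = env.get("NEO4J_PASSWORD")
    if user and password:
        return (user, password)
    return None


# With both variables set, a credentials tuple is returned.
full_auth = auth_from_env({"NEO4J_USER": "neo4j", "NEO4J_PASSWORD": "secret"})

# With only one of them set, the helper falls back to no authentication.
partial_auth = auth_from_env({"NEO4J_USER": "neo4j"})

# In the notebook, the real environment would be passed instead:
# NEO4J_AUTH = auth_from_env(os.environ)
```

Passing the mapping explicitly keeps the helper free of side effects, so the same function works with os.environ or with a dictionary of test values.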
When using AuraDS, the connection URI is slightly different as it uses the neo4j+s protocol. The client should also include the aura_ds=True flag to enable AuraDS-recommended settings.
# On AuraDS:
#
# gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB, aura_ds=True)
from graphdatascience import ServerVersion
assert gds.server_version() >= ServerVersion(2, 1, 0)
We also import pandas to create a Pandas DataFrame from the original data source.
import pandas as pd
2. Load the Cora dataset
CORA_CONTENT = "https://data.neo4j.com/cora/cora.content"
CORA_CITES = "https://data.neo4j.com/cora/cora.cites"
We can load each CSV locally as a Pandas DataFrame.
content = pd.read_csv(CORA_CONTENT, header=None)
cites = pd.read_csv(CORA_CITES, header=None)
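To see what header=None does without downloading anything, the following sketch reads a tiny in-memory CSV. The column layout (paper id, subject, then binary feature columns) is an assumption inferred from how the columns are used later in this notebook, and all values are made up.

```python
import io

import pandas as pd

# Toy stand-in for cora.content: paper id, subject, then binary word features.
# Layout is assumed from the later steps of this notebook; values are made up.
TOY_CONTENT_CSV = "101,Neural_Networks,0,1,0\n102,Rule_Learning,1,0,0\n"

toy_content = pd.read_csv(io.StringIO(TOY_CONTENT_CSV), header=None)

# With header=None, pandas labels the columns 0, 1, 2, ... instead of
# consuming the first row as a header.
print(toy_content.columns.tolist())
print(toy_content.shape)
```

Because the columns are labelled with integers, later steps can address them positionally, for example toy_content[0] for the ids.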
We need to perform an additional preprocessing step to convert the subject field (which is a string in the dataset) into an integer, because node properties have to be numerical in order to be projected into a graph. We can use a map for this.
SUBJECT_TO_ID = {
    "Neural_Networks": 0,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}
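Since a subject that is missing from the mapping would stay non-numerical, it can be worth checking coverage before building the node table. The sketch below uses Series.map, which yields NaN for values without an entry; the toy_subjects data is made up, and only a subset of the mapping is repeated to keep the example self-contained.

```python
import pandas as pd

# Subset of the SUBJECT_TO_ID mapping, repeated to keep the example standalone.
SUBJECT_SUBSET_TO_ID = {"Neural_Networks": 0, "Rule_Learning": 1, "Theory": 4}

# Made-up subject column; "Robotics" is deliberately absent from the mapping.
toy_subjects = pd.Series(["Theory", "Neural_Networks", "Robotics"])

# Series.map returns NaN for values that have no entry in the mapping.
encoded = toy_subjects.map(SUBJECT_SUBSET_TO_ID)
unmapped = sorted(toy_subjects[encoded.isna()].unique())
print(unmapped)
```

An empty unmapped list would indicate that every subject in the column has an integer id.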
We can now create a new DataFrame with a nodeId field, a list of node labels, and the additional node properties subject (using the SUBJECT_TO_ID mapping) and features (converting all the feature columns to a single array column).
nodes = pd.DataFrame().assign(
    nodeId=content[0],
    labels="Paper",
    subject=content[1].replace(SUBJECT_TO_ID),
    features=content.iloc[:, 2:].apply(list, axis=1),
)
Let’s check the first 5 rows of the new DataFrame:
nodes.head()
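The same construction can be tried on a two-row toy table to see what each column ends up holding. This is a sketch on made-up data in the assumed layout (id, subject, feature columns); it uses Series.map in place of replace for the subject column, which gives the same result when every subject is present in the mapping.

```python
import pandas as pd

# Subset of the subject mapping, enough for the toy rows below.
TOY_SUBJECT_TO_ID = {"Neural_Networks": 0, "Rule_Learning": 1}

# Made-up rows in the assumed layout: id, subject, then feature columns.
toy_content = pd.DataFrame(
    [
        [101, "Neural_Networks", 0, 1, 0],
        [102, "Rule_Learning", 1, 0, 0],
    ]
)

toy_nodes = pd.DataFrame().assign(
    nodeId=toy_content[0],
    labels="Paper",  # scalar value, broadcast to every row
    subject=toy_content[1].map(TOY_SUBJECT_TO_ID),
    # Collapse all feature columns of a row into a single Python list.
    features=toy_content.iloc[:, 2:].apply(list, axis=1),
)

print(toy_nodes)
```

Each row of the features column holds one list, so a paper with three feature columns ends up with a three-element array property.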
Now we create a new DataFrame containing the relationships between the nodes. To create the equivalent of an undirected graph, we need to add direct and inverse relationships explicitly.
dir_relationships = pd.DataFrame().assign(sourceNodeId=cites[0], targetNodeId=cites[1], relationshipType="CITES")
inv_relationships = pd.DataFrame().assign(sourceNodeId=cites[1], targetNodeId=cites[0], relationshipType="CITES")
relationships = pd.concat([dir_relationships, inv_relationships]).drop_duplicates()
Again, let’s check the first 5 rows of the new DataFrame:
relationships.head()
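The role of drop_duplicates becomes clearer on a small example where two papers already cite each other. The sketch below repeats the direct/inverse construction on made-up citation pairs; the reset_index call is an optional addition that is not part of the code above.

```python
import pandas as pd

# Made-up citation pairs; (1, 2) and (2, 1) are already mutual citations.
toy_cites = pd.DataFrame([[1, 2], [2, 1], [1, 3]])

toy_dir = pd.DataFrame().assign(
    sourceNodeId=toy_cites[0], targetNodeId=toy_cites[1], relationshipType="CITES"
)
toy_inv = pd.DataFrame().assign(
    sourceNodeId=toy_cites[1], targetNodeId=toy_cites[0], relationshipType="CITES"
)

# Concatenation yields six rows; the mutual pair appears twice per direction,
# so drop_duplicates reduces them to four distinct directed relationships.
# reset_index replaces the repeated index labels left behind by concat.
toy_relationships = (
    pd.concat([toy_dir, toy_inv]).drop_duplicates().reset_index(drop=True)
)

edge_set = set(
    zip(toy_relationships["sourceNodeId"], toy_relationships["targetNodeId"])
)
print(len(toy_relationships), sorted(edge_set))
```

Every pair in edge_set has its reverse in the set as well, which is what makes the directed relationships behave like an undirected graph.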
Finally, we can create the in-memory graph.
G = gds.graph.construct("cora-graph", nodes, relationships)
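Before sending the DataFrames to the server, a quick consistency check can help catch relationships that point at unknown nodes. The following is a minimal sketch on made-up data; missing_endpoints is a hypothetical helper and not part of the graphdatascience client.

```python
import pandas as pd


def missing_endpoints(nodes: pd.DataFrame, relationships: pd.DataFrame) -> set:
    """Return relationship endpoint ids that have no row in `nodes`.

    Hypothetical helper for illustration only.
    """
    known = set(nodes["nodeId"])
    used = set(relationships["sourceNodeId"]) | set(relationships["targetNodeId"])
    return used - known


# Made-up node and relationship tables; node 4 is referenced but never defined.
toy_nodes = pd.DataFrame({"nodeId": [1, 2, 3], "labels": "Paper"})
toy_relationships = pd.DataFrame(
    {"sourceNodeId": [1, 2], "targetNodeId": [2, 4], "relationshipType": "CITES"}
)

dangling = missing_endpoints(toy_nodes, toy_relationships)
print(dangling)
```

An empty set means every source and target id also appears in the nodeId column.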
3. Use the graph
Let’s check that the new graph has been created:
gds.graph.list()
Let’s also count the nodes in the graph:
G.node_count()
The count matches the number of rows in the Pandas dataset:
len(content)
We can stream the value of the subject node property for each node in the graph, printing only the first 10.
gds.graph.nodeProperties.stream(G, ["subject"]).head(10)
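The streamed values are the integer ids, so turning them back into subject names requires the inverse of SUBJECT_TO_ID. The sketch below works on a made-up stand-in for the streamed result; the column names nodeId, nodeProperty and propertyValue are an assumption about the output shape and should be checked against the actual DataFrame.

```python
import pandas as pd

SUBJECT_TO_ID = {
    "Neural_Networks": 0,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}

# Invert the mapping: integer id -> subject name.
ID_TO_SUBJECT = {subject_id: name for name, subject_id in SUBJECT_TO_ID.items()}

# Made-up stand-in for the streamed result; column names are assumed.
toy_stream = pd.DataFrame(
    {"nodeId": [101, 102], "nodeProperty": "subject", "propertyValue": [0, 4]}
)

# Add a readable column next to the numeric property value.
toy_stream["subjectName"] = toy_stream["propertyValue"].map(ID_TO_SUBJECT)
print(toy_stream)
```

The inversion is safe here because each subject has a distinct id, so no two names collapse onto the same key.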