This section details the graph catalog operations available to manage named graph projections within the Neo4j Graph Data Science library.
Graph algorithms run on a graph data model which is a projection of the Neo4j property graph data model. A graph projection can be seen as a view over the stored graph, containing only analytically relevant, potentially aggregated, topological and property information. Graph projections are stored entirely in-memory using compressed data structures optimized for topology and property lookup operations.
The graph catalog is a concept within the GDS library that allows managing multiple graph projections by name. Using its name, a created graph can be used many times in the analytical workflow. Named graphs can be created using either a Native projection or a Cypher projection. After usage, named graphs can be removed from the catalog to free up main memory.
Graphs can also be created when running an algorithm without placing them in the catalog. We refer to such graphs as anonymous graphs.
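For illustration, an algorithm can be run on an anonymous graph by passing the projection directly in the algorithm configuration instead of a graph name. The following is a minimal sketch, assuming the same Person nodes and LIKES relationships used in the examples below:
CALL gds.pageRank.stream({
  nodeProjection: 'Person',
  relationshipProjection: 'LIKES'
})
YIELD nodeId, score;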
The graph catalog exists as long as the Neo4j instance is running. When Neo4j is restarted, graphs stored in the catalog are lost and need to be re-created.
This chapter explains the available graph catalog operations.
Name | Description
---|---
gds.graph.create | Creates a graph in the catalog using a Native projection.
gds.graph.create.cypher | Creates a graph in the catalog using a Cypher projection.
gds.graph.list | Prints information about graphs that are currently stored in the catalog.
gds.graph.exists | Checks if a named graph is stored in the catalog.
gds.graph.removeNodeProperties | Removes node properties from a named graph.
gds.graph.deleteRelationships | Deletes relationships of a given relationship type from a named graph.
gds.graph.drop | Drops a named graph from the catalog.
gds.graph.streamNodeProperty | Streams a single node property stored in a named graph.
gds.graph.streamNodeProperties | Streams node properties stored in a named graph.
gds.graph.streamRelationshipProperty | Streams a single relationship property stored in a named graph.
gds.graph.streamRelationshipProperties | Streams relationship properties stored in a named graph.
gds.graph.writeNodeProperties | Writes node properties stored in a named graph to Neo4j.
gds.graph.writeRelationship | Writes relationships stored in a named graph to Neo4j.
gds.graph.export | Exports a named graph into a new offline Neo4j database.
gds.beta.graph.export.csv | Exports a named graph into CSV files.
Creating, using, listing, and dropping named graphs are management operations bound to a Neo4j user. Graphs created by one user are not accessible to other users.
A projected graph can be stored in the catalog under a user-defined name. Using that name, the graph can be referred to by any algorithm in the library. This allows multiple algorithms to use the same graph without having to re-create it on each algorithm run.
There are two variants of projecting a graph from the Neo4j database into main memory: Native projections and Cypher projections.
There is also a way to generate a random graph; see the Graph Generation documentation for more details.
In this section, we will give brief examples of how to create a graph using either variant. For detailed information about the configuration of each variant, we refer to the dedicated sections.
In the following two examples we show how to create a graph that contains Person nodes and LIKES relationships, using either a native projection (my-native-graph) or a Cypher projection (my-cypher-graph).
Create a graph using a native projection:
CALL gds.graph.create(
'my-native-graph',
'Person',
'LIKES'
)
YIELD graphName, nodeCount, relationshipCount, createMillis;
We can also use Cypher to select the nodes and relationships to be projected into the in-memory graph.
Create a graph using a Cypher projection:
CALL gds.graph.create.cypher(
'my-cypher-graph',
'MATCH (n:Person) RETURN id(n) AS id',
'MATCH (a:Person)-[:LIKES]->(b:Person) RETURN id(a) AS source, id(b) AS target'
)
YIELD graphName, nodeCount, relationshipCount, createMillis;
After creating the graphs in the catalog, we can refer to them in algorithms by using their name.
Run Page Rank on one of our created graphs:
CALL gds.pageRank.stream('my-native-graph') YIELD nodeId, score;
Information about graphs in the catalog can be listed using the gds.graph.list() procedure. The procedure takes an optional parameter graphName:
List information about graphs in the catalog:
CALL gds.graph.list(
graphName: String?
) YIELD
graphName,
database,
nodeProjection,
relationshipProjection,
nodeQuery,
relationshipQuery,
nodeCount,
relationshipCount,
schema,
degreeDistribution,
density,
creationTime,
modificationTime,
sizeInBytes,
memoryUsage;
Name | Type | Description
---|---|---
graphName | String | Name of the graph.
database | String | Name of the database in which the graph has been created.
nodeProjection | Map | Node projection used to create the graph. If a Cypher projection was used, this will be a derived node projection.
relationshipProjection | Map | Relationship projection used to create the graph. If a Cypher projection was used, this will be a derived relationship projection.
nodeQuery | String | Node query used to create the graph. If a native projection was used, this will be null.
relationshipQuery | String | Relationship query used to create the graph. If a native projection was used, this will be null.
nodeCount | Integer | Number of nodes in the graph.
relationshipCount | Integer | Number of relationships in the graph.
schema | Map | Node labels, relationship types and properties contained in the in-memory graph.
degreeDistribution | Map | Histogram of degrees in the graph.
density | Float | Density of the graph.
creationTime | Datetime | Time when the graph was created.
modificationTime | Datetime | Time when the graph was last modified.
sizeInBytes | Integer | Number of bytes used in the Java heap to store the graph.
memoryUsage | String | Human readable description of sizeInBytes.
The information contains basic statistics about the graph, e.g., the node and relationship count.
The result field creationTime indicates when the graph was created in memory. The result field modificationTime indicates when the graph was updated by an algorithm running in mutate mode. The database column refers to the name of the database the corresponding graph has been created on. Referring to a named graph in a procedure is only allowed on the database it has been created on.
The schema consists of information about the nodes and relationships stored in the graph. For each node label, the schema maps the label to its property keys and their corresponding property types. Similarly, the schema maps the relationship types to their property keys and property types. The property type is either Integer, Float, List of Integer or List of Float.
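For example, a hypothetical graph containing Person nodes with a pageRank property and LIKES relationships without properties might report a schema along the following lines (the exact rendering may differ):
{
  nodes: {Person: {pageRank: 'Float'}},
  relationships: {LIKES: {}}
}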
The degreeDistribution field can be fairly time-consuming to compute for larger graphs. Its computation is cached per graph, so subsequent listings for the same graph will be fast. To avoid computing the degree distribution, specify a YIELD clause that omits it. Note that not specifying a YIELD clause is the same as requesting all possible return fields to be returned.
The density is the result of relationshipCount divided by the maximal number of relationships for a simple graph with the given nodeCount.
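For a directed simple graph, which is the model assumed here, that maximum is nodeCount * (nodeCount - 1), so density = relationshipCount / (nodeCount * (nodeCount - 1)). For example, a graph with 100 nodes and 495 relationships has a density of 495 / 9900 = 0.05.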
List basic information about all graphs in the catalog:
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount, schema;
List extended information about a specific named graph in the catalog:
CALL gds.graph.list('my-cypher-graph')
YIELD graphName, nodeQuery, relationshipQuery, nodeCount, relationshipCount, schema, creationTime, modificationTime, memoryUsage;
List all information about a specific named graph in the catalog:
CALL gds.graph.list('my-native-graph')
List information about the degree distribution of a specific graph:
CALL gds.graph.list('my-cypher-graph')
YIELD graphName, degreeDistribution;
We can check if a graph is stored in the catalog by looking up its name.
Check if a graph exists in the catalog:
CALL gds.graph.exists('my-store-graph') YIELD exists;
We can remove node properties from a named graph in the catalog. This is useful to free up main memory or to remove accidentally created node properties.
Remove multiple node properties from a named graph:
CALL gds.graph.removeNodeProperties('my-graph', ['pageRank', 'communityId'])
The above example requires all given properties to be present on at least one node projection, and the properties will be removed from all such projections.
The procedure can be configured to remove just the properties for some specific node projections. In the following example, we run an algorithm on a sub-graph and subsequently remove the newly created property.
Remove node properties of a specific node projection:
CALL gds.graph.create('my-graph', ['A', 'B'], '*')
CALL gds.wcc.mutate('my-graph', {nodeLabels: ['A'], mutateProperty: 'componentId'})
CALL gds.graph.removeNodeProperties('my-graph', ['componentId'], ['A'])
When a list of specific projections (i.e., not '*') is given, as in the example above, a different validation and execution is applied: all of the given projections are required to have all of the given properties, and the properties will be removed from all of those projections. If any of the given projections is '*', the procedure behaves as in the first example.
We can delete all relationships of a given type from a named graph in the catalog. This is useful to free up main memory or to remove accidentally created relationship types.
Delete all relationships of type T from a named graph:
CALL gds.graph.deleteRelationships('my-graph', 'T')
YIELD graphName, relationshipType, deletedRelationships, deletedProperties
Once we have finished using the named graph we can remove it from the catalog to free up memory.
Remove a graph from the catalog:
CALL gds.graph.drop('my-store-graph') YIELD graphName;
If we want the procedure to fail silently on non-existing graphs, we can set a boolean flag as the second parameter to false. This will yield an empty result for non-existing graphs.
Try removing a graph from the catalog:
CALL gds.graph.drop('my-fictive-graph', false) YIELD graphName;
We can stream node properties stored in a named in-memory graph back to the user. This is useful if we ran multiple algorithms in mutate mode and want to retrieve some or all of the results. This is similar to what the stream execution mode does, but allows more fine-grained control over the operations.
Stream multiple node properties:
CALL gds.graph.streamNodeProperties('my-graph', ['componentId', 'pageRank', 'communityId'])
The above example requires all given properties to be present on at least one node projection, and the properties will be streamed for all such projections.
The procedure can be configured to stream just the properties for some specific node projections. In the following example, we ran an algorithm on a sub-graph and subsequently streamed the newly created property.
Stream node properties of a specific node projection:
CALL gds.graph.create('my-graph', ['A', 'B'], '*')
CALL gds.wcc.mutate('my-graph', {nodeLabels: ['A'], mutateProperty: 'componentId'})
CALL gds.graph.streamNodeProperties('my-graph', ['componentId'], ['A'])
When a list of specific projections (i.e., not '*') is given, as in the example above, a different validation and execution is applied: all of the given projections are required to have all of the given properties, and the properties will be streamed for all of those projections. If any of the given projections is '*', the procedure behaves as in the first example.
When streaming multiple node properties, the name of each property is included in the result. This adds some overhead, as each property name must be repeated for each node in the result, but it is necessary in order to distinguish properties. For streaming a single node property this is not necessary. gds.graph.streamNodeProperty() streams a single node property from the in-memory graph and omits the property name. The result has the format nodeId, propertyValue, as is familiar from the streaming mode of many algorithm procedures.
Stream a single node property:
CALL gds.graph.streamNodeProperty('my-graph', 'componentId')
We can stream relationship properties stored in a named in-memory graph back to the user. This is useful if we ran multiple algorithms in mutate mode and want to retrieve some or all of the results. This is similar to what the stream execution mode does, but allows more fine-grained control over the operations.
Stream multiple relationship properties:
CALL gds.graph.streamRelationshipProperties('my-graph', ['similarityScore', 'weight'])
The procedure can be configured to stream just the properties for some specific relationship projections. In the following example, we ran an algorithm on a sub-graph and subsequently streamed the newly created property.
Stream relationship properties of a specific relationship projection:
CALL gds.graph.create('my-graph', ['*'], ['A', 'B'])
CALL gds.nodeSimilarity.mutate('my-graph', {relationshipTypes: ['A'], mutateRelationshipType: 'R', mutateProperty: 'similarityScore'})
CALL gds.graph.streamRelationshipProperties('my-graph', ['similarityScore'], ['R'])
When a list of specific projections (i.e., not '*') is given, as in the example above, a different validation and execution is applied: all of the given projections are required to have all of the given properties, and the properties will be streamed for all of those projections. If any of the given projections is '*', the procedure behaves as in the first example.
When streaming multiple relationship properties, the name of the relationship type and of each property is included in the result. This adds some overhead, as each type name and property name must be repeated for each relationship in the result, but it is necessary in order to distinguish properties.
For streaming a single relationship property, the property name can be omitted from the result. gds.graph.streamRelationshipProperty() streams a single relationship property from the in-memory graph and omits the property name. The result has the format sourceNodeId, targetNodeId, relationshipType, propertyValue.
Stream a single relationship property:
CALL gds.graph.streamRelationshipProperty('my-graph', 'similarityScore')
Similar to streaming properties stored in an in-memory graph, it is also possible to write those properties back to Neo4j. This is similar to what the write execution mode does, but allows more fine-grained control over the operations. The properties to write are typically the mutateProperty values that were used when running algorithms. Properties that were added to the created graph at creation time will often already be present in the Neo4j database.
Write node properties to Neo4j:
CALL gds.graph.writeNodeProperties(
graphName: String,
nodeProperties: List<String>,
nodeLabels: List<String>,
configuration: Map
) YIELD
graphName: String,
nodeProperties: List<String>,
writeMillis: Integer,
propertiesWritten: Integer
Name | Type | Default | Optional | Description
---|---|---|---|---
graphName | String | | no | The name of a graph stored in the catalog.
nodeProperties | List<String> | | no | Names of properties to write.
nodeLabels | List<String> | | yes | Names of labels to write properties for.
configuration | Map | | yes | Configuration for algorithm-specifics and/or graph filtering.
Name | Type | Default | Optional | Description
---|---|---|---|---
concurrency | Integer | | yes | The number of concurrent threads used for writing the properties to Neo4j.
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the properties to Neo4j. If both concurrency and writeConcurrency are set, writeConcurrency takes precedence.
Name | Type | Description
---|---|---
graphName | String | Name of the graph.
nodeProperties | List<String> | Names of written properties.
writeMillis | Integer | Milliseconds for writing properties to Neo4j.
propertiesWritten | Integer | Number of properties written.
To write the properties 'componentId', 'pageRank' and 'communityId' for all node projections in the graph 'my-graph' using 8 concurrent threads, use the following query:
Write multiple node properties to Neo4j:
CALL gds.graph.writeNodeProperties(
'my-graph',
['componentId', 'pageRank', 'communityId'],
['*'],
{writeConcurrency: 8}
)
The above example requires all given properties to be present on at least one node projection, and the properties will be written for all such projections.
The procedure can be configured to write just the properties for some specific node projections. In the following example, we run an algorithm on a sub-graph and subsequently write the newly created property to Neo4j.
Write node properties of a specific node projection to Neo4j:
CALL gds.graph.create('my-graph', ['A', 'B'], '*')
CALL gds.wcc.mutate('my-graph', {nodeLabels: ['A'], mutateProperty: 'componentId'})
CALL gds.graph.writeNodeProperties('my-graph', ['componentId'], ['A'])
When a list of projections not including the star projection ('*') is specified, as in the example above, a different validation and execution is applied: all of the given projections are required to have all of the given properties, and the properties will be written to Neo4j for all of those projections. If any of the given projections is the star projection, the procedure behaves as in the first example.
We can write relationships stored in a named in-memory graph back to Neo4j. This can be used to write algorithm results (for example from Node Similarity) or relationships that have been aggregated during graph creation. The relationships to write are specified by a relationship type. This can either be an element identifier used in a relationship projection during graph construction or the writeRelationshipType used in algorithms that create relationships. Relationships are always written using a single thread.
Write relationships to Neo4j:
CALL gds.graph.writeRelationship('my-graph', 'SIMILAR_TO')
By default, no relationship properties will be written. To write relationship properties, these have to be explicitly specified.
Write relationships and their properties to Neo4j:
CALL gds.graph.writeRelationship('my-graph', 'SIMILAR_TO', 'similarityScore')
We can create new Neo4j databases from named in-memory graphs stored in the graph catalog. All nodes, relationships and properties present in an in-memory graph are written to a new Neo4j database. This includes data that has been projected in gds.graph.create and data that has been added by running algorithms in mutate mode. The newly created database will be stored in the Neo4j databases directory using a given database name.
The feature is useful, for example, for persisting the results of an analytical workflow in a separate, offline database.
Export a named graph to a new database in the Neo4j databases directory:
CALL gds.graph.export('my-graph', { dbName: 'mydatabase' })
The procedure yields information about the number of nodes, relationships and properties written.
Name | Type | Default | Optional | Description
---|---|---|---|---
dbName | String | | no | Name of the exported Neo4j database.
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the database.
enableDebugLog | Boolean | | yes | Prints debug information to Neo4j log files.
batchSize | Integer | | yes | Number of entities processed by one single thread at a time.
defaultRelationshipType | String | | yes | Relationship type used for '*' relationship projections.
The new database can be started using the database management commands. The database must not exist when using the export procedure; it needs to be created manually using the following commands.
After running the procedure, we can start a new database and query the exported graph:
:use system
CREATE DATABASE mydatabase;
:use mydatabase
MATCH (n) RETURN n;
We can export named in-memory graphs stored in the graph catalog to a set of CSV files. All nodes, relationships and properties present in an in-memory graph are exported. This includes data that has been projected with gds.graph.create and data that has been added by running algorithms in mutate mode. The location of the exported CSV files can be configured via the configuration parameter gds.export.location in neo4j.conf. All files will be stored in a subfolder using the specified export name. The export will fail if a folder with the given export name already exists.
Export a named graph to a set of CSV files:
CALL gds.beta.graph.export.csv('my-graph', {exportName: 'myExport'})
The procedure yields information about the number of nodes, relationships and properties written.
Name | Type | Default | Optional | Description
---|---|---|---|---
exportName | String | | no | Name of the folder to which the CSV files are exported.
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the CSV files.
defaultRelationshipType | String | | yes | Relationship type used for '*' relationship projections.
The format of the exported CSV files is based on the format that is supported by the Neo4j Admin import command. Nodes are exported into files grouped by the node labels, i.e., for every label combination that exists in the graph a set of export files is created. The naming schema of the exported files is nodes_LABELS_INDEX.csv, where LABELS is the ordered list of labels joined by _, and INDEX is a number between 0 and the configured concurrency. For each label combination one or more data files are created, as each exporter thread exports into a separate file. Additionally, each label combination produces a single header file, which contains a single line describing the columns in the data files. More information about the header files can be found in the CSV header format documentation.
For example, a graph with the label combinations :A, :B and :A:B might create the following files:
nodes_A_header.csv
nodes_A_0.csv
nodes_B_header.csv
nodes_B_0.csv
nodes_B_2.csv
nodes_A_B_header.csv
nodes_A_B_0.csv
nodes_A_B_1.csv
nodes_A_B_2.csv
The format of the relationship files is similar to that of the node files. Relationships are exported into files grouped by the relationship type. The naming schema of the exported files is relationships_TYPE_INDEX.csv, where TYPE is the relationship type and INDEX is a number between 0 and the configured concurrency. For each relationship type one or more data files are created, as each exporter thread exports into a separate file. Additionally, each relationship type produces a single header file, which contains a single line describing the columns in the data files.
For example, a graph with the relationship types :KNOWS and :LIVES_IN might create the following files:
relationships_KNOWS_header.csv
relationships_KNOWS_0.csv
relationships_LIVES_IN_header.csv
relationships_LIVES_IN_0.csv
relationships_LIVES_IN_2.csv
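For illustration, each header file contains a single line describing the columns of the corresponding data files, following the Neo4j Admin import header conventions. The property columns shown below are hypothetical and depend on the properties stored in the exported graph:
nodes_A_header.csv might contain:
:ID,componentId:long,pageRank:double
relationships_KNOWS_header.csv might contain:
:START_ID,:END_ID,weight:double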
Using the gds.beta.graph.export.csv.estimate procedure it is possible to estimate the required disk space of the exported CSV files. The estimation uses sampling to generate a more accurate estimate.
Estimate the required disk space for exporting a named graph to CSV files:
CALL gds.beta.graph.export.csv.estimate('my-graph', {exportName: 'myExport'})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory;
The procedure yields information about the required disk space.
Name | Type | Default | Optional | Description
---|---|---|---|---
exportName | String | | no | Name of the folder to which the CSV files are exported.
samplingFactor | Double | | yes | The fraction of nodes and relationships to sample for the estimation.
writeConcurrency | Integer | | yes | The number of concurrent threads used for writing the CSV files.
defaultRelationshipType | String | | yes | Relationship type used for '*' relationship projections.
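As noted above, the samplingFactor controls how much of the graph is inspected for the estimation. For example, to base the estimation on a larger sample, it can be passed explicitly (the value shown here is only illustrative):
CALL gds.beta.graph.export.csv.estimate('my-graph', {exportName: 'myExport', samplingFactor: 0.1})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory;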