Running jobs
Neo4j Graph Analytics for Snowflake is in Public Preview and is not intended for production use. |
Project
Graph projection is the process of creating a graph from tables/views in a Snowflake database. The tables represent nodes and relationships, rows represent entities, and columns represent identifiers and properties. Graph projection copies the content of the tables into a named in-memory graph. The in-memory graph is a data representation optimized for running large-scale graph algorithms.
Node tables
A node table is a table that represents nodes in a graph.
Each row in the table represents a node entity.
A node is uniquely identified within the table by a node identifier (nodeId
).
Consider the following example of a node table:
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.NODE_DATA (nodeId Number, property1 Number, property2 Double);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODE_DATA VALUES (1, 42, 13.37), (2, 43, 13.38), (3, 44, 13.39);
The nodeId
column uniquely identifies each node in the table.
Supported data types for node identifiers are Number
(BIGINT
) and Varchar
.
Any additional column is considered a node property.
In the example, we have two properties, property1
and property2
.
Supported data types for node properties are Number
, Double
, Float
, and Array
.
Relationship tables
A relationship table is a table that represents relationships between nodes in a graph.
Each row in the table represents a relationship entity.
A relationship is not required to have a unique identifier.
Instead, it must have a source node identifier (sourceNodeId
) and a target node identifier (targetNodeId
).
The values in those columns refer to node identifiers in node tables.
Consider the following example of a relationship table:
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA (sourceNodeId Number, targetNodeId Number, property1 Double);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA VALUES (1, 2, 42), (2, 3, 43);
The sourceNodeId
and targetNodeId
columns refer to the nodeId
column in the node table.
This also means that they must have the same data type as the nodeId
column.
Note, that the sourceNodeId
and targetNodeId
columns can have different types if they refer to different node tables.
Any additional column is considered a relationship property.
In the example, we have one property, property1
.
Relationship properties must be of type Double
.
Granting access
Before we can create a graph, we need to grant read access to the tables to the application.
This is done by granting the USAGE
privilege on the database and schema and the SELECT
privilege on the tables.
The following example grants access to the Neo4j_Graph_Analytics
application:
GRANT USAGE ON DATABASE EXAMPLE_DB TO APPLICATION Neo4j_Graph_Analytics;
GRANT USAGE ON SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
GRANT SELECT ON ALL TABLES IN SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
For more on this, please see the Getting started page.
Graph projection
When running an algorithm in Neo4j Graph Analytics for Snowflake, the first step is always loading data from your tables into an in-memory data structure.
We rely on the project
section of the call configuration to facilitate that.
It is there that you specify the node and relationship tables to load from when creating the in-memory graph.
The following example creates a graph from the NODE_DATA
and RELATIONSHIP_DATA
tables:
'project': {
'nodeTables': ['EXAMPLE_DB.DATA_SCHEMA.NODE_DATA'],
'relationshipTables': {
'EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA': {
'sourceTable': 'EXAMPLE_DB.DATA_SCHEMA.NODE_DATA',
'targetTable': 'EXAMPLE_DB.DATA_SCHEMA.NODE_DATA',
'orientation': 'NATURAL'
}
}
}
In the project configuration section, the nodeTables
parameter specifies the node table mappings.
A node table mapping consists of a table name and a node label.
There is a one-to-one mapping between node tables and node labels.
The relationshipTables
parameter specifies the relationship table mappings.
Each relationship table mapping consists of a table name and its own configuration object.
The sourceTable
and targetTable
parameters specify the node tables that the relationship table refers to.
The orientation
parameter is optional and specifies the relationship orientation, here NATURAL
.
Node labels and relationship types are used to group nodes and relationships in the in-memory graph. They can be used to filter the graph when calling an algorithm. That way we can run an algorithm on a subset of the graph.
If we want to project relationships using a different orientation, we can specify that in the configuration.
Possible values are NATURAL
(default), UNDIRECTED
and REVERSE
.
If all necessary tables or views nodeTables
, relationshipTables
, sourceTable
, and targetTable
exist within the same schema, the defaultTablePrefix
parameter can help reduce the amount of code required.
'project': {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'nodeTables': ['NODE_DATA'],
'relationshipTables': {
'RELATIONSHIP_DATA': {
'sourceTable': 'NODE_DATA',
'targetTable': 'NODE_DATA',
'orientation': 'NATURAL'
}
}
}
This simplified configuration object can be used alongside defining all required projection tables as views within the same schema, allowing them to be referenced in the configuration without specifying the schema name.
It is highly recommended that the values in the nodeId
column are unique (e.g. by using DISTINCT
in a SELECT
statement).
Neo4j Graph Analytics for Snowflake only works with unique identifiers and will deduplicate them internally only projecting the first occurrence of a node, ignoring all others.
By deduplicating before projecting, you can control how the properties are deduplicated as well as expect a faster projection, as less data needs to be read from the node query.
Multi-table example
Let’s consider a more complex example with multiple node and relationship tables.
The following tables represent a subset of a transactional shop system.
Usually, the tables in a real-world scenario would contain more columns and rows.
Here, creating a VIEW
to transform the source tables into a format that can be used for graph projection.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PRODUCTS (nodeId Number, price Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.ORDERS (nodeId Number, total Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS (nodeId Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.LINEITEM (sourceNodeId Number, targetNodeId Number, quantity Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PURCHASE (sourceNodeId Number, targetNodeId Number);
To create a graph from these tables, we need to define the node and relationship tables and their mappings.
An example is the Node Similarity algorithm, which can be used to find similar products based on the CONTAINS
relationship.
We can use the labels to just run the algorithm on the Product
and Order
nodes.
The algorithm will result in new relationships called SIMILAR_TO
, that connects similar products.
The relationships will contain a property similarity
with a score for how string the connection is.
Lastly we write the relationships back to Snowflake tables for further analysis.
For the required setup of grants your consumer role and environment, please see the Getting started page.
Please also note that we use the default application name Neo4j_Graph_Analytics
in the following example.
If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
'project': {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'nodeTables': [ 'PRODUCTS', 'ORDERS', 'CUSTOMERS' ],
'relationshipTables': {
'LINEITEM': {
'sourceTable': 'ORDERS',
'targetTable': 'PRODUCTS'
},
'PURCHASE': {
'sourceTable': 'CUSTOMERS',
'targetTable': 'PRODUCTS'
}
}
},
'compute': { 'topK': 2 },
'write': [
{ 'sourceLabel': 'CUSTOMERS', 'targetLabel': 'CUSTOMERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.CUSTOMER_SIMILARITIES' },
{ 'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES' }
]
});
Write
The different algorithms have different results types: sometimes node properties, sometimes relationships, etc. Below we describe how to deal with each one in turn.
Granting permissions
In order to write data into new tables, we need to grant the CREATE
privilege on the schema to the application.
The following example grants the CREATE
privilege on the DATA_SCHEMA
schema to the Neo4j_Graph_Analytics
application:
GRANT CREATE ON SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
For more on this, please see the Getting started page.
Node properties
Some algorithms output node properties as their results. To write these node properties back to Snowflake, you need to specify both the label and the target table. The following example illustrates how results from e.g. WCC are written back to a new table:
'write': [
{ 'nodeLabel': 'CUSTOMERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS_COMPONENTS'}
]
The nodeLabel
defines for which node table mapping we want to write back the properties.
The corresponding properties will be stored in the outputTable
.
Relationships properties
For algorithms that output relationship properties, specify the desired node label and output table. The following example demonstrates how to write back the relationships computed by the Node Similarity algorithm, including the computed similarity property (score):
'write': [
{ 'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES' }
]
The table will be newly created and a potential existing table with the same name will be overridden.
Note that the write section of an algorithm call can contain multiple entries. It is a list, and all the write configurations are separated by commas, while the list itself is enclosed in square brackets.