Running jobs
Neo4j Graph Analytics offers a catalogue of algorithms that can be executed by running jobs.
The execution of a job is most commonly done in three steps: project, compute, and write. This project-compute-write pattern is common in graph processing: you first project a graph from your data, then run computations on the graph, and finally write the results back to your data store.
When running a job, the user submits a JSON-like configuration with keys corresponding to the steps of the job:
{
'project': ...,
'compute': ...,
'write': ...
}
For each of the steps, we will describe its workings and its part of the configuration above.
The structure is in fact Snowflake SQL data of the type VARIANT.
It behaves similarly to JSON, but notably strings are represented by single quotes.
Project
Graph projection is the process of creating a graph from tables/views in a Snowflake database. A projection reads tables of two kinds, node tables and relationship tables, in order to create an in-memory graph optimized for running large-scale graph algorithms. The input tables must adhere to certain conventions, described below.
Node tables
A node table represents nodes in a graph.
Each row of the table represents one node in the graph.
The table must have a column named NODEID, which uniquely identifies nodes within the table.
Supported data types for these node identifiers are BIGINT and VARCHAR.
Any other column is treated as a property assigned to the nodes, a so-called node property.
Consider the following example of a node table:
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.NODE_DATA (NODEID BIGINT, PROPERTY1 BIGINT, PROPERTY2 DOUBLE);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODE_DATA VALUES (1, 42, 13.37), (2, 43, 13.38), (3, 44, 13.39);
In the example, we have two node properties, PROPERTY1 and PROPERTY2.
Supported data types for node properties are NUMBER, DOUBLE, FLOAT, and ARRAY.
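For instance, a node table with an ARRAY property, such as an embedding vector, might look as follows (the table and column names here are illustrative, not part of the examples above):

```sql
-- Illustrative node table with an ARRAY node property.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.NODE_EMBEDDINGS (NODEID BIGINT, EMBEDDING ARRAY);

-- ARRAY values are built with ARRAY_CONSTRUCT; since VALUES cannot hold
-- non-constant expressions, we insert via SELECT.
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODE_EMBEDDINGS
  SELECT 1, ARRAY_CONSTRUCT(0.1, 0.2, 0.3);
```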
Relationship tables
A relationship table must have two columns, SOURCENODEID and TARGETNODEID, storing references to node identifiers in the node tables of the projection.
Each row of the table represents one relationship in the graph, that is, a connection from a source node to a target node.
The SOURCENODEID and TARGETNODEID columns thus tell us which nodes are connected in the graph.
A relationship table can also have an additional column, which is treated as a property of the relationship. This can be useful when running a weighted graph algorithm.
Consider the following example of a relationship table:
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA (SOURCENODEID BIGINT, TARGETNODEID BIGINT, PROPERTY1 DOUBLE);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA VALUES (1, 2, 1.0), (2, 3, 2.5);
The graph being projected consists of three nodes, where the "first" is connected to the "second" with weight 1.0 and the "second" is connected to the "third" with weight 2.5.
The SOURCENODEID and TARGETNODEID columns refer to the NODEID column in the node table above.
This also means that they must have the same data type as the NODEID column.
Note that the SOURCENODEID and TARGETNODEID columns can have different types if they refer to different node tables.
Any additional column is considered a relationship property.
In the example, we have one property, PROPERTY1.
Relationship properties must be of type DOUBLE.
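As an illustration of the mixed-type case (all names below are invented for this sketch), the two endpoint columns may reference node tables with different identifier types:

```sql
-- Node table keyed by BIGINT identifiers.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.ACCOUNTS (NODEID BIGINT);
-- Node table keyed by VARCHAR identifiers.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.DEVICES (NODEID VARCHAR);

-- SOURCENODEID matches ACCOUNTS.NODEID (BIGINT), while
-- TARGETNODEID matches DEVICES.NODEID (VARCHAR).
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.USES (SOURCENODEID BIGINT, TARGETNODEID VARCHAR);
```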
Read access to tables
The projection part of any job requires the application to have read access privileges granted by the administrator.
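As a sketch of what such grants look like in standard Snowflake syntax (the database, schema, and application names are illustrative, and the exact set of required grants may differ; consult the application's documentation):

```sql
-- Illustrative read grants for the application.
GRANT USAGE ON DATABASE EXAMPLE_DB TO APPLICATION Neo4j_Graph_Analytics;
GRANT USAGE ON SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
GRANT SELECT ON ALL TABLES IN SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
```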
Project configuration syntax
The syntax of the project section of the job configuration is
'project': {
'defaultTablePrefix': OPTIONAL STRING,
'nodeTables': LIST OF STRING,
'relationshipTables': MAP FROM STRING TO MAP
}
, where each element of nodeTables is a simple or fully qualified name of a node table, and each key of relationshipTables is a simple or fully qualified name of a relationship table.
The value for each entry in relationshipTables is a map of the form
{
'sourceTable': STRING,
'targetTable': STRING,
'orientation': OPTIONAL STRING
}
, where sourceTable and targetTable are simple or fully qualified names of tables that appear in nodeTables.
Case-sensitivity is enforced when matching sourceTable and targetTable to the entries of nodeTables.
If all or multiple tables or views in nodeTables, relationshipTables, sourceTable, and targetTable exist within the same schema, the defaultTablePrefix parameter can help reduce the amount of code.
For orientation, see the explanation below.
For examples of project configuration, see the example below and the example in getting-started.
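For instance, the node and relationship tables defined earlier could be projected with the following configuration (a sketch; defaultTablePrefix saves repeating the schema prefix in every table name):

```sql
'project': {
  'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
  'nodeTables': [ 'NODE_DATA' ],
  'relationshipTables': {
    'RELATIONSHIP_DATA': {
      'sourceTable': 'NODE_DATA',
      'targetTable': 'NODE_DATA'
    }
  }
}
```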
Mapping from tables to heterogeneous graph
The project configuration may contain multiple node tables and relationship tables.
In fact, the projected graph is a node-labeled and relationship-typed graph.
Such graphs are called heterogeneous graphs and certain algorithms can take advantage of the information carried by node labels and relationship types.
Node labels and relationship types can also be used to filter the graph when calling an algorithm.
That way we can run an algorithm on a subset of the graph.
The nodes arising from one of the node tables are all 'tagged' with a node label whose name is the simple name of the node table.
For example, db.schema.Person will generate the node label Person, which is case-sensitive although the table name in Snowflake is not.
The relationships arising from one of the relationship tables are all 'tagged' with a relationship type whose name is the simple name of the relationship table.
Note that due to the mapping of simple table names to node labels and relationship types, it is not possible to use multiple node tables with the same simple name, nor can two relationship tables have the same simple name.
Both node labels and relationship types are case-sensitive, so it is, for example, all right to project both person and PERSON labels.
We mentioned earlier that each node table must have unique values in its NODEID column.
When multiple node tables are used, it is not necessary that node identifiers are unique across tables.
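For instance (an illustrative sketch), the two node tables below can coexist in one projection even though both contain the identifier 1:

```sql
-- Identifiers must be unique within each table,
-- but may repeat across tables.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PERSONS (NODEID BIGINT);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.CITIES (NODEID BIGINT);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.PERSONS VALUES (1), (2);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.CITIES VALUES (1), (2);
```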
Orientation
In Project configuration syntax, we mentioned that each relationship table can be projected using an orientation.
Orientation can have one of the values NATURAL, UNDIRECTED, and REVERSE.
When no orientation is specified, the default NATURAL is used.
The NATURAL orientation implies including relationships in the direction they are described in the table, from source node to target node.
The REVERSE orientation instead reverses the direction, from target to source, for each relationship in the table before including it.
Finally, the UNDIRECTED orientation is the union of the former two orientations, that is, each relationship is included in the original and also in the reverse direction.
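As a sketch, projecting the relationship table from the earlier example as undirected would look like this:

```sql
'relationshipTables': {
  'RELATIONSHIP_DATA': {
    'sourceTable': 'NODE_DATA',
    'targetTable': 'NODE_DATA',
    'orientation': 'UNDIRECTED'
  }
}
```

With this orientation, each row of RELATIONSHIP_DATA contributes a relationship in both directions.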
Compute
The compute step is the heart of the job, and consists of running one of the supported algorithms.
The corresponding compute part of the configuration is algorithm-specific.
However, certain configuration options such as mutateProperty are shared by multiple algorithms.
Most algorithms produce results which are stored by mutating the in-memory graph, either by producing a node property or by creating relationships between nodes.
A few algorithms instead produce and store machine learning models that can be used in subsequent jobs for prediction.
Write
For the algorithms that produce graph output data, that is, node properties or new relationships, the third and final write step must be configured.
The page for each algorithm specifies the type of output; for example, the table about write configuration contains nodeLabel if node properties are produced, and sourceLabel and targetLabel if relationships are produced.
We begin by describing the case of algorithms that produce node properties.
Node properties
Some algorithms output node properties as their results. To write these node properties back to Snowflake, you need to specify a list of configuration maps. Each map contains one node label and the name of one output table.
The general syntax is
'write': LIST OF MAP
, where each map is of the form
{ 'nodeLabel': STRING, 'outputTable': STRING }
, and each map mandates the writing of all nodes of one node label to an output table.
The outputTable values must be unique across entries.
For each entry, the node label must match (case-sensitive) one of the node labels in the table-to-graph mapping.
The written output table contains a NODEID column and an additional column which holds the node property to be written.
The following example illustrates how results from e.g. WCC are written back to a new table:
'write': [
{ 'nodeLabel': 'CUSTOMERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS_COMPONENTS'}
]
Since the application will write to a table, in this case EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS_COMPONENTS, the application must be granted write privileges prior to running the job.
If the output table already exists, the application will overwrite the table if it has privileges to do so.
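As a sketch of such a write grant in standard Snowflake syntax (names are illustrative, and the exact grant set may differ; consult the application's documentation):

```sql
-- Illustrative write grant: the application needs to create
-- (and possibly replace) tables in the target schema.
GRANT CREATE TABLE ON SCHEMA EXAMPLE_DB.DATA_SCHEMA TO APPLICATION Neo4j_Graph_Analytics;
```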
Relationship properties
We will now discuss the writing of relationships that have been created by an algorithm. Algorithms that output relationship properties create relationships in the projected in-memory graph. These relationships can be grouped according to the node labels of the source and target nodes of each relationship. Since no algorithm currently produces relationships of different types, no relationship type is exposed on the output relationships.
The write configuration abides by the following syntax
'write': LIST OF MAP
, where each map is of the form
{ 'sourceLabel': STRING, 'targetLabel': STRING, 'outputTable': STRING }
.
The sourceLabel and targetLabel values determine the subset of relationships to write according to the grouping by source and target node labels.
The output table contains SOURCENODEID and TARGETNODEID columns.
It also contains an additional column which holds the relationship property to be written, if the algorithm produces a property on the relationships.
The following example demonstrates how to write back relationships computed by e.g. the Node Similarity algorithm, including its computed property (SCORE):
'write': [
{ 'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES' }
]
Since the application will write to a table, in this case EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES, the application must be granted write privileges prior to running the job.
If the output table already exists, the application will overwrite the table if it has privileges to do so.
Example of running a job
Let’s consider a more complex example with multiple node and relationship tables.
The following tables represent a subset of a transactional shop system.
Usually, the tables in a real-world scenario would contain more columns and rows.
In such a case, creating a VIEW can transform the source tables into a format that can be used for graph projection.
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PRODUCTS (nodeId Number, price Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.ORDERS (nodeId Number, total Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS (nodeId Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.LINEITEM (sourceNodeId Number, targetNodeId Number, quantity Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PURCHASE (sourceNodeId Number, targetNodeId Number);
To create a graph from these tables, we need to define the node and relationship tables and their mappings.
An example is the Node Similarity algorithm, which can be used to find similar products based on the LINEITEM and PURCHASE relationships.
The algorithm will result in new relationships that connect similar nodes of various labels.
The relationships will contain a property similarity with a score for how similar the connected nodes are to each other.
Lastly, we write the customer-to-customer and order-to-order similarity relationships back to Snowflake tables for further analysis.
Please refer to the grants needed by the application and by the consumer role.
Please also note that we use the default application name Neo4j_Graph_Analytics in the following example.
If you chose a different app name during installation, please replace it with that.
CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
'project': {
'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
'nodeTables': [ 'PRODUCTS', 'ORDERS', 'CUSTOMERS' ],
'relationshipTables': {
'LINEITEM': {
'sourceTable': 'ORDERS',
'targetTable': 'PRODUCTS'
},
'PURCHASE': {
'sourceTable': 'CUSTOMERS',
'targetTable': 'PRODUCTS'
}
}
},
'compute': { 'topK': 2 },
'write': [
{ 'sourceLabel': 'CUSTOMERS', 'targetLabel': 'CUSTOMERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.CUSTOMER_SIMILARITIES' },
{ 'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES' }
]
});
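Once the job completes, the output tables can be inspected with ordinary SQL. A sketch follows; the name of the similarity property column is an assumption here, so check the written table's actual columns:

```sql
-- Inspect the most similar order pairs.
SELECT SOURCENODEID, TARGETNODEID, SIMILARITY
FROM EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES
ORDER BY SIMILARITY DESC
LIMIT 10;
```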