Running jobs

Neo4j Graph Analytics offers a catalogue of algorithms which can be executed by running jobs.

The execution of a job is most commonly done in three steps: project, compute, and write. The project-compute-write pattern is a common pattern in graph processing, where you first project a graph from your data, then run computations on the graph, and finally write the results back to your data store.

When running a job, the user submits a json-like configuration with keys corresponding to the steps of the job:

{
  'project':  ...,
  'compute':  ...,
  'write':  ...
}

For each of the steps we will describe its workings and the part of the configuration above. The structure is in fact Snowflake SQL data of the type VARIANT. It behaves similar to json, but notably strings are represented by single quotes.

Alongside 'project', 'compute', and 'write' there is an optional defaultTablePrefix parameter that lets you simplify the input configuration you provide. More on this in the Default table prefix section below.

Project

Graph projection is the process of creating a graph from tables/views in a Snowflake database. A projection reads tables of two kinds, node tables and relationship tables, in order to create an in-memory graph optimized for running large-scale graph algorithms. The input tables must adhere to certain conventions, described below.

Node tables

A node table represents nodes in a graph. Each row of the table represents one node in the graph. The table must have a column named NODEID, which uniquely identifies nodes within the table. Supported data types for these node identifiers are BIGINT and VARCHAR. Any other column will be treated as properties assigned to nodes, so-called node properties.

Consider the following example of a node table:

CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.NODE_DATA (NODEID BIGINT, PROPERTY1 BIGINT, PROPERTY2 DOUBLE);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.NODE_DATA VALUES (1, 42, 13.37), (2, 43, 13.38), (3, 44, 13.39);

In the example, we have two node properties, PROPERTY1 and PROPERTY2. Supported data types for node properties are NUMBER, DOUBLE, FLOAT, and ARRAY.

Relationship tables

Relationship tables, must have two columns SOURCENODEID and TARGETNODEID storing references to node identifiers in node tables in the projection. Each row of the table represents one relationship in the graph, that is a connection from a source node to a target node. The SOURCENODEID and TARGETNODEID columns thus tell us which nodes are connected in the graph.

A relationship table can also have an additional column, which is treated as property of the relationship. This can be useful when running a weighted graph algorithm.

Consider the following example of a relationship table:

CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA (SOURCENODEID BIGINT, TARGETNODEID BIGINT, PROPERTY1 DOUBLE);
INSERT INTO EXAMPLE_DB.DATA_SCHEMA.RELATIONSHIP_DATA VALUES (1, 2, 1.0), (2, 3, 2.5);

The graph being projected consists of three nodes where the "first" is connected to the "second" with weight 1.0 and the "second" is connected to the "third" with weight 2.5. The SOURCENODEID and TARGETNODEID columns refer to the NODEID column in the node table above. This also means that they must have the same data type as the NODEID column. Note, that the SOURCENODEID and TARGETNODEID columns can have different types if they refer to different node tables. Any additional column is considered a relationship property. In the example, we have one property, PROPERTY1. Relationship properties must be of type DOUBLE.

Read access to tables

The projection part of any job requires the application to have read access privileges granted by the administrator.

Project configuration syntax

The syntax of the project section of the job configuration is

'project': {
    'defaultTablePrefix': OPTIONAL STRING,
    'nodeTables': LIST OF STRING,
    'relationshipTables': MAP FROM STRING TO MAP
}

where each element of nodeTables is a simple or fully qualified name of a node table and each key of relationshipTables is a simple or fully qualified name of a relationship table. The value for each entry in relationshipTables is a map of the form

{
    'sourceTable': STRING,
    'targetTable': STRING,
    'orientation': OPTIONAL STRING
}

where sourceTable and targetTable are simple or fully qualified names of tables which appear in node tables. Case-sensitivity is enforced when matching sourceTable and targetTable to the entries of nodeTables. If all or multiple tables or views in nodeTables, relationshipTables, sourceTable, and targetTable exist within the same schema, the defaultTablePrefix parameter can help reduce the amount of code. For orientation, see the explanation below.

For examples of project configuration, see the example below and the example in getting-started.

Mapping from tables to heterogeneous graph

The project configuration may contain multiple node tables and relationship tables. In fact, the projected graph is a node-labeled and relationship-typed graph. Such graphs are called heterogeneous graphs and certain algorithms can take advantage of the information carried by node labels and relationship types. Node labels and relationship types can also be used to filter the graph when calling an algorithm. That way we can run an algorithm on a subset of the graph.

The nodes arising from one of the node tables are all 'tagged' with a node label whose name is the simple name of the node table. For example db.schema.Person will generate the node label Person which is case-sensitive although the table name in snowflake is not.

The relationships arising from one of the relationship tables are all 'tagged' with a relationship type whose name is the simple name of the relationship table.

Note that due to the mapping of simple table name to a node label or a relationship type, it is not possible to use multiple node tables with the same simple name, neither can two relationship tables have the same simple name. Both node labels and relationship types are case-sensitive, so it is for example all right to project both person and PERSON labels.

We mentioned earlier that each node table must have unique values in its NODEID column. When multiple node tables are used, it is not necessary that node identifiers are unique across tables.

Orientation

In Project configuration syntax, we mentioned that each relationship table can be projected using an orientation. Orientation can have one of the values NATURAL, UNDIRECTED and REVERSE. When no orientation is specified the default NATURAL is used. The NATURAL orientation implies including relationships in the direction they are described in the table, from source node to target node. The REVERSE orientation instead reverses the direction, from target to source, for each relationship in the table before including it. Finally, the UNDIRECTED orientation is the union of the former two orientations, that is, each relationship is included in the original and also included in the reverse direction.

Compute

The compute step is the heart of the job, and consists of running one of the supported algorithms. The corresponding compute part of the configuration is algorithm-specific. However, certain configuration options such as mutateProperty are shared by multiple algorithms. Most algorithm produce results which are stored by mutating the in-memory graph, either by producing a node property or creating relationships between nodes. A few algorithms instead produce and store machine learning models that can be used in subsequent jobs for prediction.

Write

For the algorithms that produce graphy output data, that is, node properties or new relationships, the third and final write step must be configured. The page for each algorithm specifies the type of output, for example the table about write configuration contains nodeLabel if node properties are produced and sourceLabel and targetLabel if relationships are produced. We begin by describing the case of algorithms that produce node properties.

Node properties

Some algorithms output node properties as their results. To write these node properties back to Snowflake, you need to specify a list of configuration maps. Each map contains one node label and the name of one output table.

The general syntax is

    'write': LIST OF MAP

, where each map is of the form

    {'nodeLabel': STRING, 'outputTable': STRING}

and each map mandates the writing of all nodes of one node label to an output table. The outputTable values must be unique across entries. For each entry, the node label must match (case-sensitive) one of the node labels in the table to graph mapping. The written output table contains a NODEID column and an additional column which holds the node property to be written.

The following example illustrates how write results from e.g. WCC are written back to a new table:

    'write': [
        {'nodeLabel': 'CUSTOMERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS_COMPONENTS'}
    ]

Since the application will write to a table, in this case EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS_COMPONENTS, the application must be granted write privileges prior to running the job.

If the output table already exists, the application will overwrite the table if it has privileges to do so.

Relationships properties

We will now discuss the writing of relationships that have been created by an algorithm. Algorithms that output relationship properties, create relationships in the projected in-memory graph. These relationships can be grouped according to the node labels of source and target nodes of each relationship. Since no algorithms currently produce relationships of different types, there is no relationship type exposed on the output relationships.

The write configuration abides by the following syntax

    'write': LIST OF MAP

, where each map is of the form

    {'sourceLabel': STRING, 'targetLabel': STRING, 'outputTable': STRING}

The sourceLabel and targetLabel values determine the subset of relationships to write according to the grouping by source and target node labels. The output table contains SOURCENODEID and TARGETNODEID columns. It also contains an additional column which holds the relationship property to be written, if the algorithm produces a property on the relationships.

The following example demonstrates how to write back relationships computed by e.g. the Node Similarity algorithm, including its computed property (SCORE):

    'write': [
        {'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES'}
    ]

Since the application will write to a table, in this case EXAMPLE_DB.DATA_SCHEMA.ORDER_SIMILARITIES, the application must be granted write privileges prior to running the job.

If the output table already exists, the application will overwrite the table if it has privileges to do so.

Example of running a job

Let’s consider a more complex example with multiple node and relationship tables.

The following tables represent a subset of a transactional shop system. Usually, the tables in a real-world scenario would contain more columns and rows. Here, creating a VIEW to transform the source tables into a format that can be used for graph projection.

CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PRODUCTS  (nodeId Number, price Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.ORDERS    (nodeId Number, total Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.CUSTOMERS (nodeId Number);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.LINEITEM  (sourceNodeId Number, targetNodeId Number, quantity Double);
CREATE TABLE EXAMPLE_DB.DATA_SCHEMA.PURCHASE  (sourceNodeId Number, targetNodeId Number);

To create a graph from these tables, we need to define the node and relationship tables and their mappings. An example is the Node Similarity algorithm, which can be used to find similar products based on the LINEITEM and PURCHASE relationships. The algorithm will result in new relationships that connects similar nodes of various labels. The relationships will contain a property similarity with a score for how similar the connected nodes are to each other. Lastly we write back the product-to-product and order-to-order similarity relationship to Snowflake tables for further analysis.

Please refer to the grants needed by the application and by the consumer role.

Please also note that we use the default application name Neo4j_Graph_Analytics in the following example. If you chose a different app name during installation, please replace it with that.

CALL Neo4j_Graph_Analytics.graph.node_similarity('CPU_X64_XS', {
    'defaultTablePrefix': 'EXAMPLE_DB.DATA_SCHEMA',
    'project': {
        'nodeTables': ['PRODUCTS', 'ORDERS', 'CUSTOMERS'],
        'relationshipTables': {
            'LINEITEM': {
                'sourceTable': 'ORDERS',
                'targetTable': 'PRODUCTS'
            },
            'PURCHASE': {
                'sourceTable': 'CUSTOMERS',
                'targetTable': 'PRODUCTS'
            }
        }
    },
    'compute': {'topK': 2},
    'write': [
        {'sourceLabel': 'CUSTOMERS', 'targetLabel': 'CUSTOMERS', 'outputTable': 'CUSTOMER_SIMILARITIES'},
        {'sourceLabel': 'ORDERS', 'targetLabel': 'ORDERS', 'outputTable': 'ORDER_SIMILARITIES'}
    ]
});

Please note that the defaultTablePrefix here is used to specify a default database and schema for all tables in the configuration. See the section Default table prefix for more on this.

Default table prefix

An optional defaultTablePrefix top level configuration parameter can be used to specify a default table prefix for all tables in the configuration. It should be given as a VARCHAR in the format <database>.<schema>. This allows you to omit the database and schema names from the table names in the configuration. If used, it applies to all parts of the configuration, including project, compute, and write configs. Even though the defaultTablePrefix is used, a table from any other database and schema can be used by providing the fully qualified table name in the configuration.