FastPath — A path embedding algorithm

Neo4j Graph Analytics for Snowflake is in Public Preview and is not intended for production use.

Elevator pitch

FastPath is a light-weight path embedding algorithm operating on temporal graphs. It computes embeddings for base nodes that are tied to event sequences. Events are associated to vectors depending on the event type and (continuously) on the elapsed time since the event occurred. These vectors are aggregated into embeddings that can be used in machine learning tasks, like customer journey analysis, and generalize across different time periods.

Short note on graph terminology

The FastPath algorithm conceptually works on graphs, so this documentation speaks of graphs, nodes, relationships and properties. Moreover, the graphs are heterogeneous, so nodes are tagged with node labels and relationships have types. The correspondence between graphs and actual Snowflake tables is described in the sections Node tables and Relationship tables.

Introduction

FastPath is a path embedding algorithm. This means that the input graph is a temporal graph consisting of

  • base nodes — Nodes of interest that the algorithm aims to compute embeddings for

  • event nodes — Nodes marked with time stamp and associated with base nodes

There are two supported graph schemas that encode the event chains.

The event nodes can hold information regarding what the event entails. This information can be in the form of feature vectors, categorical properties or relationships to context nodes.

In summary, the input consists of a collection of base nodes, each of which have a time series of information carrying events.

For each base node, an output time is specified. The output time governs two concerns of the embedding. First, only events up to the output time are used to arrive at the embedding. Second, the algorithm uses elapsed time rather than absolute time to compute the embeddings. When determining the contribution of an event on the embedding of a base node, the elapsed time is the amount of time between the event and the output time of the base node.

Let’s consider an example to make this more clear: One of the base nodes is a person called "Joe" and one of the events represents the fact that "Joe" consumed a dose of aspirin. The algorithm will store

  • the type of event — here "taking of aspirin"

  • the elapsed time — here 7 days

Further, each pair of event type and elapsed time, is associated with a "semi-random" vector. The vectors are constructed using both randomness, similar to the so called FastRP algorithm, and a time smoothing mechanism. This results in vectors for "taking of aspirin 7 days ago" being (relatively) similar to "taking of aspirin 8 days ago".

The embedding for Joe will be a weighted sum of the vector of "taking of aspirin 7 days ago" and vectors of all other events that occurred for Joe. How the weighting works is described in more detail below.

It may seem that the above only gives an intuition for discreet event information like categorical properties and relationships to context nodes. However, feature vectors on events can be thought of as a weighted sum of discreet events corresponding to the components of the vector.

The output node embeddings can for example be used in downstream machine learning tasks for solving problems related to e.g., customer journey. Because of the reliance on relative time rather than absolute time, such models can generalize from embeddings of past graphs to current graphs. In addition, it holds that FastPATH is equivariant over time; If all events for a base node are shifted by T time steps and the output time is also shifted by T time steps (in the same direction), the embedding of the base node stays the same.

Endpoint

The endpoint name is graph.fastpath, and it takes two positional arguments as input.

The first argument is a VARCHAR that specifies which compute pool to use. For this algorithm a CPU compute pool is recommended, since the algorithm does not utilize GPUs.

The second argument is a JSON configuration map. This JSON must contain the three following keys:

Name

Type

Default

Optional

Description

project

Map

n/a

no

Configuration for the input graph

compute

Map

n/a

no

Configuration for algorithm-specific parameters

write

List

n/a

no

Configuration for writing back algorithm results

Configuration

In this section, we describe the configuration parameters that must be provided to the endpoint.

Input graph configuration

The project map contains the following key pairs:

Name

Type

Default

Optional

Description

nodeTables

List

n/a

no

A list of table names, which each represent a node label in the input graph

relationshipTables

Map

n/a

no

A map from table names representing relationship types, to maps of configuration for that relationship type in the input graph (details below)

defaultTablePrefix

String

n/a

yes

A default database and schema prefix to use for table names in the input graph. Should be of the format "<database>.<schema>"

If a defaultTablePrefix is not provided, all table names must be qualified with a database and schema name. That is, they should be given a strings of the format "<database>.<schema>.<table>". If a defaultTablePrefix is provided, table names may also be given as "<table>", in which case the prefix will be prepended to them.

All provided node tables and relationship tables must have unique names. Not only must fully qualified table names be unique, but the table names themselves must also be unique.

Node tables

The nodeTables list of the project map must contain an entry for each type of node in the input graph. In each such table, nodes are represented by rows. There must be at least one column in each table; one that represents the node ID, and this column must be named nodeid (case insensitive). Each node ID must be unique within its table. The type of the nodeid column must be either BIGINT or VARCHAR. In addition to the nodeid column, the table may contain additional columns that represent node properties, for example features of the nodes.

Relationship tables

The relationshipTables map of the project must contain an entry for each type of relationship in the input graph. Each key in the relationshipTables map is the name of the table containing the relationships for one type of relationships, and each value is a map of configuration for that type of relationship. The configuration map for each relationship type looks like the following:

Name

Type

Default

Optional

Description

sourceTable

String

n/a

no

The name of the table that contains the source nodes of the relationships

targetTable

String

n/a

no

The name of the table that contains the target nodes of the relationships

orientation

String

"NATURAL"

yes

How to interpret the orientation (direction) of the provided relationships. Possible values are "NATURAL", "REVERSE" and "UNDIRECTED"

In each provided relationship table, relationships are represented by rows. There are exactly two columns that must be present in each relationship table: sourcenodeid and targetnodeid (both case-insensitive). These specify the source and target nodes of the relationship, respectively, and should correspond to the node IDs in the provided source and target tables. The type of the sourcenodeid and targetnodeid columns must be either BIGINT or VARCHAR.

The orientation parameter specifies how to interpret the direction of the relationships. By default, relationships are interpreted as having the "NATURAL" orientation, meaning that they are assumed to be directed from the source node to the target node. If the orientation is set to "REVERSE", the relationships are interpreted as being directed from the target node to the source node. And if the orientation is set to "UNDIRECTED", the relationships are interpreted as being undirected, meaning that they are symmetric and can be traversed in either direction (independently of which node is the source and which is the target).

Algorithm configuration

The following parameters can be configured for the FastPath endpoint:

Configuration parameter

Type

Default

Optional

Description

baseNodeLabel

String

n/a

No

Node label for which output embeddings are desired

eventNodeLabel

String

n/a

No

Node label of events related to base nodes

contextNodeLabel

String

None

Yes

Node label of context nodes that describe events

timeNodeProperty

String

None

Yes

Name of a node property representing time on the event nodes. May be int or float.

nextRelationshipType

String

None

Yes

Name of a relationship type between event nodes, indicating order of events

firstRelationshipType

String

None

Yes

Name of a relationship type from base nodes to event nodes, indicating the first event for each base node

eventFeatures

String

None

Yes

Name of a node property on event nodes which holds numerical features in vector form

categoricalEventProperties

List of String

None

Yes

Names of node properties on event nodes which hold categorical values or lists of categorical values

ignoredEventCategory

Integer

-1

Yes

A category value representing that a category is missing, and is ignored when constructing the event embedding

outputTime

Float

None

Yes

The timestamp at which embeddings should be produced. Events at this timestamp or later are not processed.

outputTimeProperty

String

None

Yes

Name of a node property on base nodes which holds a number indicating for what timestamp an output embedding should be produced

numElapsedTimes

Integer

n/a

No

The number of times in the "grid" . Elapsed time from event to output time is rounded to this grid. If set to 1, timestamps of events have no effect.

decayFactor

Float

1.0

Yes

Coefficient controlling the speed of decay of influence of older events

maxElapsedTime

Integer

n/a

No

The maximum age of events, relative to the output time, that are considered for the embedding.

smoothingWindow

Integer

0

Yes

Event embeddings are aggregated over a window containing up to 2*smoothingWindow + 1 grid times.

smoothingRate

Float

0.0

Yes

Controls how fast expected similarity of events decays as their time-distance increases.

dimension

Integer

n/a

No

The output dimension of the embeddings

randomSeed

Integer

Random number

Yes

A random seed which is used for all randomness in computing the embeddings.

Configuration concepts explained

This section goes into more detail about how to configure certain aspects of the algorithm.

Graph schema and relationship configuration

The algorithm supports two different schemas for connecting base nodes to event nodes.

  1. To use the first schema configuration you provide a firstRelationshipType and nextRelationshipType, and timeNodeProperty is optional. With this configuration each base and event node is assumed to be part of a path of the pattern: (:baseNodeLabel)-[:firstRelationshipType]-(:eventNodeLabel)-[:nextRelationshipType]-…​-[:nextRelationshipType]→(:eventNodeLabel) image4

  2. To use the second schema configuration, you provide only a timeNodeProperty. It’s then assumed that each base node has outgoing relationships to each of its event nodes: (:baseNodeLabel)-[:]→(:eventNodeLabel)

image5

Optionally, the schema can also contain a node label called contextNodeLabel, see Event input features below.

Defining event timestamps

There are two ways of specifying timestamps for event nodes. The first way is to provide a timeNodeProperty, in which case each event node must hold a property of that name representing a timestamp.

If however, a nextRelationshipType is provided, timeNodeProperty does not have to be provided. In that case, time stamps are generated in an intuitive way depicted in the picture below.

Timestamp generation

Figure "Timestamp generation": The length of the nextRelationshipType path to an event will define its timestamp. So then the first event node in a base node’s event chain, which is connected via a firstRelationshipType relationship from the base node, will have timestamp 0. The next event node, accessed via a nextRelationshipType relationship from the first event node, will have timestamp 1, and so on.

Please note that you can still provide a timeNodeProperty even if nextRelationshipType is used. In that case, the nextRelationshipType will not be used to infer timestamps, but rather which base node an event node belongs to.

Event input features

An event’s feature representation can be impacted by three different input sources. They are independent of each other and can be used both together and separately. The event feature vector will be a sum over all input sources that are enabled.

  1. To enable the Context nodes input source you provide a contextNodeLabel. In this case it’s assumed that there may be outgoing relationships from event nodes to context nodes, i.e., the pattern:
    (:eventNodeLabel)-[:]→(:contextNodeLabel)

  2. To enable the Event feature vector properties input source you provide eventFeatures. This should be the name of a vector property present on all event nodes.

  3. To enable the Categorical event properties input source you provide categoricalEventProperties. This should be the name of a property that holds integers representing categories or lists of such integers. The property must be present on all event nodes.

Time-smoothing

If smoothingRate is large, actually a small amount of smoothing is applied and events that are even a small time apart (enough to be closest to different grid points) will generally have very different embeddings.

If smoothingWindow is 0, the effect is similar to having a large smoothingRate. On the other hand, if smoothingWindow is large in relation to numElapsedTimes, and simultaneously smoothingRate is small, identical events will have similar embeddings even when their timestamps differ a lot.

Output

The output is Snowflake table containing two columns;

  • nodeid — the node’s id according to an input node table

  • embedding — the FastPath embedding of the node

The name and location of the table is given in Input graph configuration.

Example

We assume that we have a Snowflake schema patient_db.general containing the following tables:

  • patient with columns nodeid and output_time

  • encounter with columns nodeid, embedding and time

  • has_encounter with columns sourcenodeid and targetnodeid

These tables represent a graph of medical patient records.

To run the query, there is a required setup of grants for the application, your consumer role and your environment. Please see the Getting started page for more on this.

We also assume that the application name is the default Neo4j_Graph_Analytics. If you chose a different app name during installation, please replace it with that.

The FastPATH algorithm can be run on this graph by executing a statement such as:

CALL Neo4j_Graph_Analytics.graph.fastpath('CPU_X64_M', {
    'project': {
        'defaultTablePrefix': 'patient_db.general',
        'nodeTables': ['patient', 'encounter'],
        'relationshipTables': {
            'has_encounter': {
                'sourceTable': 'patient',
                'targetTable': 'encounter'
            }
        }
    },
    'compute': {
        'baseNodeLabel': 'patient',
        'eventNodeLabel': 'encounter',
        'eventFeatures': 'embedding',
        'outputTimeProperty': 'output_time',
        'timeNodeProperty': 'time',
        'dimension': 32, -- in reality, a higher value is recommended
        'numElapsedTimes': 30,
        'maxElapsedTime': 3650,
        'smoothingRate': 0.9,
        'smoothingWindow': 2,
        'decayFactor': 0.0
    },
    'write': [{
        'nodeLabel': 'patient',
        'outputTable': 'patient_db.general.embeddings'
    }]
});

An example of the output is:

JOB_ID JOB_START JOB_END JOB_RESULT

job_7bed5ef80dc24eec9b0d86d0cbe7baa4

2025-04-28 14:12:46.439

2025-04-28 14:12:57.447

{"node_output_stats": {"patient1": {"row_count": 1000, "table_name": "patient_db.general.embeddings"}}}

The produced node embeddings can be simply explored like so:

SELECT * FROM patient_db.general.embeddings LIMIT 10;

which returns data of the form

NODEID	EMBEDDING
0	    [5.341047,3.848688,-5.973229,-6.814521,4.745209,9.643475,-0.045684,4.210951,-2.004807,4.428759,13.497086,-6.179376,-0.986118,-7.707472,-0.555529,-6.073474,4.083053,1.602635,1.958997,12.649292,-0.603967,10.158164,-0.338394,-0.608493,-4.837356,15.097547,-6.569559,-2.797304,0.338724,6.874722,9.505033,-8.754766]
1	    [-6.724641,-11.390027,-4.453455,4.120544,-2.017518,-2.675358,2.027915,-7.672874,-2.879899,17.374033,11.240053,7.185143,10.593451,-0.032012,6.947659,-29.623604,0.166914,11.507900,-12.466205,-1.891875,-8.164936,7.390674,-10.806652,2.019349,-1.903723,29.574389,1.859032,3.853385,0.566263,-5.038345,4.792415,3.326340]
2	    [-2.920011,-2.048257,-5.708614,-2.373963,-0.244315,2.887959,-23.091557,-0.587305,4.643905,-0.775859,7.227013,-1.861291,-4.952862,-7.039846,-11.932112,-8.986797,-1.304907,2.019032,3.064439,-4.360886,-2.295964,6.218091,0.975781,-3.919339,8.686658,17.961964,-7.780995,-1.146844,-6.269790,20.006420,-1.216949,-3.596368]
3	    [-8.567083,-9.307947,-11.381653,-0.802077,-1.286643,-3.037168,5.494521,-1.814128,-1.623972,-0.085408,-8.678900,6.716533,-3.552778,10.777124,8.609731,2.346340,-6.656413,4.587024,-6.891663,0.081429,-3.976248,-3.998426,-0.445967,4.443917,-7.440049,15.908058,-6.126621,-9.205961,-3.693087,5.798036,-10.521414,25.272419]
4	    [-3.993221,-15.716671,-7.102195,-22.005224,-0.855417,9.282584,-2.253866,-16.234423,24.112732,2.810601,-13.975744,-12.062764,-2.387400,20.061924,-1.067919,12.701017,-4.327281,-17.721920,23.729225,-9.824536,-8.265147,-2.768445,-25.310749,8.047261,13.448681,51.172272,-15.435410,14.756327,-4.342721,11.190716,28.090958,9.425692]
5	    [-13.833055,-3.474358,-29.580704,15.354573,46.242706,13.694808,32.601025,26.757681,-7.090479,57.280357,-0.735581,38.106731,26.968494,3.217232,-23.623219,2.485056,-6.973878,0.877117,-15.995618,55.822735,18.801920,52.809269,-47.148773,39.193768,-4.173650,-12.601570,-14.213085,-21.923597,12.738949,-27.862555,-9.950893,-20.819094]
6	    [13.704840,13.533702,-5.135115,-6.226133,-5.701371,-2.703825,5.956636,3.647114,2.735878,2.559484,-15.129639,9.952298,2.568221,1.664974,-0.048340,-10.921870,-17.725523,1.495797,-9.043520,25.950590,-0.110576,9.694299,-18.446062,-6.852248,6.842557,35.075626,8.634172,7.099162,-5.890983,9.706918,-8.494190,6.583567]
7	    [-2.812654,2.941338,1.067755,-4.293394,-3.086962,-0.469450,3.613479,-2.819084,5.120227,7.050265,-2.466012,-2.169495,0.909174,2.145000,3.638202,-2.498263,-5.502569,-3.852499,5.146134,1.659534,-0.714558,-0.469029,2.176041,-1.869031,-4.940216,3.088835,3.752839,3.714232,-2.807177,-0.698254,-5.012471,2.528090]
8	    [-7.303960,-7.729638,-4.773112,-9.641366,2.609788,-20.279972,6.826665,8.960463,19.718428,3.388449,7.409564,-17.707247,-0.936261,28.070925,-3.749211,-9.148918,16.058552,4.065396,10.297297,9.403608,-10.898782,-10.801480,-25.112486,3.316337,17.157797,12.428331,19.813198,10.889541,-1.723780,-1.651543,-0.261916,9.012156]
9	    [-6.429783,-20.400585,16.191334,-16.042753,9.880158,-5.386495,-10.220762,15.675117,9.438731,-3.742106,10.518373,3.956760,6.789544,-0.625264,-8.893956,1.050145,-15.966327,-6.890019,-2.499497,8.597721,3.010286,3.782031,-10.274326,20.255251,1.046289,17.648533,-6.685064,13.613426,7.041289,13.350776,-3.143279,-5.788550]

Appendix

The following sections provide more detailed information about FastPath for those interested in the inner workings of the algorithm.

How the algorithm works

The below breakdown of the algorithm into steps is conceptual. It does not reflect the actual implementation details such as for example as order of loops, datastructures and more. We will describe how to compute the FastPATH embedding for a base node b which has output time to. The output time is computed according to one of the chosen parameters outputTime or outputTimeProperty. An arbitrary event which belongs to b will be denoted by e, and its time stamp is assumed to be te. At the time to, at which we wish to compute the embedding of b, the elapsed time of e is
tel = to - te, that is, the amount of time between the timestamp of the event and the output time of the corresponding base node.

Note, that events may belong to multiple base nodes which have different output times. This implies that some computations involving an event have to be recomputed for each of its associated base nodes, but luckily these computations are independent of each other.

  1. Preprocess
    Pre-process the input into a graph with base nodes, event nodes and optionally context nodes. Base and event nodes are connected and so are events and context nodes. If time stamps are not given for events, time stamps are generated by traversing event chains per base node.

  2. Filter relevant events
    Out of the events that belong to the base node b, any event which has elapsed time exceeding maxElapsedTime, or any event which occurred at time to or later is removed. The remaining relevant events for b will be called E(b).

  3. Create grid
    Create a grid over time subsequently referred to as the grid

  4. Generate random vectors
    Sparse random vectors are generated independently per input sources below, if they are selected.
    For each time in the grid and input source, generate vectors as follows:

    1. Event feature vector properties: a vector ri,t is generated for each dimension i of the input event feature vectors Image1

    2. Context nodes: for each context node c, generate rc,t

    3. Categorical event properties: for each categorical node property prop and for each value v of that property, initialize a random vector rprop,v,t.
      If the property value for any event is a list of categories [v1, v2, …​], and v is any of these values, we initialize a random vector r~prop,v.

  5. Compute event node pre-embeddings
    The pre-embedding for an event e at time t is denoted by xe,t and initialized to all zeros. For all input sources below that apply, weighted sums of random vectors generated in step 3 are added to xe,t:

    1. Event feature vector properties:
      xe,t += ∑i ri,t * inpi
      where inp is the input vector of e and i ranges over the dimensions of inp.

    2. Context nodes:
      xe,t += ∑c rc,t
      where c ranges over all neighbors of the event e.

    3. Categorical event properties:
      xe,t += ∑prop rprop,e.prop,t
      where prop ranges over all selected categorical event properties.
      Or when e.prop is a list: xe,t = ∑propv in e.prop rprop,v,t

  6. Compute event embeddings
    Image2
    We compute an embedding for e using its earlier derived pre-embeddings at different times. We find the time tcl in the grid which is the closest to the elapsed time tel = to - te. Then adding up to smoothingWindow points in the grid on both sides of tcl, we to obtain a window We of relevant grid times for the event. In the picture, smoothingWindow = 1 and tcl=t4. Therefore We = {t3,t4,t5}. For each t in We we set the weight of the grid point t to be,
    w(t) = exp(-|t - tel| * smoothingRate ) / Ze,
    where Ze is the constant that ensures t in W_e w(t) = 1.
    We can now finally set the embedding of the event at output time to to be
    emb(e|to) = ∑t in W_e w(t) * xe,t. We have hence constructed time-smoothed embeddings by taking an average of a grid of pre-embeddings, or basis vectors if you will. The purpose of the time-smoothing mechanism is to ensure that embeddings change continuously with respect to changes in elapsed time, all other factors being equal.

  7. Compute base node embeddings
    The embedding for b is finally given by
    emb(b) = ∑e in E(b) wb,e * emb(e)
    where
    wb,e = exp(-decayFactor * (to - te)) and E(b) was defined in step 2.

The figure below gives an overview of what influences event embeddings and how they flow to the embedding of a base node, based on the output time and elapsed times.

image3