Identify Similar Patient Journeys in Healthcare Data

Organizations across healthcare generate vast amounts of patient data, yet traditional analytics often miss the deeper behavioral patterns hidden within it. By combining Snowflake’s scalable data platform with Neo4j Graph Analytics, you can uncover clusters of patients who share similar care pathways, admission patterns, diagnoses, or treatment progressions. This guide walks you through how to model patient journey data as a graph, apply similarity and community-detection algorithms, and write the results back into Snowflake to support population health insights, care management improvements, and proactive intervention strategies.

What You Will Need

The Native App Neo4j Graph Analytics for Snowflake
A Snowflake account with appropriate access to databases and schemas.
Neo4j Graph Analytics application installed from the Snowflake marketplace. Access the marketplace via the menu bar on the left hand side of your screen, as seen below:

marketplace

What You Will Build

A method to identify similar patient journeys

What You Will Learn

How to prepare and project your data for graph analytics
How to use node similarity to identify similar nodes
How to read and write directly from and to your snowflake tables

Load The Data

Dataset overview : This dataset is modelled to design and analyze patients and different procedures that they undergo using graph analytics.

Let’s name our database NEO4J_PATIENT_DB. Using the CSVs found here, We are going to add two new tables:

One called PROCEDURES based on the Procedures.csv
One called PATIENTS based on Patients.csv

Follow the steps found here to load in your data.

Set Up

Import The Notebook

We’ve provided a notebook to walk you through each SQL and Python step—no local setup required!
Download the .ipynb found here, and import the notebook into snowflake.

Permissions

Next let’s set up the necessary roles, permissions, and resource access to enable Graph Analytics to operate on data within the NEO4J_PATIENT_DB.PUBLIC.SCHEMA. It creates a consumer role (gds_user_role) for users and administrators, grants the Neo4j Graph Analytics application access to read from and write to tables and views, and ensures that future tables are accessible.

It also provides the application with access to the required compute pool and warehouse resources needed to run graph algorithms at scale.

-- Use a role with the required privileges
USE ROLE ACCOUNTADMIN;

-- Create a consumer role for users of the Graph Analytics application
CREATE ROLE IF NOT EXISTS MY_CONSUMER_ROLE;
GRANT APPLICATION ROLE Neo4j_Graph_Analytics.app_user TO ROLE MY_CONSUMER_ROLE;
SET MY_USER = (SELECT CURRENT_USER());
GRANT ROLE MY_CONSUMER_ROLE TO USER IDENTIFIER($MY_USER);

USE SCHEMA NEO4J_PATIENT_DB.PUBLIC;
CREATE TABLE NODES (nodeId Number);
INSERT INTO NODES VALUES (1), (2), (3), (4), (5), (6);
CREATE TABLE RELATIONSHIPS (sourceNodeId Number, targetNodeId Number);
INSERT INTO RELATIONSHIPS VALUES (1, 2), (2, 3), (4, 5), (5, 6);

-- Grants needed for the app to read consumer data stored in tables and views, using a database role
USE DATABASE NEO4J_PATIENT_DB;
CREATE DATABASE ROLE IF NOT EXISTS MY_DB_ROLE;
GRANT USAGE ON DATABASE NEO4J_PATIENT_DB TO DATABASE ROLE MY_DB_ROLE;
GRANT USAGE ON SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
-- Future tables also include tables that are created by the application itself.
-- This is useful as many use-cases require running algorithms in a sequence and using the output of a prior algorithm as input.
GRANT SELECT ON FUTURE TABLES IN SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT CREATE TABLE ON SCHEMA NEO4J_PATIENT_DB.PUBLIC TO DATABASE ROLE MY_DB_ROLE;
GRANT DATABASE ROLE MY_DB_ROLE TO APPLICATION Neo4j_Graph_Analytics;

-- Ensure the consumer role has access to tables created by the application
GRANT USAGE ON DATABASE NEO4J_PATIENT_DB TO ROLE MY_CONSUMER_ROLE;
GRANT USAGE ON SCHEMA NEO4J_PATIENT_DB.PUBLIC TO ROLE MY_CONSUMER_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA NEO4J_PATIENT_DB.PUBLIC TO ROLE MY_CONSUMER_ROLE;

-- Use the consumer role to run the algorithm and inspect the output
USE ROLE MY_CONSUMER_ROLE;

Clean Our Data

We need our data to be in a particular format in order to work with Graph Analytics. In general it should be like so:

For The Table Representing Nodes:

The first column should be called nodeId, which represents the ids for the each node in our graph

For The table Representing Relationships:

We need to have columns called sourceNodeId and targetNodeId. These will tell Graph Analytics the direction of the transaction, which in this case means:

Who is the patient (sourceNodeId) and
What is the procedure (targetNodeId)

Let’s take a look at the starting point for our data. We have a table for patients and a table for procedures:

SELECT * FROM NEO4J_PATIENT_DB.PUBLIC.PROCEDURES LIMIT 10;

SELECT * FROM NEO4J_PATIENT_DB.PUBLIC.PATIENTS LIMIT 10;

We are then going to clean this up into two tables that just have the nodeids for both patient and procedure:

CREATE OR REPLACE TABLE NEO4J_PATIENT_DB.PUBLIC.PATIENT_NODE_MAPPING (nodeId) AS
SELECT DISTINCT p.ID from NEO4J_PATIENT_DB.PUBLIC.PATIENTS p;

CREATE OR REPLACE TABLE NEO4J_PATIENT_DB.PUBLIC.PROCEDURE_NODE_MAPPING (nodeId) AS
SELECT DISTINCT p.code from NEO4J_PATIENT_DB.PUBLIC.PROCEDURES p;

In order to keep this managable, we are just going to look at patients (and what procedures they underwent) in the context of kidney disease. So first we will filter down patients to only include those with kidney disease:

// create a subset of patients that have had any of the 4 kidney disease codes
CREATE OR REPLACE VIEW KidneyPatients_vw (nodeId) AS
    SELECT DISTINCT PATIENT_NODE_MAPPING.NODEID as nodeId
    FROM PROCEDURES
            JOIN PATIENT_NODE_MAPPING ON PATIENT_NODE_MAPPING.NODEID = PROCEDURES.PATIENT
    WHERE PROCEDURES.REASONCODE IN (431857002,46177005,161665007,698306007)
;

Then we will only look at the procedures that those kidney patients have undergone:

// There are ~400K procedures - it is doubtful that the kidney patients even have used a small
// fraction of those.  To reduce GDS memory and speed algorithm execution, we want to load
// only those procedures that kidney patients have had.
CREATE OR REPLACE VIEW KidneyPatientProcedures_vw (nodeId) AS
    SELECT DISTINCT PROCEDURE_NODE_MAPPING.NODEID as nodeId
    FROM PROCEDURES
        JOIN PROCEDURE_NODE_MAPPING ON PROCEDURE_NODE_MAPPING.nodeId = PROCEDURES.CODE
        JOIN KIDNEYPATIENTS_VW ON PATIENT = PROCEDURES.PATIENT;

// create the relationship view of kidney patients to the procedures they have had
CREATE OR REPLACE VIEW KidneyPatientProcedure_relationship_vw (sourceNodeId, targetNodeId) AS
    SELECT DISTINCT PATIENT_NODE_MAPPING.NODEID as sourceNodeId, PROCEDURE_NODE_MAPPING.NODEID as targetNodeId
    FROM PATIENT_NODE_MAPPING
         JOIN PROCEDURES ON PROCEDURES.PATIENT = PATIENT_NODE_MAPPING.NODEID
         JOIN PROCEDURE_NODE_MAPPING ON PROCEDURE_NODE_MAPPING.NODEID = PROCEDURES.CODE;

Visualize Your Graph

At this point, you may want to visualize your graph to get a better understanding of how everything fits together. Specifically, you may be interested in better understanding your results. Why exactly are any given two patients considered similar?

Let’s start by filtering down our data to two patients who are considered very similar. They have the following patient ids:

a5814a17-303e-1d23-64d3-dd4b54170d16
87069a85-19bf-7d81-0cfe-609b3427d096

CREATE OR REPLACE VIEW KidneyPatient_viz_vw (nodeId, title, label) AS
SELECT
    k.nodeId,
    p.first AS title,
    'blue' as label
FROM KidneyPatients_vw k
LEFT JOIN patients p
  ON k.nodeId = p.id
WHERE k.nodeId IN (
  'a5814a17-303e-1d23-64d3-dd4b54170d16',
  '87069a85-19bf-7d81-0cfe-609b3427d096'
);

-- this represents the procedures those patients underwent (and will be our relationship table
-- for the below visualization)
CREATE OR REPLACE VIEW NEO4J_PATIENT_DB.PUBLIC.procedures_patients_vw AS
SELECT DISTINCT
    p.patient as sourcenodeid,
    p.reasoncode as targetnodeid
FROM NEO4J_PATIENT_DB.PUBLIC.PROCEDURES p
JOIN KidneyPatient_viz_vw k
  ON CAST(p.PATIENT AS STRING) = CAST(k.nodeId AS STRING);

-- now we look at the procedures in our example
CREATE OR REPLACE VIEW procedures_viz_vw (nodeId, title, label) AS
SELECT DISTINCT
    pp.targetnodeid AS nodeId,
    p.reasondescription AS caption,
    'red' as label
FROM procedures_patients_vw pp
LEFT JOIN procedures p
    ON pp.targetnodeid = p.reasoncode
WHERE pp.targetnodeid IS NOT NULL
  AND TRIM(pp.targetnodeid) <> '';

Now we can use the neo4j_viz python package to create the actual visualization. You can learn more about how it works here, but for this example, we are just going to customize two things. First, we will use the “label” property of our nodes (which we defined in the SQL query above) to set the colors. Like so:

viz_graph.color_nodes(property='LABEL', override=True)

Next, we will use the “title” property to set captions. We created that column above as well:

for node in viz_graph.nodes:
    node.caption = str(node.properties["TITLE"])

Let’s see it all in action. You will notice that we are using the same projection syntax to create our visualization

from neo4j_viz.snowflake import from_snowflake
from snowflake.snowpark.context import get_active_session

session = get_active_session()

viz_graph = from_snowflake(
    session,
    {
    'nodeTables': ['NEO4J_PATIENT_DB.public.KidneyPatient_viz_vw',
                   'NEO4J_PATIENT_DB.public.procedures_viz_vw'
    ],
    'relationshipTables': {
      'NEO4J_PATIENT_DB.public.procedures_patients_vw': {
        'sourceTable': 'NEO4J_PATIENT_DB.public.KidneyPatient_viz_vw',
        'targetTable': 'NEO4J_PATIENT_DB.public.procedures_viz_vw'
      }
    }
    }
)

viz_graph.color_nodes(property='LABEL', override=True)

for node in viz_graph.nodes:
    node.caption = str(node.properties["TITLE"])

html_object = viz_graph.render()

import streamlit.components.v1 as components

components.html(html_object.data, height=600)

Find Similar Patients

Now we are finally at the step where we create a projection, run our algorithms, and write back to snowflake. We will run louvain to determine communities within our data. Louvain identifies communities by grouping together nodes that have more connections to each other than to nodes outside the group.

You can find more information about writing this function in our documentation.

Broadly, you will need a few things:

A table for nodes, which for us is neo4j_patient_db.public.KidneyPatients_vw
A table for relationships, which for us is neo4j_patient_db.public.KidneyPatientProcedure_relationship_vw
The size of the compute pool you would like to use, which would be CPU_X64_XS
A table to output results, which would be neo4j_patient_db.public.PATIENT_PROCEDURE_SIMILARITY
A node label for our nodes, which would be KidneyPatients_vw

CALL neo4j_graph_analytics.graph.node_similarity('CPU_X64_L', {
  'project': {
    'defaultTablePrefix': 'neo4j_patient_db.public',
    'nodeTables': ['KidneyPatients_vw','KidneyPatientProcedures_vw'],
    'relationshipTables': {
      'KidneyPatientProcedure_relationship_vw': {
        'sourceTable': 'KidneyPatients_vw',
        'targetTable': 'KidneyPatientProcedures_vw'
      }
    }
  },
  'compute': { 'topK': 10,
                'similarityCutoff': 0.3,
                'similarityMetric': 'JACCARD'
            },
  'write': [
    {
    'sourceLabel': 'KidneyPatients_vw',
    'targetLabel': 'KidneyPatients_vw',
    'relationshipProperty': 'similarity',
    'outputTable':  'neo4j_patient_db.public.PATIENT_PROCEDURE_SIMILARITY'
    }
  ]
});

Let’s take a look at the results!

SELECT SOURCENODEID, TARGETNODEID, SIMILARITY
FROM NEO4J_PATIENT_DB.PUBLIC.PATIENT_PROCEDURE_SIMILARITY
LIMIT 10;

SOURCENODEID	TARGETNODEID	SIMILARITY
00094d82-80de-4444-69cf-cf26fd56b1ac	87069a85-19bf-7d81-0cfe-609b3427d096	0.7555555556
00094d82-80de-4444-69cf-cf26fd56b1ac	e4dbe477-7ad6-db78-79eb-87a82b448787	0.8536585366
00094d82-80de-4444-69cf-cf26fd56b1ac	fe12b9bd-82c4-de9e-f5fb-1a292dc5020d	0.8181818182
00094d82-80de-4444-69cf-cf26fd56b1ac	a08d93d4-8fe9-7cb5-cca3-ecaed719f167	0.875
00094d82-80de-4444-69cf-cf26fd56b1ac	41078972-c38f-d0e5-b835-5f1ec4b69e58	0.9487179487
00094d82-80de-4444-69cf-cf26fd56b1ac	a1ab30fc-76db-e043-0368-dde512b707d6	0.85
00094d82-80de-4444-69cf-cf26fd56b1ac	68d9406d-f0f8-a93a-377b-d9ff4be171a5	0.8
00094d82-80de-4444-69cf-cf26fd56b1ac	9048c50d-cf90-be81-5352-5f11335b6533	0.8536585366
00094d82-80de-4444-69cf-cf26fd56b1ac	752404e8-efac-b9bb-c9da-e64c5fe3948f	0.8
00094d82-80de-4444-69cf-cf26fd56b1ac	e57de607-b7c4-d944-9e72-7d786a8a7058	0.85

SOURCENODEID

TARGETNODEID

SIMILARITY