Detecting Insurance Fraud Using Graph Algorithms with Neo4j
Neo4j helps insurance organizations uncover hidden connections and subtle behavioral similarities across claims, policies, and entities at scale. Neo4j Graph Analytics for Snowflake brings the power of graph embeddings directly into Snowflake, enabling teams to represent complex fraud behavior as vectors and use k-nearest neighbors to identify suspiciously similar claims, accounts, and actors. With 65+ ready-to-use graph algorithms, organizations can surface emerging fraud rings and coordinated activity—without moving data or leaving Snowflake.
Prerequisites
The Native App Neo4j Graph Analytics for Snowflake
What You Will Need:
- A Snowflake account with appropriate access to databases and schemas.
- The Neo4j Graph Analytics application installed from the Snowflake Marketplace. Access the Marketplace via the menu bar on the left-hand side of your screen, as seen below:

What You Will Build:
- A method to compare complex insurance cases to one another and identify those at risk of being fraudulent.
What You Will Learn:
- How to prepare and project your data for graph analytics
- How to create node embeddings to understand the structure of the graph
- How to use the K-Nearest Neighbors algorithm to find highly similar nodes in the graph
- How to read and write directly from and to your Snowflake tables
Loading the Data
This dataset is designed to model and analyze insurance claims for the purpose of identifying fraudulent activity using graph analytics. Given the complexity of the data model, all contextual relationships captured in the graph will be leveraged when comparing claims. This will enable deeper insights beyond isolated data points.
For the purposes of the demo, the database will be named I_DEMO.
Using the CSV insurance_claims_full.csv, found here, we are going to create a new table called insurance_claims_full via the Snowsight data upload method.
Follow the Snowflake documentation on creating a table via 'Load data using the web interface'. In the pop-up:
1. Upload the CSV insurance_claims_full.csv using the browse option.
2. Under Select or create a database and schema, create a database named I_DEMO.
3. Under Select or create a table, click the '+' symbol and create a new table named insurance_claims_full.
A new table named insurance_claims_full will now be created under i_demo.public with the provided CSV.
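Before moving on, it can be worth confirming that the upload landed where expected. A quick sanity check (assuming the table was created in i_demo.public as described above):

```sql
-- Verify the upload: the table should exist and contain rows
SELECT COUNT(*) AS row_count
FROM i_demo.public.insurance_claims_full;
```

If this returns zero rows or an error, revisit the upload step before continuing.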
Import the Notebook
- We've provided a Colab notebook to walk you through each SQL and Python step—no local setup required!
- Download the .ipynb found here, and import the notebook into Snowflake.
Permissions
One of the most useful aspects of Snowflake is the ability to define roles with specific permissions, so that many people can work in the same database without worrying about security. The Neo4j app requires the creation of a few different roles. Before we start granting and creating roles, we need to ensure that you are using ACCOUNTADMIN. Let's do that now:
USE ROLE ACCOUNTADMIN;
Next we can set up the necessary roles, permissions, and resource access
to enable Graph Analytics to operate on the demo data within the
i_demo.public schema (this schema is where the data will be stored
by default).
We will create a consumer role (gds_role) for users and administrators, grant the gds_role and GDS application access to read from and write to tables and views, and ensure the future tables are accessible. We will also provide the application with access to the compute pool and warehouse resources required to run the graph algorithms at scale.
-- Create an account role to manage the GDS application
CREATE ROLE IF NOT EXISTS gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_user TO ROLE gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_admin TO ROLE gds_role;

-- Grant permissions for the application to use the database
GRANT USAGE ON DATABASE i_demo TO APPLICATION neo4j_graph_analytics;
GRANT USAGE ON SCHEMA i_demo.public TO APPLICATION neo4j_graph_analytics;

-- Create a database role to manage table and view access
CREATE DATABASE ROLE IF NOT EXISTS gds_db_role;
GRANT ALL PRIVILEGES ON FUTURE TABLES IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON FUTURE VIEWS IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL VIEWS IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT CREATE TABLE ON SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;

-- Grant the DB role to the application and admin user
GRANT DATABASE ROLE gds_db_role TO APPLICATION neo4j_graph_analytics;
GRANT DATABASE ROLE gds_db_role TO ROLE gds_role;

GRANT USAGE ON DATABASE i_demo TO ROLE gds_role;
GRANT USAGE ON SCHEMA i_demo.public TO ROLE gds_role;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA i_demo.public TO ROLE gds_role;
GRANT CREATE TABLE ON SCHEMA i_demo.public TO ROLE gds_role;
GRANT SELECT, INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA i_demo.public TO ROLE gds_role;
Now we will switch to the role we just created:
USE WAREHOUSE NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE;
USE ROLE gds_role;
USE DATABASE i_demo;
USE SCHEMA public;
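To confirm the switch took effect, you can check the session context with Snowflake's standard context functions:

```sql
-- Confirm the current session context before running anything else
SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();
```

The result should show GDS_ROLE, the Neo4j app warehouse, I_DEMO, and PUBLIC.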
Cleaning Our Data
Duration 5
We need our data to be in a particular format in order to work with Graph Analytics. In general, it should be like so:
For the tables representing nodes: The first column should be called
nodeId, which uniquely identifies each node in the graph.
For the tables representing relationships: We need to have columns
called sourceNodeId and targetNodeId, representing the start and end
nodes of each relationship.
To get ready for Graph Analytics, reshape your tables as follows:
For Nodes
- Policy — one node for each unique policy (policy_number).
- PoliceReportAvailability — represents whether a police report was available.
- Witnesses — represents witness status related to claims.
- VehicleMake — describes the make of the vehicle involved.
- PolicyState — represents the U.S. state where the policy was issued.
- PolicyCSL — describes the liability coverage (Combined Single Limit) of the policy.
- TotalClaimAmountBucket — categorizes claim severity into Low, Medium, or High.
- MonthsAsCustomerBucket — groups policyholders into Short, Medium, or Long tenure.
For Relationships
- (Policy)-[:HAS_POLICE_REPORT]->(PoliceReportAvailability) — links each policy to whether a police report was available.
- (Policy)-[:HAS_WITNESSES]->(Witnesses) — links each policy to its witness status.
- (Policy)-[:INVOLVES_VEHICLE]->(VehicleMake) — connects each policy to the vehicle make.
- (Policy)-[:REGISTERED_IN]->(PolicyState) — connects each policy to the issuing state.
- (Policy)-[:HAS_CSL]->(PolicyCSL) — connects each policy to its CSL coverage level.
- (Policy)-[:HAS_CLAIM_AMOUNT_BUCKET]->(TotalClaimAmountBucket) — links each policy to its total claim amount bucket.
- (Policy)-[:HAS_CUSTOMER_TENURE_BUCKET]->(MonthsAsCustomerBucket) — links each policy to its customer tenure bucket.
CREATE OR REPLACE TABLE node_policies AS
SELECT DISTINCT policy_number
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_police_report_available AS
SELECT DISTINCT police_report_available
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_witnesses AS
SELECT DISTINCT witnesses
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_auto_make AS
SELECT DISTINCT auto_make
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_policy_state AS
SELECT DISTINCT policy_state
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_policy_csl AS
SELECT DISTINCT policy_csl
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE policy_states AS
SELECT ROW_NUMBER() OVER (ORDER BY policy_state) AS state_id, policy_state
FROM (
SELECT DISTINCT policy_state FROM i_demo.public.insurance_claims_full
);
CREATE OR REPLACE TABLE node_total_claim_amount_bucket AS
SELECT DISTINCT
CASE
WHEN total_claim_amount < 40000 THEN 'Low'
WHEN total_claim_amount BETWEEN 40000 AND 70000 THEN 'Medium'
WHEN total_claim_amount > 70000 THEN 'High'
ELSE 'Unknown'
END AS total_claim_amount_bucket
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE node_months_as_customer_bucket AS
SELECT DISTINCT
CASE
WHEN months_as_customer < 100 THEN 'Short (<100m)'
WHEN months_as_customer BETWEEN 100 AND 300 THEN 'Medium (100-300m)'
WHEN months_as_customer > 300 THEN 'Long (>300m)'
ELSE 'Unknown'
END AS months_as_customer_bucket
FROM i_demo.public.insurance_claims_full;
Now, we will merge all the node tables into a single table, all_nodes:
CREATE OR REPLACE TABLE all_nodes AS
SELECT DISTINCT policy_number::STRING AS nodeid FROM node_policies
UNION
SELECT DISTINCT police_report_available::STRING AS nodeid FROM node_police_report_available
UNION
SELECT DISTINCT witnesses::STRING AS nodeid FROM node_witnesses
UNION
SELECT DISTINCT total_claim_amount_bucket::STRING AS nodeid FROM node_total_claim_amount_bucket
UNION
SELECT DISTINCT auto_make::STRING AS nodeid FROM node_auto_make
UNION
SELECT DISTINCT policy_state::STRING AS nodeid FROM node_policy_state
UNION
SELECT DISTINCT policy_csl::STRING AS nodeid FROM node_policy_csl
UNION
SELECT DISTINCT months_as_customer_bucket::STRING AS nodeid FROM node_months_as_customer_bucket;
Below we create the relationship tables:
CREATE OR REPLACE TABLE rel_policy_police_report_available AS
SELECT
policy_number,
police_report_available
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_witnesses AS
SELECT
policy_number,
witnesses
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_auto_make AS
SELECT
policy_number,
auto_make
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_policy_state AS
SELECT
policy_number,
policy_state
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_policy_csl AS
SELECT
policy_number,
policy_csl
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_total_claim_amount_bucket AS
SELECT
policy_number,
CASE
WHEN total_claim_amount < 40000 THEN 'Low'
WHEN total_claim_amount BETWEEN 40000 AND 70000 THEN 'Medium'
WHEN total_claim_amount > 70000 THEN 'High'
ELSE 'Unknown'
END AS total_claim_amount_bucket
FROM i_demo.public.insurance_claims_full;
CREATE OR REPLACE TABLE rel_policy_months_as_customer_bucket AS
SELECT
policy_number,
CASE
WHEN months_as_customer < 100 THEN 'Short (<100m)'
WHEN months_as_customer BETWEEN 100 AND 300 THEN 'Medium (100-300m)'
WHEN months_as_customer > 300 THEN 'Long (>300m)'
ELSE 'Unknown'
END AS months_as_customer_bucket
FROM i_demo.public.insurance_claims_full;
We will merge all relationships into one big relationship table for easier analysis.
CREATE OR REPLACE TABLE all_relationships AS
SELECT policy_number::STRING AS sourcenodeid, police_report_available::STRING AS targetnodeid FROM rel_policy_police_report_available
UNION
SELECT policy_number::STRING, witnesses::STRING FROM rel_policy_witnesses
UNION
SELECT policy_number::STRING, total_claim_amount_bucket::STRING FROM rel_policy_total_claim_amount_bucket
UNION
SELECT policy_number::STRING, auto_make::STRING FROM rel_policy_auto_make
UNION
SELECT policy_number::STRING, policy_state::STRING FROM rel_policy_policy_state
UNION
SELECT policy_number::STRING, policy_csl::STRING FROM rel_policy_policy_csl
UNION
SELECT policy_number::STRING, months_as_customer_bucket::STRING FROM rel_policy_months_as_customer_bucket;
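Before projecting the graph, it is worth a quick sanity check that every relationship endpoint actually exists in the node table — a sketch query; orphaned endpoints would be dropped silently during projection:

```sql
-- Count nodes and relationships, and flag relationship endpoints
-- that are missing from all_nodes (should be zero)
SELECT
    (SELECT COUNT(*) FROM all_nodes) AS node_count,
    (SELECT COUNT(*) FROM all_relationships) AS relationship_count,
    (SELECT COUNT(*)
     FROM all_relationships r
     WHERE NOT EXISTS (SELECT 1 FROM all_nodes n WHERE n.nodeid = r.sourcenodeid)
        OR NOT EXISTS (SELECT 1 FROM all_nodes n WHERE n.nodeid = r.targetnodeid)
    ) AS orphaned_endpoints;
```

If orphaned_endpoints is nonzero, revisit the node tables above before running any algorithms.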
Insurance Claims Embeddings and Similarity
Uncovering complex fraud patterns in insurance claims requires more than tracing obvious links between entities. To detect subtle signals of collusion or anomalous behavior, we turn to structural embeddings — numerical summaries that capture how each claim fits within the broader network.
By transforming the graph structure into a vector space, we can:
- Detect clusters of claims that fulfill similar structural roles
- Surface outliers whose behavior deviates from typical claim patterns
- Flag candidates for further review based on similarity to known fraudulent activity
Our approach leverages two key graph algorithms:
Fast Random Projection (FastRP): This algorithm generates a concise 128-dimensional vector for each claim, reflecting the shape of its surrounding network. Claims embedded in similar structural contexts — such as being part of a fraud ring — will yield similar vectors.
K-Nearest Neighbors (KNN): Once embeddings are in place, KNN finds the most structurally similar claims using cosine similarity. This allows us to identify networks of claims that may not be directly connected but exhibit comparable behavior.
By combining structural embeddings with similarity search, we move beyond surface-level connections and begin to model how fraud operates across the entire claims graph.
You can find more information about these algorithms in our documentation.
Fast Random Projection (FastRP)
Fraud patterns often hide behind complex, indirect relationships. FastRP allows us to translate each claim’s graph position into a compact vector — a structural fingerprint that captures its role in the broader claims network.
These embeddings aren’t directly interpretable, but when two claims have very similar embeddings, it strongly suggests they occupy comparable positions in the network. They may share the same types of connections to incidents, entities, or locations — potentially indicating coordinated behavior or copycat strategies.
We compute embeddings as follows:
CALL Neo4j_Graph_Analytics.graph.fast_rp('CPU_X64_XS', {
'project': {
'defaultTablePrefix': 'i_demo.public',
'nodeTables': ['all_nodes'],
'relationshipTables': {
'all_relationships': {
'sourceTable': 'all_nodes',
'targetTable': 'all_nodes',
'orientation': 'UNDIRECTED'
}
}
},
'compute': {
'mutateProperty': 'embedding',
'embeddingDimension': 128,
'randomSeed': 1234
},
'write': [{
'nodeLabel': 'all_nodes',
'outputTable': 'i_demo.public.all_nodes_fast_rp',
'nodeProperty': 'embedding'
}]
});
We can take a look at our embeddings like so:
SELECT nodeid, embedding FROM i_demo.public.all_nodes_fast_rp;
| NODEID | EMBEDDING |
|---|---|
| 514065 | [ 7.488563656806946e-03, -8.751728385686874e-02, …] |
| 235220 | [ 1.217467859387398e-01, -1.637900769710541e-01, …] |
| 420815 | [ 5.846929550170898e-02, -8.481042832136154e-02, …] |
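If you want to confirm that each embedding has the expected 128 dimensions, you can inspect the array length — assuming the embeddings are written back as a Snowflake ARRAY:

```sql
-- All rows should report the same dimension (128, per the compute config)
SELECT DISTINCT ARRAY_SIZE(embedding) AS embedding_dimension
FROM i_demo.public.all_nodes_fast_rp;
```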
Now that we have generated node embeddings, we can use them in the KNN similarity detection algorithm.
K-Nearest Neighbors (KNN)
With embeddings in place, KNN helps us find structurally similar claims — even if they’re not directly connected. It compares the cosine similarity of embeddings to rank the top matches for each node.
This is especially useful in fraud detection, where collusive claims may appear unrelated on the surface but exhibit parallel structural behavior: similar entity relationships, involvement in incidents with mirrored patterns, or indirect ties to the same clusters of providers.
In the context of cosine similarity in the KNN algorithm, a score of:
- 1.0 means the vectors point in exactly the same direction (perfect similarity).
- 0.0 means orthogonal (no similarity).
- –1.0 means completely opposite.
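As an illustration (not part of the workflow), Snowflake's built-in VECTOR_COSINE_SIMILARITY function can reproduce this score for any pair of embeddings, assuming the embedding column can be cast to a VECTOR type; the node IDs below are example values from the earlier embeddings table:

```sql
-- Hypothetical spot check: cosine similarity between two nodes' embeddings
SELECT VECTOR_COSINE_SIMILARITY(
    a.embedding::VECTOR(FLOAT, 128),
    b.embedding::VECTOR(FLOAT, 128)
) AS cosine_score
FROM i_demo.public.all_nodes_fast_rp a
JOIN i_demo.public.all_nodes_fast_rp b
  ON a.nodeid = '514065' AND b.nodeid = '235220';
```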
CALL Neo4j_Graph_Analytics.graph.knn('CPU_X64_XS', {
'project': {
'defaultTablePrefix': 'i_demo.public',
'nodeTables': [ 'all_nodes_fast_rp' ],
'relationshipTables': {}
},
'compute': {
'nodeProperties': ['EMBEDDING'],
'topK': 3,
'mutateProperty': 'score',
'mutateRelationshipType': 'SIMILAR_TO'
},
'write': [{
'outputTable': 'i_demo.public.claims_knn_similarity',
'sourceLabel': 'all_nodes_fast_rp',
'targetLabel': 'all_nodes_fast_rp',
'relationshipType': 'SIMILAR_TO',
'relationshipProperty': 'score'
}]
});
And now we look at the results:
SELECT
score,
COUNT(*) AS row_count
FROM i_demo.public.claims_knn_similarity
GROUP BY score
ORDER BY score DESC;
| SCORE | ROW_COUNT |
|---|---|
| 1 | 144 |
| 0.9910849541836564 | 2 |
| 0.9909591414045124 | 2 |
| 0.990797909587815 | 2 |
| 0.9907855573714457 | 2 |
| 0.990772737863948 | 2 |
| 0.9907504528107439 | 2 |
In our example dataset, the KNN results show that many nodes have very high structural similarity scores (mostly above 0.92), indicating they occupy very similar positions in the graph. This suggests that these claims or entities may share common patterns or connections, potentially signaling coordinated behavior. High-scoring pairs are good candidates for closer review to detect possible collusion or fraud.
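To pull the most similar pairs for manual review, a simple query against the output table created above works well:

```sql
-- Top structurally similar pairs, strongest matches first
SELECT sourcenodeid, targetnodeid, score
FROM i_demo.public.claims_knn_similarity
ORDER BY score DESC
LIMIT 10;
```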
Finding Additional Fraud
We now have pairwise similarity scores between different claims. Let’s take a look at our original table and find claims that appear to be structurally the same as fraudulent claims. We are looking for claims that satisfy two conditions:
- They are not currently marked as fraudulent
- They have a KNN score of 1.0 with a claim that has already been marked as fraudulent
This can give us an idea of the universe of potentially missed fraudulent claims.
SELECT
icf.policy_number,
icf.fraud_reported
FROM I_DEMO.PUBLIC.INSURANCE_CLAIMS_FULL icf
JOIN I_DEMO.PUBLIC.CLAIMS_KNN_SIMILARITY knn
ON CAST(icf.policy_number AS VARCHAR) = knn.targetnodeid
WHERE knn.score = 1
AND icf.fraud_reported <> 'Y'
AND EXISTS (
SELECT 1
FROM I_DEMO.PUBLIC.INSURANCE_CLAIMS_FULL icf_src
WHERE CAST(icf_src.policy_number AS VARCHAR) = knn.sourcenodeid
AND icf_src.fraud_reported = 'Y'
);
| POLICY_NUMBER | FRAUD_REPORTED |
|---|---|
| 866805 | FALSE |
| 804219 | FALSE |
| 795004 | FALSE |
| 731450 | FALSE |
| 116700 | FALSE |
Conclusion and Resources
Duration 2
In this quickstart, you learned how to bring the power of graph insights into Snowflake using Neo4j Graph Analytics.
What You Learned
By working with an insurance claims dataset, you were able to:
- Set up the Neo4j Graph Analytics application within Snowflake.
- Prepare and project your data into a graph model (policies as nodes, connected to claim attributes such as vehicle make, state, and claim-amount buckets).
- Run node embeddings (FastRP) and K-Nearest Neighbors to capture the structural role of each node in the graph and identify highly similar claims.