Detecting Insurance Fraud Using Graph Algorithms with Neo4j

Neo4j helps insurance organizations uncover hidden connections and subtle behavioral similarities across claims, policies, and entities at scale. Neo4j Graph Analytics for Snowflake brings the power of graph embeddings directly into Snowflake, enabling teams to represent complex fraud behavior as vectors and use k-nearest neighbors to identify suspiciously similar claims, accounts, and actors. With 65+ ready-to-use graph algorithms, organizations can surface emerging fraud rings and coordinated activity—without moving data or leaving Snowflake.

Prerequisites

The Native App Neo4j Graph Analytics for Snowflake

What You Will Need:

  • A Snowflake account with appropriate access to databases and schemas.

  • The Neo4j Graph Analytics application installed from the Snowflake Marketplace. You can access the Marketplace via the menu bar on the left-hand side of your screen.

What You Will Build:

  • A method to compare complex insurance cases to one another and identify the ones that are at risk of being fraudulent.

What You Will Learn:

  • How to prepare and project your data for graph analytics

  • How to create node embeddings to understand the structure of the graph

  • How to use K-nearest neighbors algorithm to find highly similar nodes in the graph

  • How to read and write directly from and to your Snowflake tables

Loading the Data

This dataset is designed to model and analyze insurance claims for the purpose of identifying fraudulent activity using graph analytics. Given the complexity of the data model, all contextual relationships captured in the graph will be leveraged when comparing claims. This will enable deeper insights beyond isolated data points.

For the purposes of this demo, the database will be named I_DEMO. Using the CSV file insurance_claims_full.csv, found here, we will create a new table called insurance_claims_full via the Snowsight data upload workflow.

Follow the Snowflake documentation on creating a table under ‘Load data using the web interface’.

In the pop-up:

  1. Upload the CSV insurance_claims_full.csv using the browse option.

  2. Under Select or create a database and schema, create a database named I_DEMO.

  3. Under Select or create a table, click the ‘+’ symbol and create a new table named insurance_claims_full.

A new table named insurance_claims_full will now exist under i_demo.public, populated from the provided CSV.

Import the Notebook

  • We’ve provided a Colab notebook to walk you through each SQL and Python step—no local setup required!

  • Download the .ipynb found here, and import the notebook into Snowflake.

Permissions

One of the most useful aspects of Snowflake is the ability to define roles with specific permissions, so that many people can work in the same database without compromising security. The Neo4j app requires the creation of a few different roles. Before we start creating and granting roles, we need to make sure we are using ACCOUNTADMIN, which has the privileges to do so. Let's do that now:

USE ROLE ACCOUNTADMIN;

Next we can set up the necessary roles, permissions, and resource access to enable Graph Analytics to operate on the demo data within the i_demo.public schema (this schema is where the data will be stored by default).

We will create a consumer role (gds_role) for users and administrators, grant both that role and the Graph Analytics application access to read from and write to tables and views, and make sure that future tables remain accessible. We will also give the application access to the compute pool and warehouse resources required to run the graph algorithms at scale.

-- Create an account role to manage the GDS application
CREATE ROLE IF NOT EXISTS gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_user TO ROLE gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_admin TO ROLE gds_role;

--Grant permissions for the application to use the database
GRANT USAGE ON DATABASE i_demo TO APPLICATION neo4j_graph_analytics;
GRANT USAGE ON SCHEMA i_demo.public TO APPLICATION neo4j_graph_analytics;

--Create a database role to manage table and view access
CREATE DATABASE ROLE IF NOT EXISTS gds_db_role;

GRANT ALL PRIVILEGES ON FUTURE TABLES IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;

GRANT ALL PRIVILEGES ON FUTURE VIEWS IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL VIEWS IN SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;

GRANT CREATE TABLE ON SCHEMA i_demo.public TO DATABASE ROLE gds_db_role;


--Grant the DB role to the application and admin user
GRANT DATABASE ROLE gds_db_role TO APPLICATION neo4j_graph_analytics;
GRANT DATABASE ROLE gds_db_role TO ROLE gds_role;

GRANT USAGE ON DATABASE I_DEMO TO ROLE GDS_ROLE;
GRANT USAGE ON SCHEMA I_DEMO.PUBLIC TO ROLE GDS_ROLE;

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA I_DEMO.PUBLIC TO ROLE GDS_ROLE;
GRANT CREATE TABLE ON SCHEMA I_DEMO.PUBLIC TO ROLE GDS_ROLE;
GRANT SELECT, INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA I_DEMO.PUBLIC TO ROLE GDS_ROLE;

Now we will switch to the role we just created:

use warehouse NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE;
use role gds_role;
use database i_demo;
use schema public;

Cleaning Our Data

We need our data to be in a particular format in order to work with Graph Analytics. In general, it should be like so:

For the tables representing nodes: The first column should be called nodeId, which uniquely identifies each node in the graph.

For the tables representing relationships: We need to have columns called sourceNodeId and targetNodeId, representing the start and end nodes of each relationship.
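These column conventions are easy to check programmatically. Below is a minimal sketch (the helper names and sample column lists are illustrative, not part of the Graph Analytics API) of validating that a table follows the expected layout:

```python
def check_node_table(columns):
    # A node table must expose a unique identifier column named nodeId.
    return "nodeId" in columns

def check_relationship_table(columns):
    # A relationship table must name its endpoints sourceNodeId / targetNodeId.
    return {"sourceNodeId", "targetNodeId"}.issubset(columns)

print(check_node_table(["nodeId"]))                                # True
print(check_relationship_table(["sourceNodeId", "targetNodeId"]))  # True
print(check_relationship_table(["from_id", "to_id"]))              # False: wrong names
```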

To get ready for Graph Analytics, reshape your tables as follows:

For Nodes

  • Policy — one node for each unique policy (policy_number).

  • PoliceReportAvailability — represents whether a police report was available.

  • Witnesses — represents witness status related to claims.

  • VehicleMake — describes the make of the vehicle involved.

  • PolicyState — represents the U.S. state where the policy was issued.

  • PolicyCSL — describes the liability coverage (Combined Single Limit) of the policy.

  • TotalClaimAmountBucket — categorizes claim severity into Low, Medium, or High.

  • MonthsAsCustomerBucket — groups policyholders into Short, Medium, or Long tenure.

For Relationships

  • (Policy)-[:HAS_POLICE_REPORT]->(PoliceReportAvailability) — links each policy to whether a police report was available.

  • (Policy)-[:HAS_WITNESSES]->(Witnesses) — links each policy to its witness status.

  • (Policy)-[:INVOLVES_VEHICLE]->(VehicleMake) — connects each policy to the vehicle make.

  • (Policy)-[:REGISTERED_IN]->(PolicyState) — connects each policy to the issuing state.

  • (Policy)-[:HAS_CSL]->(PolicyCSL) — connects each policy to its CSL coverage level.

  • (Policy)-[:HAS_CLAIM_AMOUNT_BUCKET]->(TotalClaimAmountBucket) — links each policy to its total claim amount bucket.

  • (Policy)-[:HAS_CUSTOMER_TENURE_BUCKET]->(MonthsAsCustomerBucket) — links each policy to its customer tenure bucket.

CREATE OR REPLACE TABLE node_policies AS
SELECT DISTINCT policy_number
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE node_police_report_available AS
SELECT DISTINCT police_report_available
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE node_witnesses AS
SELECT DISTINCT witnesses
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE node_auto_make AS
SELECT DISTINCT auto_make
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE node_policy_state AS
SELECT DISTINCT policy_state
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE node_policy_csl AS
SELECT DISTINCT policy_csl
FROM i_demo.public.insurance_claims_full;


CREATE OR REPLACE TABLE policy_states AS
SELECT ROW_NUMBER() OVER (ORDER BY policy_state) AS state_id, policy_state
FROM (
  SELECT DISTINCT policy_state FROM i_demo.public.insurance_claims_full
);

CREATE OR REPLACE TABLE node_total_claim_amount_bucket AS
SELECT DISTINCT
  CASE
    WHEN total_claim_amount < 40000 THEN 'Low'
    WHEN total_claim_amount BETWEEN 40000 AND 70000 THEN 'Medium'
    WHEN total_claim_amount > 70000 THEN 'High'
    ELSE 'Unknown'
  END AS total_claim_amount_bucket
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE node_months_as_customer_bucket AS
SELECT DISTINCT
  CASE
    WHEN months_as_customer < 100 THEN 'Short (<100m)'
    WHEN months_as_customer BETWEEN 100 AND 300 THEN 'Medium (100-300m)'
    WHEN months_as_customer > 300 THEN 'Long (>300m)'
    ELSE 'Unknown'
  END AS months_as_customer_bucket
FROM i_demo.public.insurance_claims_full;
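The bucketing rules in the two CASE expressions above can be spot-checked with a small Python sketch that mirrors them (the function names are ours; note that SQL's BETWEEN is inclusive at both ends, so 40000 and 70000 both fall in Medium):

```python
def claim_amount_bucket(total_claim_amount):
    # Mirrors the SQL CASE: <40000 Low, 40000-70000 Medium, >70000 High.
    if total_claim_amount < 40000:
        return "Low"
    if total_claim_amount <= 70000:
        return "Medium"
    return "High"

def tenure_bucket(months_as_customer):
    # Mirrors the SQL CASE: <100 Short, 100-300 Medium, >300 Long.
    if months_as_customer < 100:
        return "Short (<100m)"
    if months_as_customer <= 300:
        return "Medium (100-300m)"
    return "Long (>300m)"

print(claim_amount_bucket(40000))  # Medium: boundary values are inclusive
print(tenure_bucket(300))          # Medium (100-300m)
```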

Now, we will merge all the node tables into a single table, all_nodes:

CREATE OR REPLACE TABLE all_nodes AS
SELECT DISTINCT policy_number::STRING AS nodeid FROM node_policies
UNION
SELECT DISTINCT police_report_available::STRING AS nodeid FROM node_police_report_available
UNION
SELECT DISTINCT witnesses::STRING AS nodeid FROM node_witnesses
UNION
SELECT DISTINCT total_claim_amount_bucket::STRING AS nodeid FROM node_total_claim_amount_bucket
UNION
SELECT DISTINCT auto_make::STRING AS nodeid FROM node_auto_make
UNION
SELECT DISTINCT policy_state::STRING AS nodeid FROM node_policy_state
UNION
SELECT DISTINCT policy_csl::STRING AS nodeid FROM node_policy_csl
UNION
SELECT DISTINCT months_as_customer_bucket::STRING AS nodeid FROM node_months_as_customer_bucket;

Below we create the relationship tables:

CREATE OR REPLACE TABLE rel_policy_police_report_available AS
SELECT
  policy_number,
  police_report_available
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_witnesses AS
SELECT
  policy_number,
  witnesses
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_auto_make AS
SELECT
  policy_number,
  auto_make
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_policy_state AS
SELECT
  policy_number,
  policy_state
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_policy_csl AS
SELECT
  policy_number,
  policy_csl
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_total_claim_amount_bucket AS
SELECT
  policy_number,
  CASE
    WHEN total_claim_amount < 40000 THEN 'Low'
    WHEN total_claim_amount BETWEEN 40000 AND 70000 THEN 'Medium'
    WHEN total_claim_amount > 70000 THEN 'High'
    ELSE 'Unknown'
  END AS total_claim_amount_bucket
FROM i_demo.public.insurance_claims_full;

CREATE OR REPLACE TABLE rel_policy_months_as_customer_bucket AS
SELECT
  policy_number,
  CASE
    WHEN months_as_customer < 100 THEN 'Short (<100m)'
    WHEN months_as_customer BETWEEN 100 AND 300 THEN 'Medium (100-300m)'
    WHEN months_as_customer > 300 THEN 'Long (>300m)'
    ELSE 'Unknown'
  END AS months_as_customer_bucket
FROM i_demo.public.insurance_claims_full;

We will merge all relationships into one big relationship table for easier analysis.

CREATE OR REPLACE TABLE all_relationships AS
SELECT policy_number::STRING AS sourcenodeid, police_report_available::STRING AS targetnodeid
FROM rel_policy_police_report_available
UNION
SELECT policy_number::STRING, witnesses::STRING
FROM rel_policy_witnesses
UNION
SELECT policy_number::STRING, total_claim_amount_bucket::STRING
FROM rel_policy_total_claim_amount_bucket
UNION
SELECT policy_number::STRING, auto_make::STRING
FROM rel_policy_auto_make
UNION
SELECT policy_number::STRING, policy_state::STRING
FROM rel_policy_policy_state
UNION
SELECT policy_number::STRING, policy_csl::STRING
FROM rel_policy_policy_csl
UNION
SELECT policy_number::STRING, months_as_customer_bucket::STRING
FROM rel_policy_months_as_customer_bucket;

Insurance Claims Embeddings and Similarity

Uncovering complex fraud patterns in insurance claims requires more than tracing obvious links between entities. To detect subtle signals of collusion or anomalous behavior, we turn to structural embeddings — numerical summaries that capture how each claim fits within the broader network.

By transforming the graph structure into a vector space, we can:

  • Detect clusters of claims that fulfill similar structural roles

  • Surface outliers whose behavior deviates from typical claim patterns

  • Flag candidates for further review based on similarity to known fraudulent activity

Our approach leverages two key graph algorithms:

Fast Random Projection (FastRP): This algorithm generates a compact 128-dimensional vector for each claim, reflecting the shape of its surrounding network. Claims embedded in similar structural contexts (such as being part of a fraud ring) will yield similar vectors.

K-Nearest Neighbors (KNN): Once embeddings are in place, KNN finds the most structurally similar claims using cosine similarity. This allows us to identify networks of claims that may not be directly connected but exhibit comparable behavior.

By combining structural embeddings with similarity search, we move beyond surface-level connections and begin to model how fraud operates across the entire claims graph.

You can find more information about these algorithms in our documentation.

Fast Random Projection (FastRP)

Fraud patterns often hide behind complex, indirect relationships. FastRP allows us to translate each claim’s graph position into a compact vector — a structural fingerprint that captures its role in the broader claims network.

These embeddings aren’t directly interpretable, but when two claims have very similar embeddings, it strongly suggests they occupy comparable positions in the network. They may share the same types of connections to incidents, entities, or locations — potentially indicating coordinated behavior or copycat strategies.
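The intuition can be illustrated with a toy sketch. This is not the actual FastRP algorithm (which uses iterative sparse random projections); it only shows why nodes with overlapping neighborhoods end up with similar vectors. All names and values below are illustrative:

```python
import random

random.seed(42)
_random_vectors = {}

def _rv(node, dim=8):
    # Each node gets one fixed random vector, reused on every call.
    if node not in _random_vectors:
        _random_vectors[node] = [random.uniform(-1, 1) for _ in range(dim)]
    return _random_vectors[node]

def toy_structural_embedding(node, neighbors):
    # Embed a node as the average of its neighbors' random vectors:
    # heavily overlapping neighborhoods yield nearly identical embeddings.
    vecs = [_rv(n) for n in neighbors[node]]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Two policies sharing every attribute node get identical toy embeddings.
neighbors = {
    "policy_A": ["report:YES", "witnesses:2", "make:Honda", "state:OH"],
    "policy_B": ["report:YES", "witnesses:2", "make:Honda", "state:OH"],
    "policy_C": ["report:NO", "witnesses:0", "make:Audi", "state:IL"],
}
emb_a = toy_structural_embedding("policy_A", neighbors)
emb_b = toy_structural_embedding("policy_B", neighbors)
emb_c = toy_structural_embedding("policy_C", neighbors)
print(emb_a == emb_b)  # True: identical neighborhoods
print(emb_a == emb_c)  # False: different neighborhoods
```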

We compute embeddings as follows:

CALL Neo4j_Graph_Analytics.graph.fast_rp('CPU_X64_XS', {
  'project': {
    'defaultTablePrefix': 'i_demo.public',
    'nodeTables': ['all_nodes'],
    'relationshipTables': {
      'all_relationships': {
        'sourceTable': 'all_nodes',
        'targetTable': 'all_nodes',
        'orientation': 'UNDIRECTED'
      }
    }
  },
  'compute': {
    'mutateProperty': 'embedding',
    'embeddingDimension': 128,
    'randomSeed': 1234
  },
  'write': [{
    'nodeLabel': 'all_nodes',
    'outputTable': 'i_demo.public.all_nodes_fast_rp',
    'nodeProperty': 'embedding'
  }]
});

We can take a look at our embeddings like so:

SELECT
  nodeid,
  embedding
FROM i_demo.public.all_nodes_fast_rp;
NODEID   EMBEDDING
514065   [ 7.488563656806946e-03, -8.751728385686874e-02, …]
235220   [ 1.217467859387398e-01, -1.637900769710541e-01, …]
420815   [ 5.846929550170898e-02, -8.481042832136154e-02, …]

Now that we have generated node embeddings, we can use them in the KNN similarity algorithm.

K-Nearest Neighbors (KNN)

With embeddings in place, KNN helps us find structurally similar claims — even if they’re not directly connected. It compares the cosine similarity of embeddings to rank the top matches for each node.

This is especially useful in fraud detection, where collusive claims may appear unrelated on the surface but exhibit parallel structural behavior: similar entity relationships, involvement in incidents with mirrored patterns, or indirect ties to the same clusters of providers.

In the context of cosine similarity in the KNN algorithm, a score of:

  • 1.0 means the vectors point in exactly the same direction (perfect similarity).

  • 0.0 means orthogonal (no similarity).

  • –1.0 means completely opposite.
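Cosine similarity itself is simple to compute. A minimal sketch with toy vectors illustrating the three reference scores:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # ≈ 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0    (orthogonal)
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # ≈ -1.0 (opposite)
```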

CALL Neo4j_Graph_Analytics.graph.knn('CPU_X64_XS', {
  'project': {
    'defaultTablePrefix': 'i_demo.public',
    'nodeTables': [ 'all_nodes_fast_rp' ],
    'relationshipTables': {}
  },
  'compute': {
    'nodeProperties': ['EMBEDDING'],
    'topK': 3,
    'mutateProperty': 'score',
    'mutateRelationshipType': 'SIMILAR_TO'
  },
  'write': [{
    'outputTable': 'i_demo.public.claims_knn_similarity',
    'sourceLabel': 'all_nodes_fast_rp',
    'targetLabel': 'all_nodes_fast_rp',
    'relationshipType': 'SIMILAR_TO',
    'relationshipProperty': 'score'
  }]
});

And now we look at the results:

SELECT
    score,
    COUNT(*) AS row_count
FROM i_demo.public.claims_knn_similarity
GROUP BY score
ORDER BY score DESC;

SCORE                ROW_COUNT
1                    144
0.9910849541836564   2
0.9909591414045124   2
0.990797909587815    2
0.9907855573714457   2
0.990772737863948    2
0.9907504528107439   2
In our example dataset, the KNN results show that many nodes have very high structural similarity scores (mostly above 0.92), indicating they occupy very similar positions in the graph. This suggests that these claims or entities may share common patterns or connections, potentially signaling coordinated behavior. High-scoring pairs are good candidates for closer review to detect possible collusion or fraud.

Finding Additional Fraud

We now have pairwise similarity scores between different claims. Let’s take a look at our original table and find claims that appear to be structurally the same as fraudulent claims. We are looking for claims that satisfy two conditions:

  1. They are not currently marked as fraudulent

  2. They have a KNN similarity score of 1 with a claim that has already been marked as fraudulent

This can give us an idea of the universe of potentially missed fraudulent claims.
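The same lookup logic can be sketched in Python over toy data before running it in SQL (the policy numbers, flags, and scores below are illustrative):

```python
# Fraud flags per policy, as in insurance_claims_full.fraud_reported.
fraud_reported = {"866805": "N", "804219": "N", "112233": "Y"}

# (sourcenodeid, targetnodeid, score) rows, as in claims_knn_similarity.
knn_pairs = [
    ("112233", "866805", 1.0),   # flagged claim -> unflagged, perfect match
    ("112233", "804219", 0.97),  # below the score threshold
    ("866805", "804219", 1.0),   # source is not flagged as fraud
]

suspects = sorted({
    target
    for source, target, score in knn_pairs
    if score == 1.0
    and fraud_reported.get(target) != "Y"  # not currently flagged...
    and fraud_reported.get(source) == "Y"  # ...but matches a flagged claim
})
print(suspects)  # ['866805']
```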

SELECT
  icf.policy_number,
  icf.fraud_reported
FROM I_DEMO.PUBLIC.INSURANCE_CLAIMS_FULL icf
JOIN I_DEMO.PUBLIC.CLAIMS_KNN_SIMILARITY knn
  ON CAST(icf.policy_number AS VARCHAR) = knn.targetnodeid
WHERE knn.score = 1
  AND icf.fraud_reported <> 'Y'
  AND EXISTS (
    SELECT 1
    FROM I_DEMO.PUBLIC.INSURANCE_CLAIMS_FULL icf_src
    WHERE CAST(icf_src.policy_number AS VARCHAR) = knn.sourcenodeid
      AND icf_src.fraud_reported = 'Y'
  );
POLICY_NUMBER   FRAUD_REPORTED
866805          FALSE
804219          FALSE
795004          FALSE
731450          FALSE
116700          FALSE

Conclusion and Resources

In this quickstart, you learned how to bring the power of graph insights into Snowflake using Neo4j Graph Analytics.

What You Learned

By working with an insurance claims dataset, you were able to:

  1. Set up the Neo4j Graph Analytics application within Snowflake.

  2. Prepare and project your data into a graph model (policies and their claim attributes as nodes, linked by contextual relationships).

  3. Generate FastRP node embeddings to capture the structural role of each node in the graph.

  4. Run K-Nearest Neighbors on those embeddings to identify highly similar claims and surface candidates for fraud review.