A Graph-Based Approach to Financial Fraud Detection

The Challenge

Financial fraud has moved from simple, isolated incidents to sophisticated, interconnected schemes, with criminals using increasingly advanced methods to bypass traditional security measures. Standard fraud detection systems often rely on discrete rules or tabular machine learning models that analyze transactions in isolation.

These traditional approaches have served institutions well but can struggle to detect complex, interconnected fraud schemes such as:

Fraud Rings: Organized groups of fraudsters sharing resources (devices, credentials) to commit wide-scale abuse.
Synthetic Identities: Identities cobbled together from real and fake information (e.g., a real SSN with a fake name) to build credit profiles before "busting out".
Layered Money Laundering: Complex chains of transactions designed to obscure the origin of illicit funds.

In these scenarios, the individual transaction often looks legitimate. The fraud signal only becomes apparent when analyzing the network of relationships: shared devices, common IP addresses, or circular money flows.

The Solution

This solution uses Neo4j Graph Data Science (GDS) to compute topological signals of fraud. We demonstrate this by ingesting the IEEE-CIS Fraud Detection dataset into a graph and showing how to uncover hidden connections between seemingly unrelated entities.

The approach moves beyond simple "blacklists" to identify "Fraud Islands" (tightly connected communities of devices and cards that engage in illicit activity) and uses these structural insights to train more accurate machine learning models.

Data Model and Network Structure

The data model connects transactions to the digital footprints left by users. The graph consists of the following nodes: Transaction, Card, Device, Email, Address, and Identity.

This structure allows for the detection of suspicious patterns, such as:

Device Sharing: A single device ID linked to multiple distinct credit cards (a strong signal of a compromised device or a fraud farm).
Identity Reuse: Multiple accounts sharing the same email address or physical address.
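
The two patterns above can be sketched in plain Python as a lookup from each device or email to the set of cards seen with it. This is only an illustration of the idea, not the Neo4j data model itself: the records, the field order, and the helper names build_links and shared are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (transaction_id, card_id, device_id, email).
TRANSACTIONS = [
    ("t1", "card_A", "dev_1", "a@example.com"),
    ("t2", "card_B", "dev_1", "b@example.com"),
    ("t3", "card_C", "dev_1", "a@example.com"),
    ("t4", "card_D", "dev_2", "d@example.com"),
]


def build_links(transactions):
    """Map each device and each email to the set of cards seen with it."""
    cards_by_device = defaultdict(set)
    cards_by_email = defaultdict(set)
    for _tx_id, card, device, email in transactions:
        cards_by_device[device].add(card)
        cards_by_email[email].add(card)
    return cards_by_device, cards_by_email


def shared(links, min_cards=2):
    """Keep only the entries linked to at least `min_cards` distinct cards."""
    return {
        key: sorted(cards)
        for key, cards in links.items()
        if len(cards) >= min_cards
    }


cards_by_device, cards_by_email = build_links(TRANSACTIONS)
shared_devices = shared(cards_by_device)  # device sharing
shared_emails = shared(cards_by_email)    # identity reuse
print(shared_devices)
print(shared_emails)
```

The min_cards threshold is an assumption for the sketch; a real deployment would tune it, since some sharing (for example a family computer) is legitimate.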

Methodology: The Graph Data Science Pipeline

The solution implements a hybrid Machine Learning pipeline where graph features enrich traditional tabular models.

1. Feature Engineering with GDS

Instead of relying solely on transaction amounts and timestamps, the system generates powerful graph-based features using Neo4j GDS algorithms:

PageRank: Measures the relative influence of a node. A device used by many high-value cards will have a high PageRank, signaling potential risk.
Degree Centrality: Counts the number of direct connections. High degree centrality (e.g., one card connected to many emails) often indicates high-velocity suspicious activity.
Louvain Community Detection: Identifies "communities" or clusters within the graph. Fraudulent activity often forms "Fraud Rings": disjoint communities of moderate size that are distinct from the giant component of legitimate users.
FastRP (Node Embeddings): Generates vector representations of nodes, capturing their structural role in the network for use in downstream ML models.
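
To make the first three features concrete, here is a minimal pure-Python sketch on a hypothetical card-device graph. It is not the GDS implementation: PageRank is a plain power iteration with an assumed damping factor of 0.85, and communities are approximated by connected components, a much simpler stand-in for Louvain that only finds fully disjoint groups. FastRP is omitted. All node names and function names are illustrative.

```python
from collections import defaultdict

# Hypothetical undirected edges between cards and devices.
EDGES = [
    ("card_A", "dev_1"), ("card_B", "dev_1"), ("card_C", "dev_1"),
    ("card_D", "dev_2"),
]


def adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Number of direct connections per node."""
    return {node: len(neigh) for node, neigh in adj.items()}


def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node here has at least one neighbour."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in adj}
        for node, neigh in adj.items():
            share = damping * rank[node] / len(neigh)
            for other in neigh:
                new_rank[other] += share
        rank = new_rank
    return rank


def components(adj):
    """Connected components (stand-in for Louvain): node -> component id."""
    community = {}
    for start in adj:
        if start in community:
            continue
        community[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for other in adj[node]:
                if other not in community:
                    community[other] = start
                    stack.append(other)
    return community


adj = adjacency(EDGES)
degree = degree_centrality(adj)
rank = pagerank(adj)
community = components(adj)
print(degree["dev_1"], max(rank, key=rank.get), community["card_B"])
```

In this toy graph the shared device dev_1 has both the highest degree and the highest PageRank, and the cards attached to it fall into one component that is separate from card_D and dev_2.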

2. Graph-Enhanced Machine Learning

These graph features are extracted and combined with traditional tabular data (Time, Amount) to train an XGBoost Classifier.
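
A minimal sketch of the combination step, assuming graph features have already been exported per card. The row layout, the feature values, the column names, and the enrich helper are all hypothetical; the source does not specify how the join is done.

```python
# Hypothetical tabular rows, one per transaction.
TABULAR = [
    {"tx": "t1", "card": "card_A", "time": 100, "amount": 25.0},
    {"tx": "t2", "card": "card_D", "time": 160, "amount": 980.0},
]

# Hypothetical graph features keyed by card (illustrative values only).
GRAPH_FEATURES = {
    "card_A": {"degree": 3, "pagerank": 0.12, "community_size": 4},
}
GRAPH_COLUMNS = ["degree", "pagerank", "community_size"]


def enrich(rows, graph_features, columns, default=0.0):
    """Append graph features to [time, amount]; missing cards get `default`."""
    enriched = []
    for row in rows:
        feats = graph_features.get(row["card"], {})
        vector = [float(row["time"]), float(row["amount"])]
        vector += [float(feats.get(col, default)) for col in columns]
        enriched.append(vector)
    return enriched


X = enrich(TABULAR, GRAPH_FEATURES, GRAPH_COLUMNS)
for vector in X:
    print(vector)
```

A matrix shaped like X, together with the fraud labels, is the kind of input a gradient-boosted classifier such as XGBoost would be trained on. Filling missing graph features with 0.0 is an assumption of this sketch.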

As the comparison chart shows (orange indicates the better result), the Graph-Enhanced Model significantly outperforms the baseline tabular model on Precision-Recall AUC (PR-AUC). This suggests that structural context helps the model differentiate between legitimate high-value transactions and actual fraud.
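
For readers unfamiliar with the metric, PR-AUC is commonly computed as average precision: the mean of the precision values at each rank where a true fraud case appears. The sketch below implements that definition in plain Python. The labels and scores are made-up toy values chosen to show the arithmetic; they are not results from this solution, and ties between scores are simply broken by sort order.

```python
def average_precision(labels, scores):
    """PR-AUC as average precision over a ranking by descending score."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for position, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / position  # precision at this recall step
    return precision_sum / total_positives


# Toy data (hypothetical): two fraud cases among six transactions.
labels = [1, 0, 1, 0, 0, 0]
toy_scores_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]  # second fraud case ranked 4th
toy_scores_b = [0.9, 0.3, 0.8, 0.4, 0.2, 0.1]  # both fraud cases ranked on top

print(average_precision(labels, toy_scores_a))  # (1/1 + 2/4) / 2 = 0.75
print(average_precision(labels, toy_scores_b))  # (1/1 + 2/2) / 2 = 1.0
```

Because fraud is rare, PR-AUC is usually more informative than accuracy: it rewards ranking the few fraud cases above the many legitimate ones.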

Enterprise Integration Architecture

In a production banking environment, this Fraud Knowledge Graph acts as a central Intelligence Layer:

Hot Path (Real-Time): Transactions are streamed via Kafka or other real-time streaming platforms into Neo4j. The graph updates relationships (e.g., Device-to-Card) as events arrive and calculates a "Network Risk Score" (e.g., Is this card 3 hops away from a known fraudster?), which is fed to the authorization engine in real time.
Cold Path (Batch & Feedback): Historical data is loaded from data lakes for deep investigation. When analysts confirm a fraud ring in tools like Neo4j Bloom, the "Confirmed Fraud" label is propagated back to the graph, automatically flagging any new accounts that connect to that ring.
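
Both paths rest on the same primitive: the distance from a node to the nearest confirmed-fraud node. The sketch below shows it as a bounded breadth-first search in plain Python. It assumes an in-memory adjacency map with hypothetical node names; in production the traversal would run inside the graph database, and hops_to_fraud and flag_connected are illustrative helpers, not part of any Neo4j API.

```python
from collections import deque

# Hypothetical adjacency map: node -> directly connected nodes.
GRAPH = {
    "card_A": {"dev_1"},
    "dev_1": {"card_A", "card_B"},
    "card_B": {"dev_1", "email_1"},
    "email_1": {"card_B", "card_X"},
    "card_X": {"email_1"},
    "card_Z": set(),
}
CONFIRMED_FRAUD = {"card_X"}  # labels confirmed by analysts (cold path)


def hops_to_fraud(graph, start, fraud_nodes, max_hops=3):
    """Shortest hop count to any confirmed-fraud node, or None if > max_hops."""
    if start in fraud_nodes:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not explore beyond the hop budget
        for other in graph.get(node, ()):
            if other in seen:
                continue
            if other in fraud_nodes:
                return dist + 1
            seen.add(other)
            queue.append((other, dist + 1))
    return None


def flag_connected(graph, fraud_nodes, max_hops=3):
    """Cold-path feedback: nodes within `max_hops` of confirmed fraud."""
    return sorted(
        node for node in graph
        if node not in fraud_nodes
        and hops_to_fraud(graph, node, fraud_nodes, max_hops) is not None
    )


print(hops_to_fraud(GRAPH, "card_B", CONFIRMED_FRAUD))  # hot-path check
print(flag_connected(GRAPH, CONFIRMED_FRAUD))           # cold-path flags
```

Here card_B is two hops from the confirmed fraudster and would be flagged, while card_A is four hops away and card_Z is unconnected, so neither is. The three-hop limit mirrors the example in the text and is a tunable assumption.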

Business Benefits

Reduced False Positives: By understanding the context of a transaction (e.g., a user traveling vs. a stolen card), banks can approve more legitimate transactions.
Detection of Novel Attacks: Graph centrality and community detection can identify new fraud rings based on their shape and behavior, even before specific rules are written for them.
Visual Forensics: The graph provides an intuitive way for analysts and forensic investigators to "follow the money" and visually investigate complex rings, which can substantially reduce investigation time.

Resources

Repository: pedroleitao-neo4j/finance-ieee-cis-fraud
Dataset: IEEE-CIS Fraud Detection Dataset
Documentation: Neo4j Graph Data Science