Synthetic Identity Fraud

A walkthrough of Synthetic Identity Fraud use case

1. Introduction

Synthetic identity theft is a type of fraud where a person combines real and fake identity information to open accounts or make purchases. Real information, such as photos or passport numbers, is often stolen and combined with fake information like a fake name, address, or date of birth. The fraudsters then use this synthetic identity to open credit cards or make fraudulent purchases. This method is sometimes referred to as "Frankenstein identity" due to the pieced-together nature of the identity information.

2. Scenario

Identity fraud is a constantly evolving and widespread form of fraud that poses significant risks to individuals, businesses, and the economy as a whole. The consequences of identity fraud can be devastating, resulting in financial losses, damage to credit history, and the difficult process of restoring a stolen identity.

This type of fraud is becoming increasingly important to catch as any proceeds that are taken from a customer’s account with new legislation being discussed will put additional financial protection for victims, with a liability split between the sending and receiving institutions.

Identity fraud is a pervasive and significant threat that requires banks' attention and proactive measures. By effectively mitigating identity fraud, banks not only protect their customers and maintain their trust but also contribute to the overall security and stability of the financial system.

3. Solution

Neo4j is an advanced graph database that is highly effective in detecting synthetic identities. It can model intricate relationships, conduct proficient link analysis, and recognise patterns. This makes it an ideal solution for quickly and easily uncovering suspicious connections and behaviours that are characteristic of synthetic identity fraud.

3.1. How Graph Databases Can Help?

Implementing Neo4j can help execute analysis that was not previously possible. Examples of the scenarios are:

Link Analysis: Neo4j enables efficient link analysis to explore connections among identities, analysing interrelationships between data points to identify synthetic identity clusters.
Pattern Detection: Neo4j can identify patterns without a starting point, making it ideal for analysing data shapes rather than the data itself.
Real-time: Neo4j enables real-time analysis for quickly identifying and responding to synthetic identity fraud. By continuously processing data, graph databases can detect suspicious activities and generate alerts in near real-time.

4. Modelling

This section will show examples of cypher queries on an example graph. The intention is to illustrate what the queries look like and provide a guide on how to structure your data in a real setting. We will do this on a small graph of several nodes. The example graph will be based on the data model below:

4.1. Data Model

4.1.1 Required Fields

Below are the fields required to get started:

Customer Node:

customerId: Contains the unique system identifier for the customer record.

Passport Node:

passportNumber: Contains the passport number used by Customer

Phone Node:

phoneNumber: Contains the phone number used by Customer

Email Node:

address: Contains the email address used by Customer

4.2. Demo Data

The following Cypher statement will create the example graph in the Neo4j database:

// Create all people nodes
CREATE (c1:Customer {customerId: "CUS001"})
CREATE (c2:Customer {customerId: "CUS002"})
CREATE (c3:Customer {customerId: "CUS003"})
// Create email node
CREATE (email:Email {address: "michael@abc.com"})
// Create phone node
CREATE (phone:Phone {phoneNumber: "9999990304"})
// Create passport node
CREATE (passport:Passport {passportNumber: 1234567890})


// Create email relationship
CREATE (c1)-[:HAS_EMAIL]->(email)<-[:HAS_EMAIL]-(c2)
// Create phone relationship
CREATE (c2)-[:HAS_PHONE]->(phone)<-[:HAS_PHONE]-(c3)
// Create passport relationship
CREATE (c3)-[:HAS_PASSPORT]->(passport)<-[:HAS_PASSPORT]-(c1)

4.3. Neo4j Scheme

If you call:

// Show neo4j scheme
CALL db.schema.visualization()

You will see the following response:

5. Cypher Queries

5.1. Identify Customers sharing the same email:

In this query, we will identify anyone who shares the same email address:

Customer nodes should be connected to the same Email node.
The direction of the relationship has been applied to the traversal.

You can express the same query in a couple of different ways:

// Match all customers sharing an email
MATCH path=(c1:Customer)-[:HAS_EMAIL]->(email)<-[:HAS_EMAIL]-(c2:Customer)
RETURN path

Here you can see that we have provided the labels for the Customer node and specified the exact relationship to follow. You could also get the same results with:

// Match all customers sharing an email
MATCH path=(c1:Customer)-[]->(email:Email)<-[]-(c2:Customer)
RETURN path

The difference here is that this time we have not specified the relationship type to follow, but because we have specified the Email node label, as only one relationship leads to the Email node, we get the same response. If your graph contains multiple relationships connecting a customer to an email, then this query will give you incorrect results.

If you were to provide all labels on nodes and relationships like the query below, you guarantee the correct traversal and ensure you do not get any incorrect results.

// Match all people sharing an email
MATCH path=(c1:Customer)-[:HAS_EMAIL]->(:Email)<-[:HAS_EMAIL]-(c2:Customer)
RETURN path

5.2. Identify customers sharing multiple characteristics:

In this query, we will identify any Customer who shares the same email, phone or passport number with someone else:

Customer nodes should be connected to the same Email node.
Customer nodes should be connected to the same Phone node.
Customer nodes should be connected to the same Passport node.
The direction of the relationship has been applied to the traversal.

// Match all customers sharing an email, phone or passport number
MATCH path=(c1:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]-(c2:Customer)
RETURN path

6. Graph Data Science (GDS)

6.1. Weakly Connected Components

The Weakly Connected Components (WCC) algorithm identifies groups of connected nodes in both directed and undirected graphs. Nodes are considered connected if there is a path between them, and a component is formed by all the nodes that are connected to each other.

The reason to use this algorithm is that it identifies clusters of connected nodes with similar attributes, such as Email, Phone, or Passport. It generates a community ID that can be reused in future investigations, providing valuable insights into the data and its surrounding communities.

6.1.1 Create Monopartite Graph

The WCC algorithm can only be applied on monopartite graphs with only one node label. In our case, the node label will be Customer. We must modify the graph to make the data compatible with the WCC algorithm. To do so, we can use the query below to establish a new relationship called LINKED, which will be used by the algorithm.

// Match all customers sharing an email, phone or passport number
MATCH (c1:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]-(c2:Customer)
WHERE ID(c1) > ID(c2)
CREATE (c1)-[:LINKED]->(c2)

The query above modifies the data model and updates it to appear as follows:

fs synthetic identity fraud gds data model

6.1.2 Graph Projection

To start running any Graph Data Science algorithm, you first need to project a part of the graph. This will enable you to analyse the data in the projection effectively.

CALL gds.graph.project(
    // graph projection name
    'myGraph',
    // nodes to import into projection
    'Customer',
    // relationship to import into projection
    'LINKED'
)

6.1.2 GDS Stream

When using the stream execution mode, the algorithm will provide the component ID for every node. This allows for direct inspection of results or post-processing in Cypher, without any negative impact. By ordering the results, nodes belonging to the same component can be displayed together for easier analysis.

CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).customerId AS customerId, componentId
ORDER BY componentId, customerId

6.1.3 GDS Write

By using the "write" execution mode, you can add the component ID of each node as a property in the Neo4j database. You must specify the name of the new property using the writeProperty configuration parameter. The output will show a summary row with additional metrics, similar to the stats mode. Using the write mode allows you to save the results directly to the database.

CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;