Synthetic Identity Fraud
1. Introduction
Synthetic identity theft is a type of fraud where a person combines real and fake identity information to open accounts or make purchases. Real information, such as photos or passport numbers, is often stolen and combined with fake information like a fake name, address, or date of birth. The fraudsters then use this synthetic identity to open credit cards or make fraudulent purchases. This method is sometimes referred to as "Frankenstein identity" due to the pieced-together nature of the identity information.
2. Scenario
Identity fraud is a constantly evolving and widespread form of fraud that poses significant risks to individuals, businesses, and the economy as a whole. The consequences of identity fraud can be devastating, resulting in financial losses, damage to credit history, and the difficult process of restoring a stolen identity.
This type of fraud is becoming increasingly important to catch as any proceeds that are taken from a customer’s account with new legislation being discussed will put additional financial protection for victims, with a liability split between the sending and receiving institutions.
Identity fraud is a pervasive and significant threat that requires banks' attention and proactive measures. By effectively mitigating identity fraud, banks not only protect their customers and maintain their trust but also contribute to the overall security and stability of the financial system.
3. Solution
Neo4j is an advanced graph database that is highly effective in detecting synthetic identities. It can model intricate relationships, conduct proficient link analysis, and recognise patterns. This makes it an ideal solution for quickly and easily uncovering suspicious connections and behaviours that are characteristic of synthetic identity fraud.
3.1. How Graph Databases Can Help?
Implementing Neo4j can help execute analysis that was not previously possible. Examples of the scenarios are:
-
Link Analysis: Neo4j enables efficient link analysis to explore connections among identities, analysing interrelationships between data points to identify synthetic identity clusters.
-
Pattern Detection: Neo4j can identify patterns without a starting point, making it ideal for analysing data shapes rather than the data itself.
-
Real-time: Neo4j enables real-time analysis for quickly identifying and responding to synthetic identity fraud. By continuously processing data, graph databases can detect suspicious activities and generate alerts in near real-time.
4. Modelling
This section will show examples of cypher queries on an example graph. The intention is to illustrate what the queries look like and provide a guide on how to structure your data in a real setting. We will do this on a small graph of several nodes. The example graph will be based on the data model below:
4.1. Data Model
4.1.1 Required Fields
Below are the fields required to get started:
Customer
Node:
-
name
: Contains the name of the customer. This could be changed for the customer id or some other identifier for the customer.
Passport
Node:
-
number
: Contains the passport number used byCustomer
Phone
Node:
-
number
: Contains the phone number used byCustomer
Email
Node:
-
email
: Contains the email address used byCustomer
4.2. Demo Data
The following Cypher statement will create the example graph in the Neo4j database:
// Create all people nodes
CREATE (c1:Customer {name: "Michael"})
CREATE (c2:Customer {name: "Adam"})
CREATE (c3:Customer {name: "Alice"})
// Create email node
CREATE (email:Email {email: "michael@abc.com"})
// Create phone node
CREATE (phone:Phone {number: 7971020304})
// Create passport node
CREATE (passport:Passport {number: 123456789})
// Create email relationship
CREATE (c1)-[:HAS_EMAIL]->(email)<-[:HAS_EMAIL]-(c2)
// Create phone relationship
CREATE (c2)-[:HAS_PHONE]->(phone)<-[:HAS_PHONE]-(c3)
// Create passport relationship
CREATE (c3)-[:HAS_PASSPORT]->(passport)<-[:HAS_PASSPORT]-(c1)
5. Cypher Queries
5.1. Identify Customers sharing the same email:
In this query, we will identify anyone who shares the same email address:
-
Customer
nodes should be connected to the sameEmail
node. -
The direction of the relationship has been applied to the traversal.
You can express the same query in a couple of different ways:
// Match all customers sharing an email
MATCH path=(c1:Customer)-[:HAS_EMAIL]->(email)<-[:HAS_EMAIL]-(c2:Customer)
RETURN path
Here you can see that we have provided the labels for the Customer
node and specified the exact relationship to follow. You could also get the same results with:
// Match all customers sharing an email
MATCH path=(c1:Customer)-[]->(email:Email)<-[]-(c2:Customer)
RETURN path
The difference here is that this time we have not specified the relationship type to follow, but because we have specified the Email
node label, as only one relationship leads to the Email
node, we get the same response. If your graph contains multiple relationships connecting a customer to an email, then this query will give you incorrect results.
If you were to provide all labels on nodes and relationships like the query below, you guarantee the correct traversal and ensure you do not get any incorrect results.
// Match all people sharing an email
MATCH path=(c1:Customer)-[:HAS_EMAIL]->(:Email)<-[:HAS_EMAIL]-(c2:Customer)
RETURN path
5.2. Identify customers sharing multiple characteristics:
In this query, we will identify any Customer
who shares the same email, phone or passport number with someone else:
-
Customer
nodes should be connected to the sameEmail
node. -
Customer
nodes should be connected to the samePhone
node. -
Customer
nodes should be connected to the samePassport
node. -
The direction of the relationship has been applied to the traversal.
// Match all customers sharing an email, phone or passport number
MATCH path=(c1:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]-(c2:Customer)
RETURN path
6. Graph Data Science (GDS)
6.1. Weakly Connected Components
The Weakly Connected Components (WCC) algorithm identifies groups of connected nodes in both directed and undirected graphs. Nodes are considered connected if there is a path between them, and a component is formed by all the nodes that are connected to each other.
The reason to use this algorithm is that it identifies clusters of connected nodes with similar attributes, such as Email
, Phone
, or Passport
. It generates a community ID that can be reused in future investigations, providing valuable insights into the data and its surrounding communities.
6.1.1 Create Monopartite Graph
The WCC algorithm can only be applied on monopartite graphs with only one node label. In our case, the node label will be Customer
. We must modify the graph to make the data compatible with the WCC algorithm. To do so, we can use the query below to establish a new relationship called LINKED
, which will be used by the algorithm.
// Match all customers sharing an email, phone or passport number
MATCH (c1:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]->(info)<-[:HAS_EMAIL|HAS_PHONE|HAS_PASSPORT]-(c2:Customer)
WHERE ID(c1) > ID(c2)
CREATE (c1)-[:LINKED]->(c2)
The query above modifies the data model and updates it to appear as follows:
6.1.2 Graph Projection
To start running any Graph Data Science algorithm, you first need to project a part of the graph. This will enable you to analyse the data in the projection effectively.
CALL gds.graph.project(
// graph projection name
'myGraph',
// nodes to import into projection
'Customer',
// relationship to import into projection
'LINKED'
)
6.1.2 GDS Stream
When using the stream
execution mode, the algorithm will provide the component ID for every node. This allows for direct inspection of results or post-processing in Cypher, without any negative impact. By ordering the results, nodes belonging to the same component can be displayed together for easier analysis.
CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).name AS name, componentId
ORDER BY componentId, name
6.1.3 GDS Write
By using the "write" execution mode, you can add the component ID of each node as a property in the Neo4j database. You must specify the name of the new property using the writeProperty
configuration parameter. The output will show a summary row with additional metrics, similar to the stats
mode. Using the write
mode allows you to save the results directly to the database.
CALL gds.wcc.write('myGraph', { writeProperty: 'componentId' })
YIELD nodePropertiesWritten, componentCount;