Exploring Supervised Entity Resolution in Neo4j

Data Science Product Specialist, Neo4j
20 min read

What is Entity Resolution (ER)?
Entity Resolution (ER) is the process of disambiguating data to determine if multiple records actually represent the same real world entity such as a person, organization, place, or other type of object.For example, say you have information on persons coming from different e-commerce platforms. They may have slightly different contact information, with addresses formatted differently, using different forms/abbreviations of names, etc. A human may be able to tell if the records actually belong to the same underlying entity, but given the number of possible combinations and matching that can be had, there is a need for an intelligent automated approach to doing so, which is where ER systems come into play.
Why is ER Important for Data Science?
We see application for ER across industries, including Online Advertising, Marketing, Retail, Anti-Money Laundering, and Law Enforcement. Oftentimes our analysis requires integrating multiple data sources together, and we aren’t always in control of the quality and consistency of those data sources. For example, integrating additional e-commerce information with Customer Relationship Management (CRM) data can provide better understanding of a customers preferences, which could result in better user experiences, product recommendation, and ultimately sales revenue. Being able to correctly identify and merge entities will be key to providing actionable insights and business value — otherwise any modeling and analysis we do on users could be incomplete and suffer from poor accuracy and biases. In other situations we may have just have one data source where identities are purely transient and/or severely obfuscated. One example of this is web browsing activity, where users may be browsing on the go, potentially on multiple devices without a traceable persistent id from account login. Without some sort of ER solution, an online advertiser would have a limited and fragmented view of user activity and preferences which could significantly hinder their ability to accurately target and drive engagement.ER in Graph
Graphs are well suited for many ER problems because they can represent associated information between subjects with paths made up of nodes and relationships. The resulting connectivity between subjects can be used to make entity resolution inferences.


- Linking step, where you do pairwise entity linking, and the
- Resolution step, where you identify resolved entities via the linkages so downstream processes can easily access information aggregated by resolved entity.
Cross Device Entity Linking Example
To demonstrate, we will use a codalab challenge dataset focused on cross device entity linking. The dataset contains a browsing log for a set of anonymized userIDs at the level of devices and corresponding click-streams. User clicks are linked to site URLs and HTML titles. They also release a subset of known existing entity linkages between userIds, which is intended for use in supervised learning. The challenge is to predict new entity linkages, effectively resolving the same user across multiple devices. This type of entity linking is useful for marketing and online advertising since generating more complete user profiles can enable better targeting. As explained in the challenge, persistent user ids are not always available, so companies often rely on these weaker device level or browser cookie ids. If you are interested you can find more details about the dataset and challenge here. One interesting thing about this dataset is that it does not have any user identifiers outside the transient userId. I think this is actually somewhat unique for an ER problem, usually you have some other identifiers, like usernames, email addresses, etc. In this case, we will need to train a model to predict underlying person identities just based on user browsing behavior.Importing Into Neo4j Graph
For this example I am just using 10% of the training set provided from the competition. I recommend keeping it around there for experimentation on the average laptop. That said, I haven’t had any issues loading in more of the data so long as the hardware limitations on your machine, particularly RAM, can handle the additional data. If you are interested in how to optimize GDS for analyses like these on your machine, please see the GDS system requirements for a high level, particularly around setting heap space. The GDS configuration guide covers this in even deeper detail. There are a couple different ways to construct a graph out of this data, and in fact, if you are interested in these sorts of problems, I would encourage you to try other schemas. But for sake of a quick example, we will go with the following, just leveraging the URLs (not even touching the HTML titles):
Web Hierarchy and CHILD_OF Relationships
I am also making a CHILD_OF relationship to capture the hierarchical structure of the Websites. In the Codalab dataset, the words within each URL segment and query string are hashed to preserve anonymity, but the ordering is preserved. Below is an example — notice the forward slashes and question mark denoting the path segments and query string respectively:

Exploring the Graph
I will spare you much of the initial EDA, but here are a couple quick queries of the graph just to give you an idea for what things look like.- Aggregate Tallies: Loading approximately 10% of the CodaLab dataset, we get almost 2.5 million total nodes and a bit over 5 million total relationships. The great majority of the nodes are Websites with only ~34k Users. The relationships are split mostly between ~2.8M VISTED and ~2.3M CHILD_OF relationships, with only ~5K SAME_AS relationships.
$ CALL apoc.meta.stats() YIELD stats RETURN stats.nodeCount AS nodeCount, stats.relCount as RelationshipCount; ╒═══════════╤═══════════════════╕ │"nodeCount"│"RelationshipCount"│ ╞═══════════╪═══════════════════╡ │2433861 │5176804 │ └───────────┴───────────────────┘ $ CALL apoc.meta.stats() YIELD labels RETURN labels.User AS userCount, labels.Website AS WebsiteCount; ╒═══════════╤══════════════╕ │"userCount"│"WebsiteCount"│ ╞═══════════╪══════════════╡ │33941 │2399920 │ └───────────┴──────────────┘ $ CALL apoc.meta.stats() YIELD relTypesCount RETURN relTypesCount.VISITED AS `VISITED Count`, relTypesCount.CHILD_OF AS `CHILD_OF Count`, relTypesCount.SAME_AS AS `SAME_AS Count`; ╒═══════════════╤════════════════╤═══════════════╕ │"VISITED Count"│"CHILD_OF Count"│"SAME_AS Count"│ ╞═══════════════╪════════════════╪═══════════════╡ │2841391 │2330404 │5009 │ └───────────────┴────────────────┴───────────────┘
- Connected Browsing Activity: Users can be connected through browsing activities in multiple ways. They can be connected directly by visiting the exact same URL/Website but also in multi-hop ways by visiting websites linked through the hierarchical CHILD_OF relationships. Below is a good example:
MATCH (u)-[v:VISITED]->(w:Website) WHERE u.userId in ["1b13bc4a2b79448c462e4f3817e6d470", "f05fe874d83d346f4993ce5ea872f20b"] RETURN u,w,v;

- SAME AS Relationships: Sampling the same-as relationships it appears most components just consist of pairs of two userIds, with fewer components of three and more.
MATCH (u1:User)-[r:SAME_AS]->(u2:User) RETURN u1,u2 limit 400;

CALL gds.graph.create('same-as-subgraph','User', 'SAME_AS'); CALL gds.wcc.write('same-as-subgraph', {writeProperty: 'entityId'}); MATCH(u:User) WITH DISTINCT u.entityId as eid, count(*) as componentSize RETURN distinct componentSize, count(*) as numberOfComponents ORDER BY componentSize; ╒═══════════════╤════════════════════╕ │"componentSize"│"numberOfComponents"│ ╞═══════════════╪════════════════════╡ │1 │26259 │ ├───────────────┼────────────────────┤ │2 │2995 │ ├───────────────┼────────────────────┤ │3 │413 │ ├───────────────┼────────────────────┤ │4 │81 │ ├───────────────┼────────────────────┤ │5 │18 │ ├───────────────┼────────────────────┤ │6 │4 │ ├───────────────┼────────────────────┤ │7 │1 │ ├───────────────┼────────────────────┤ │8 │1 │ └───────────────┴────────────────────┘
Pre-Processing & Feature Engineering
A couple steps before we create the the Pipeline: First, we need to create a graph projection for the pipeline to operate on. We did this above really quickly when we ran the WCC algorithm. This time we are going to project the entire graph. Graph projections copy nodes and relationships from the database to an analytics environment optimized for global traversals where we can run algorithms and machine learning models quickly over the entire graph or large portions of it.//create named graph - project entire graph with undirected relationships CALL gds.graph.create( 'er-projection', ['User', 'Website'], { SAME_AS: { type: 'SAME_AS', orientation: 'UNDIRECTED' }, CHILD_OF: { type: 'CHILD_OF', orientation: 'UNDIRECTED' }, VISITED: { type: 'VISITED', orientation: 'UNDIRECTED' } } ) YIELD nodeCount, relationshipCount, createMillis; ╒═══════════╤═══════════════════╤══════════════╕ │"nodeCount"│"relationshipCount"│"createMillis"│ ╞═══════════╪═══════════════════╪══════════════╡ │2433861 │10353608 │931 │ └───────────┴───────────────────┴──────────────┘When we specify orientation as ‘UNDIRECTED’ we effectively create an undirected graph for the projection. This distinction is important for upcoming algorithms, particularly for Link Prediction, which expects undirected relationships. Next, we are going to do some feature engineering — specifically we are going to generate node embeddings. It is usually best to generate features inside the pipeline if you can, as the steps are automatically tracked with the data splitting and model evaluation. However, in this specific instance, we will be further filtering the graph during model training and the relationships and nodes we need will be filtered out. We will use a Node Embedding technique called Fast Random Projection or “FastRP” for short. If you are interested in the technical details, please check out the documentation. To summarize for our intents and purposes, the job of FastRP is to take the complex interconnected structure of the user browsing activity and web hierarchy, and translate it into fixed length numeric vectors, one for each node in the graph. Executing FastRP successfully here means that users with similar browsing activity will have embedding vectors that are relatively close, while users with very different browsing activity will have relatively distant embeddings. You can imagine just how useful such features will be for model training. Of the multiple node embeddings techniques available in GDS, FastRP is chosen here because it it is relatively easy to execute and also VERY performant, taking only seconds to complete for this dataset. I did adjust the iteration weights and dimensionality a bit from defaults in attempt to capture more of the rich graph structure. This was just done by hand (could probably be optimized further).
//create fastRP embeddings based on VISITED and CHILD_OF relationships CALL gds.fastRP.mutate( 'er-projection', { mutateProperty: 'embedding', relationshipTypes: ['CHILD_OF' , 'VISITED'], iterationWeights: [0.0, 1.0, 0.7, 0.5, 0.5, 0.4], embeddingDimension: 128, randomSeed: 7474 } ) YIELD nodePropertiesWritten, computeMillis; ╒═══════════════════════╤═══════════════╕ │"nodePropertiesWritten"│"computeMillis"│ ╞═══════════════════════╪═══════════════╡ │2433861 │11130 │ └───────────────────────┴───────────────┘
Configuring the Link Prediction (LP) Pipeline
We will use Link prediction in GDS to accomplished supervised entity linking. As the name implies, Link Prediction is a machine learning technique to predict whether or not a relationship should exist between two nodes. A training example consists of input features generated from two nodes, with a binary class output representing whether or not a link exists between them.

- Instantiate the pipeline
CALL gds.alpha.ml.pipeline.linkPrediction.create('er-pipe');2. Calculate link features, specifically features that represent a measure of distance between the FastRP embeddings. As stated above, users with similar browsing activity should also have close embeddings, so the distance metric can provide a really good signal for our model. In this case I am adding two variants of distance, Least Squares (L2) and Cosine link feature combiners.
//add L2 link feature CALL gds.alpha.ml.pipeline.linkPrediction.addFeature( 'er-pipe', 'l2', {nodeProperties: ['embedding']} ); //add cosine link feature CALL gds.alpha.ml.pipeline.linkPrediction.addFeature( 'er-pipe', 'cosine', {nodeProperties: ['embedding']} );3. Configure data splitting.
CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit( 'er-pipe', { testFraction: 0.3, trainFraction: 0.7, negativeSamplingRatio: 10, validationFolds: 5 } );We accomplish a couple things here. First, we decide on split ratio between training and test sets, just like a traditional ML workflow. A big difference to be aware of in this pipeline is the introduction of a third,
feature input
, set. If you allocate trainFraction=0.4
and testFraction=0.2 
;, then the remaining 1 — 0.4 — 0.2 = 0.4
or 40% will be allocated to the feature input set. The Feature input set is used to avoid data leakage while generating node properties inside the pipeline, such as centrality, node embeddings, or other similar metrics. We will not really be leveraging it in this example (our embeddings do not leverage the SAME_AS relationships), but it is critical to use when you are generating node properties off the links you are trying to predict.
Second, we configure the negativeSamplingRatio
. Negative examples in this context are any pairs of user nodes which do not already have a SAME_AS link between them. The negative sampling ratio dictates the rate at which these pairs are sampled relative to the known SAME_AS pairs during training. Link Prediction problems tend to be highly imbalanced with way more negative examples possible in the graph than positive ones — it is an O(n²) problem.
Neo4j’s recommended value for negativeSamplingRatio is the true class ratio of the graph. However, the higher the negative sample ratio is, the larger your dataset for training and testing becomes, which translates to longer evaluation time. So for larger graphs, a tradeoff exists between evaluation accuracy and speed, and I have personally found that setting the negative sampling rate to the true sample ratio is not always feasible as the graph grows in size.
In this example, the true class would be calculated as:
true_class_ratio = (q - Number_Of_SAME_AS_Rels) / Number_Of_SAME_AS_Rels where q = Number_Of_Users * (Number_Of_Users-1)/2Which comes to almost 115,000. Here I chose
negativeSamplingRatio=10
as it evaluates relatively quickly for purposes of running through an example and demonstrating how things work.
- Hyper-parameters and training configuration. These cover many of the ordinary things you would expect for ML training — i.e. bias features, penalties, batch sizes, tolerance, epochs etc.. Each set of parameters below will correspond to a model. All of the models will be trained and the best performing one will be automatically selected. For this I am just adjusting the penalty between the three models.
//configure model parameters CALL gds.alpha.ml.pipeline.linkPrediction.configureParams('er-pipe', [ { penalty: 0.0, patience: 3, maxEpochs: 1000, tolerance: 0.00001 }, { penalty: 0.01, patience: 3, maxEpochs: 1000, tolerance: 0.00001 }, { penalty: 1.0, patience: 3, maxEpochs: 1000, tolerance: 0.00001 } ]);
Training the LP Model
When we run the training step, we filter the graph down to just Users and SAME_AS relationships, as this is what we are trying to predict.//train the model CALL gds.alpha.ml.pipeline.linkPrediction.train( 'er-projection', { modelName: 'entity-linkage-model', pipeline: 'er-pipe', randomSeed: 7474, concurrency: 4, nodeLabels: ['User'], relationshipTypes: ['SAME_AS'], negativeClassWeight: 1.0/10.0 }) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore; ╒═════════════════════════════════════════════════════════════════╕ │ "winningModel" │ "trainGraphScore" │"testGraphScore" │ ╞═════════════════════════════════════════════════════════════════╡ │ {..."penalty":1.0,...} │ 0.937 │ 0.942 │ └─────────────────────────────────────────────────────────────────┘The scores we get back are Area Under the Precision Recall Curve (AUCPR) scores. AUCPR is a metric bound between [0,1] which is particularly well-suited for scoring models on imbalanced data sets. In this case, we want to identify as many of the underlying entity links as possible while keeping the ratio of false positives low, and this is exactly what Recall and Precision measure respectively. Generally, the more entity linkages we want to identify, the wider net we need to cast, and the higher the ratio of false positives will become. This tradeoff tends to be expensive by default in severely imbalanced datasets where positive observations rarely occur. One way to think about AUCPR is that as it gets closer to 1, the tradeoff gets less expensive, i.e. we can identify more of the entity links with less of an increase in false positive rate. It is critical to note that the way we set
negativeSampleWeight
has a big effect on the AUCPR score and it’s interpretation. Generally speaking, there are two approaches to setting the negativeSampleWeight
- Traditional Evaluation Approach: Set the
negativeSampleWeight
to the reciprocal of thenegativeSamplingRatio
so thatnegativeSamplingRatio * negativeSampleWeight = 1
and the total probability mass in the test set is balanced 1-to-1. This is what we did above. In this case, AUCPR reflects how the model performs on a balanced dataset. - True Class Ratio Approach: Set
negativeSampleWeight
such that the total probability mass reflects the true class ratio, i.e.negativeSamplingRatio * negativeSampleWeight = trueClassRatio
. This AUCPR can more accurately reflect how the model predicts on new unseen data, but the score will almost always be drastically smaller, so temper expectations.
CALL gds.alpha.ml.pipeline.linkPrediction.train( 'er-projection', { modelName: 'entity-linkage-model-imb', pipeline: 'er-pipe', randomSeed: 7474, concurrency: 4, nodeLabels: ['User'], relationshipTypes: ['SAME_AS'], negativeClassWeight:11499 }) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore; ╒═════════════════════════════════════════════════════════════════╕ │ "winningModel" │ "trainGraphScore" │ "testGraphScore" │ ╞═════════════════════════════════════════════════════════════════╡ │ {..."penalty":0.01,...} │ 0.045 │ 0.052 │ └─────────────────────────────────────────────────────────────────┘This lower score suggests that our model may lack precision when predicting overall node pairs in the graph (i.e. bring back a lot of false positives). While I won’t be tackling that in-depth in this post, there are multiple things we could adjust to try and improve this, including bringing in more of the source data, increasing the negative sampling rate, or tuning the feature engineering and/or hyper-parameters. That being said, some of these issues are difficult to avoid in ER and ultimately you need to analyze the business cost for false positives and false negatives within the context of your specific problem — i.e. how many false positives can be tolerated to identity a true entity linkage? And conversely, what is the cost of missing a true entity linkage? These costs may vary greatly between different business applications — for example, the cost of false positives may be significantly higher in marketing applications than in, say, anti-fraud. Additionally, it is not always straightforward which of the weighting approaches is best to optimize for an ER problem. Often in ER, datasets are only partially labeled, i.e. there may be many unidentified entity linkages present in the current data. In this context, some lack of precision may not be that bad if it helps identify more of the unknown entity linkages and uncover patterns not yet fully understood by analysts. Again, it really depends on the problem domain and the use case at hand.
Creating Entity Linkage Predictions
Once the model is trained we can use it to try and predict new entity linkages in the graph. In the below command we scan the graph for the top 20 most probable missing entity links (a.k.a — highest predicted probabilities according to the model) and create the predicted links in the projected graph. (Note: As all potential user pairs in the graph are scanned, this step may take a while to complete. For this example, it takes me ~5–10 minutes).//make link predictions CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('er-projection', { modelName: 'entity-linkage-model', mutateRelationshipType: 'SAME_AS_PREDICTED', nodeLabels: ['User'], relationshipTypes: ['SAME_AS'], topN: 20, threshold: 0.0 });Once the above is complete we can easily write the links back to the database with the predicted probabilities.
//write predicted relationships back to DB CALL gds.graph.writeRelationship('er-projection', 'SAME_AS_PREDICTED', 'probability');When we write undirected relationships back to the database we will get 2 relationships between each node, one for each direction. In this application we do not care about direction so these are effectively duplicates. We can deduplicate them with the below:
MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) WHERE id(u1) < id(u2) DELETE rNow we can visualize the predicted entity linkages:
MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) RETURN u1,r,u2;

//show predicted entity links ordered by probability MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) RETURN u1.userId AS user1, u2.userId AS user2, r.probability as entityLinkageProbability ORDER BY entityLinkageProbability DESC

Generating Resolved Person Ids and Querying Resolved Person Info
We can generate resolved person ids with both the SAME_AS and SAME_AS_PREDICTED relationships using the Weakly Connected Components (WCC) algorithm. we will use the GDS write API to write these ids directly back to the database.CALL gds.wcc.write( 'er-projection', { nodeLabels: ['User'], writeProperty: 'personId' } ) YIELD componentCount, nodePropertiesWritten, writeMillis, computeMillis; ╒════════════════╤══════════════════╤═════════════╤═══════════════╕ │"componentCount"│"nodePropsWritten"│"writeMillis"│"computeMillis"│ ╞════════════════╪══════════════════╪═════════════╪═══════════════╡ │29760 │33941 │37 │33 │ └────────────────┴──────────────────┴─────────────┴───────────────┘We can then easily query and aggregate information by these ids. Cypher is extremely performant when it comes to relationship traversals, so oftentimes these queries can be replicated using just the SAME_AS* relationship types, but the id is still convenient to have. For example, in the below query we return aggregated browsing activity for all persons that had a newly predicted entity link.
//get all resolved persons who have a newly predicted entity link MATCH (n:User)-[:SAME_AS_PREDICTED]-() WITH DISTINCT n.personId AS personIdList //get all browsing activity for those persons MATCH (n:User)-[r:VISITED]->(w:Website) WHERE n.personId IN personIdList //return each person with userIds and summary browsing activity WITH n.personId as personId, collect({ website:w.url, firstTimeStamp: r.timeStamps[0], lastTimeStamp: last(r.timeStamps)}) AS webActivity, collect(DISTINCT n.userId) AS userIds RETURN personId, webActivity, userIds ORDER BY personId DESCAnd that will give results like those below, which includes a resolved person id, their aggregated browsing activity, and all their userIds:
[ { "personId": 10644, "webActivity": [ { "firstTimeStamp": "2016-06-06T22:10:53.779000000Z", "lastTimeStamp": "2016-06-06T22:11:23.779000000Z", "website": "bddd393cb02101a/203b292b93c374e5" },... ], "userIds": [ "3005d862d7f6a60f2a19fa10a70dd24d", "3e4217f722986af322333746f5f132c0", "50e9121a2f9f30c83b071e9adf9951c3" ] },... ]These sorts of queries can be incredibly useful for downstream processes that need comprehensive resolved views for persons, whether that be for further analytics, reporting, or a frontend application.
What’s Next?
In this post, we demonstrated how we can rapidly develop supervised machine learning pipelines for entity linking, evaluate performance, make new entity linkage predictions, and quickly create resolved entity ids that allow downstream processes to pull resolved person information. Here is a GitHub repository that should have everything needed to reproduce the example we just walked through, including a notebook for subsampling and formatting the source data, and a cypher script for ingesting into Neo4j. If you are new to Neo4j, I recommend using Neo4j Desktop with GDS for this example. Please use GDS version 1.7.2 or higher for compatibility. As stated above, if you want to get serious with this, it is probably a good idea to ingest more of the data and fine-tune the feature engineering, sampling, and hyper-parameter configuration. Please reach out and let me know how it goes for you or if you have questions related to Entity Resolution in Neo4j!Exploring Supervised Entity Resolution in Neo4j was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.