Exploring supervised entity resolution in Neo4j

AI Research Engineer, Neo4j

November 5, 2021

26 min read

While Supervised Entity Resolution (ER) can be immensely valuable, it is sometimes difficult to apply and scale in the real-world enterprise setting.

In this post, I explore how the Neo4j Graph Data Science (GDS) library can be applied to rapidly develop supervised ML pipelines for ER and walk through an example to demonstrate how it could be applied to your own data.

GDS is designed to be intuitive and easy to use, reducing friction for graph analytics so that junior and senior data scientists alike can make rapid progress, scale, and deliver value quickly.

We will start this post with a quick overview of Entity Resolution (ER), talk about ER in Graph, then dive into an example where we will use GDS Link Prediction Pipelines to train an entity linkage model and predict new entity links in the graph.

From there we will go over quick procedures for generating resolved entity ids and query resolved entities out of the graph. At the end, I will link you to follow-up resources, including a GitHub repository that should have everything you need to reproduce the example.

Note: This blog was initially written with GDS version 1.7. While the workflow for the latest GDS 2.x should be the same, some of the procedure names and syntax have changed. If you are using GDS 2.x, the GitHub repository contains a compatible version of the code. Please see the Readme to locate it.

What is Entity Resolution (ER)?

Entity Resolution (ER) is the process of disambiguating data to determine if multiple records actually represent the same real world entity such as a person, organization, place, or other type of object.

For example, say you have information on persons coming from different e-commerce platforms. They may have slightly different contact information, with addresses formatted differently, using different forms/abbreviations of names, etc.

A human may be able to tell if the records actually belong to the same underlying entity, but given the number of possible combinations and matching that can be had, there is a need for an intelligent automated approach to doing so, which is where ER systems come into play.

Why is ER important for data science?

We see application for ER across industries, including Online Advertising, Marketing, Retail, Anti-Money Laundering, and Law Enforcement.

Oftentimes our analysis requires integrating multiple data sources together, and we aren’t always in control of the quality and consistency of those data sources. For example, integrating additional e-commerce information with Customer Relationship Management (CRM) data can provide better understanding of a customers preferences, which could result in better user experiences, product recommendation, and ultimately sales revenue. Being able to correctly identify and merge entities will be key to providing actionable insights and business value — otherwise any modeling and analysis we do on users could be incomplete and suffer from poor accuracy and biases.

In other situations we may have just have one data source where identities are purely transient and/or severely obfuscated. One example of this is web browsing activity, where users may be browsing on the go, potentially on multiple devices without a traceable persistent id from account login.

Without some sort of ER solution, an online advertiser would have a limited and fragmented view of user activity and preferences which could significantly hinder their ability to accurately target and drive engagement.

ER in graph

Graphs are well suited for many ER problems because they can represent associated information between subjects with paths made up of nodes and relationships. The resulting connectivity between subjects can be used to make entity resolution inferences.

If we believe two subjects actually represent the same underlying entity, we can represent it with a relationship in the graph. I will refer to this pairwise matching as “Entity Linking” and label it with a “SAME_AS” relationship type.

In cases where we believe more than two subjects belong to the same entity, we can draw multiple entity linkage relationships. This forms small connected subgraphs, or components, where each disjoint component represents a single underlying entity.

Assigning a unique key to each component gives us a resolved entity id that we can then use to query aggregate resolved entity information. It is not 100% necessary in every circumstance, but it is a helpful id to have if you want to move information to downstream processes outside the graph.

It can be helpful to think about this ER workflow in these two distinct steps:

Linking step, where you do pairwise entity linking, and the
Resolution step, where you identify resolved entities via the linkages so downstream processes can easily access information aggregated by resolved entity.

The mechanism used for the resolution step is conceptually straightforward, though not always easy to implement depending on the underlying system. In our example below, we will go over an algorithm called Weakly Connected Components (WCC) that can be used to help create the resolved entity ids. Luckily, Neo4j and the Graph Data Science (GDS) Library are extremely well optimized to perform this step at scale.

The Mechanism used for Entity Linking, on the other hand, can vary greatly by use case and project maturity. In some cases, the relationships that determine entity linkages will be known and provided ahead of time, allowing you to jump directly to the resolution step. Other times, there may be a set of deterministic business rules governing the process, where entity linkages are determined based on matching known connection patterns in the graph. This can be a common starting point for many ER projects and it can work well for simpler, more deterministic, ER problems. In these cases, queries can be used to scan the graph and determine where entity linkages should be inserted. In yet other cases where it is difficult to determine a known set of business rules, similarity algorithms, such as K-Nearest Neighbor (KNN) or Jaccard based Node Similarity, can be used to score potential pairwise entity linkages in an unsupervised manner. Depending on the techniques used, this can leverage a combination of both relationships and node attribute information.

If you are fortunate enough to have a set of pre-identified, known, entity linkages, perhaps manually identified by an analyst or subject matter expert, then you also have the option of taking a supervised approach, which can be extremely powerful. This is what we are going to talk about today. In graph, we can use Link Prediction to solve this problem, which we will cover below. Like similarity scoring, this can leverage both relationship and node attribute information.

If you are interested in more ER use cases for Neo4j outside of Link Prediction, I encourage you to review this use case paper which go over these at a high level.

As with most things in Data Science and ML, there are many other variants of this approach that one can take for ER and deduplicating entities, many of which involve merging subjects together or creating new resolved entity objects. However, the pattern explained above can be a great place to start due its simplicity, maintainability, and separation from the original source data. Because original nodes are never merged/deduplicated and new nodes are never created, you can be agile from a Data science perspective, easily updating, backtracking, redoing, or dumping your ER analysis as needed without having to worry about reversing changes to original subject data or costly mutations to resolved entity objects.

Cross device entity linking example

To demonstrate, we will use a codalab challenge dataset focused on cross device entity linking. The dataset contains a browsing log for a set of anonymized userIDs at the level of devices and corresponding click-streams. User clicks are linked to site URLs and HTML titles. They also release a subset of known existing entity linkages between userIds, which is intended for use in supervised learning. The challenge is to predict new entity linkages, effectively resolving the same user across multiple devices. This type of entity linking is useful for marketing and online advertising since generating more complete user profiles can enable better targeting. As explained in the challenge, persistent user ids are not always available, so companies often rely on these weaker device level or browser cookie ids. If you are interested you can find more details about the dataset and challenge here.

One interesting thing about this dataset is that it does not have any user identifiers outside the transient userId. I think this is actually somewhat unique for an ER problem, usually you have some other identifiers, like usernames, email addresses, etc. In this case, we will need to train a model to predict underlying person identities just based on user browsing behavior.

Importing into Neo4j graph

For this example I am just using 10% of the training set provided from the competition. I recommend keeping it around there for experimentation on the average laptop. That said, I haven’t had any issues loading in more of the data so long as the hardware limitations on your machine, particularly RAM, can handle the additional data. If you are interested in how to optimize GDS for analyses like these on your machine, please see the GDS system requirements for a high level, particularly around setting heap space. The GDS configuration guide covers this in even deeper detail.

There are a couple different ways to construct a graph out of this data, and in fact, if you are interested in these sorts of problems, I would encourage you to try other schemas. But for sake of a quick example, we will go with the following, just leveraging the URLs (not even touching the HTML titles):

There are two node labels, a User for each unique userId, and a Website for each unique URL. A VISITED relationship represents a user event with the URL. The SAME_AS relationships represent the provided known entity linkages between userIds.

Web hierarchy and CHILD_OF relationships

I am also making a CHILD_OF relationship to capture the hierarchical structure of the Websites. In the Codalab dataset, the words within each URL segment and query string are hashed to preserve anonymity, but the ordering is preserved. Below is an example — notice the forward slashes and question mark denoting the path segments and query string respectively:

Example from https://competitions.codalab.org/competitions/11171

As a result, if two URLs share a common domain or partial path pre-anonymization, they will also share it post-anonymization. This allows us to create a hierarchical structure like this:

This structure reflects how web pages relate together in the real world, with high-level website domains and other sub pages underneath those. This structure adds richer interconnected context to the graph which will further help our model correlate user behavior.

Exploring the graph

I will spare you much of the initial EDA, but here are a couple quick queries of the graph just to give you an idea for what things look like.

Aggregate Tallies: Loading approximately 10% of the CodaLab dataset, we get almost 2.5 million total nodes and a bit over 5 million total relationships. The great majority of the nodes are Websites with only ~34k Users. The relationships are split mostly between ~2.8M VISTED and ~2.3M CHILD_OF relationships, with only ~5K SAME_AS relationships.

$ CALL apoc.meta.stats() YIELD stats RETURN stats.nodeCount AS nodeCount, stats.relCount as RelationshipCount;

╒═══════════╤═══════════════════╕
│"nodeCount"│"RelationshipCount"│
╞═══════════╪═══════════════════╡
│2433861    │5176804            │
└───────────┴───────────────────┘

$ CALL apoc.meta.stats() YIELD labels RETURN labels.User AS userCount, labels.Website AS WebsiteCount;

╒═══════════╤══════════════╕
│"userCount"│"WebsiteCount"│
╞═══════════╪══════════════╡
│33941      │2399920       │
└───────────┴──────────────┘

$ CALL apoc.meta.stats() YIELD relTypesCount RETURN relTypesCount.VISITED AS `VISITED Count`, relTypesCount.CHILD_OF AS `CHILD_OF Count`, relTypesCount.SAME_AS AS `SAME_AS Count`;

╒═══════════════╤════════════════╤═══════════════╕
│"VISITED Count"│"CHILD_OF Count"│"SAME_AS Count"│
╞═══════════════╪════════════════╪═══════════════╡
│2841391        │2330404         │5009           │
└───────────────┴────────────────┴───────────────┘

Connected Browsing Activity: Users can be connected through browsing activities in multiple ways. They can be connected directly by visiting the exact same URL/Website but also in multi-hop ways by visiting websites linked through the hierarchical CHILD_OF relationships. Below is a good example:

MATCH (u)-[v:VISITED]->(w:Website) 
WHERE u.userId in ["1b13bc4a2b79448c462e4f3817e6d470", "f05fe874d83d346f4993ce5ea872f20b"]
RETURN u,w,v;

SAME AS Relationships: Sampling the same-as relationships it appears most components just consist of pairs of two userIds, with fewer components of three and more.

MATCH (u1:User)-[r:SAME_AS]->(u2:User) RETURN u1,u2 limit 400;

We can calculate the component sizes globally across the graph using Cypher and the GDS Weakly Connected Components (WCC) algorithm. You will see that the majority of components consist of single userIds (no same-as relationships) with 2,995 (a little over 10%) having a size of 2 userIds, around 413 (~1%) with 3 userIds, and less than 1% of the components having more than that. The largest component has 8 userIds.

CALL gds.graph.create('same-as-subgraph','User', 'SAME_AS');

CALL gds.wcc.write('same-as-subgraph', {writeProperty: 'entityId'});

MATCH(u:User) 
WITH DISTINCT u.entityId as eid, count(*) as componentSize 
RETURN distinct componentSize, count(*) as numberOfComponents 
ORDER BY componentSize;

╒═══════════════╤════════════════════╕
│"componentSize"│"numberOfComponents"│
╞═══════════════╪════════════════════╡
│1              │26259               │
├───────────────┼────────────────────┤
│2              │2995                │
├───────────────┼────────────────────┤
│3              │413                 │
├───────────────┼────────────────────┤
│4              │81                  │
├───────────────┼────────────────────┤
│5              │18                  │
├───────────────┼────────────────────┤
│6              │4                   │
├───────────────┼────────────────────┤
│7              │1                   │
├───────────────┼────────────────────┤
│8              │1                   │
└───────────────┴────────────────────┘

Pre-processing & feature engineering

A couple steps before we create the the Pipeline:

First, we need to create a graph projection for the pipeline to operate on. We did this above really quickly when we ran the WCC algorithm. This time we are going to project the entire graph. Graph projections copy nodes and relationships from the database to an analytics environment optimized for global traversals where we can run algorithms and machine learning models quickly over the entire graph or large portions of it.

//create named graph - project entire graph with undirected relationships
CALL gds.graph.create(
  'er-projection',
  ['User', 'Website'],
  {
    SAME_AS: {
      type: 'SAME_AS',
      orientation: 'UNDIRECTED'
    },
    CHILD_OF: {
      type: 'CHILD_OF',
      orientation: 'UNDIRECTED'
    },
    VISITED: {
      type: 'VISITED',
      orientation: 'UNDIRECTED'
    }
  }
) YIELD nodeCount, relationshipCount, createMillis;


╒═══════════╤═══════════════════╤══════════════╕
│"nodeCount"│"relationshipCount"│"createMillis"│
╞═══════════╪═══════════════════╪══════════════╡
│2433861    │10353608           │931           │
└───────────┴───────────────────┴──────────────┘

When we specify orientation as ‘UNDIRECTED’ we effectively create an undirected graph for the projection. This distinction is important for upcoming algorithms, particularly for Link Prediction, which expects undirected relationships.

Next, we are going to do some feature engineering — specifically we are going to generate node embeddings. It is usually best to generate features inside the pipeline if you can, as the steps are automatically tracked with the data splitting and model evaluation. However, in this specific instance, we will be further filtering the graph during model training and the relationships and nodes we need will be filtered out.

We will use a Node Embedding technique called Fast Random Projection or “FastRP” for short. If you are interested in the technical details, please check out the documentation. To summarize for our intents and purposes, the job of FastRP is to take the complex interconnected structure of the user browsing activity and web hierarchy, and translate it into fixed length numeric vectors, one for each node in the graph. Executing FastRP successfully here means that users with similar browsing activity will have embedding vectors that are relatively close, while users with very different browsing activity will have relatively distant embeddings. You can imagine just how useful such features will be for model training.

Of the multiple node embeddings techniques available in GDS, FastRP is chosen here because it it is relatively easy to execute and also VERY performant, taking only seconds to complete for this dataset. I did adjust the iteration weights and dimensionality a bit from defaults in attempt to capture more of the rich graph structure. This was just done by hand (could probably be optimized further).

//create fastRP embeddings based on VISITED and CHILD_OF relationships
CALL gds.fastRP.mutate(
  'er-projection',
  {
    mutateProperty: 'embedding',
    relationshipTypes: ['CHILD_OF' , 'VISITED'],
    iterationWeights: [0.0, 1.0, 0.7, 0.5, 0.5, 0.4],
    embeddingDimension: 128,
    randomSeed: 7474
  }
) YIELD nodePropertiesWritten, computeMillis;

╒═══════════════════════╤═══════════════╕
│"nodePropertiesWritten"│"computeMillis"│
╞═══════════════════════╪═══════════════╡
│2433861                │11130          │
└───────────────────────┴───────────────┘

Configuring the LP pipeline

We will use Link prediction in GDS to accomplished supervised entity linking. As the name implies, Link Prediction is a machine learning technique to predict whether or not a relationship should exist between two nodes. A training example consists of input features generated from two nodes, with a binary class output representing whether or not a link exists between them.

GDS supports a Pipeline for Link Prediction. The Pipeline enables you to configure a series of steps for engineering features, data sampling and splitting, and hyper-parameters for training, in addition to procedures for prediction. It abstracts away many of the in-the-weeds technical challenges traditionally associated with link prediction, including sampling techniques for severe class imbalance and dealing with data leakage problems. As an ML-pipeline, it also saves your steps and the resulting trained model, making it easier to track and version your work.

To create the pipeline for our example we can execute the calls below:

Instantiate the pipeline

CALL gds.alpha.ml.pipeline.linkPrediction.create('er-pipe');

2. Calculate link features, specifically features that represent a measure of distance between the FastRP embeddings. As stated above, users with similar browsing activity should also have close embeddings, so the distance metric can provide a really good signal for our model. In this case I am adding two variants of distance, Least Squares (L2) and Cosine link feature combiners.

//add L2 link feature
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature(
  'er-pipe',
  'l2', 
  {nodeProperties: ['embedding']}
);

//add cosine link feature
CALL gds.alpha.ml.pipeline.linkPrediction.addFeature(
  'er-pipe', 
  'cosine', 
  {nodeProperties: ['embedding']}
);

3. Configure data splitting.

CALL gds.alpha.ml.pipeline.linkPrediction.configureSplit(
  'er-pipe',
  {
    testFraction: 0.3,
    trainFraction: 0.7,
    negativeSamplingRatio: 10,
    validationFolds: 5
  }
);

We accomplish a couple things here. First, we decide on split ratio between training and test sets, just like a traditional ML workflow. A big difference to be aware of in this pipeline is the introduction of a third, feature input, set. If you allocate trainFraction=0.4 and testFraction=0.2&nbsp;, then the remaining 1 — 0.4 — 0.2 = 0.4 or 40% will be allocated to the feature input set. The Feature input set is used to avoid data leakage while generating node properties inside the pipeline, such as centrality, node embeddings, or other similar metrics. We will not really be leveraging it in this example (our embeddings do not leverage the SAME_AS relationships), but it is critical to use when you are generating node properties off the links you are trying to predict.

Second, we configure the negativeSamplingRatio. Negative examples in this context are any pairs of user nodes which do not already have a SAME_AS link between them. The negative sampling ratio dictates the rate at which these pairs are sampled relative to the known SAME_AS pairs during training. Link Prediction problems tend to be highly imbalanced with way more negative examples possible in the graph than positive ones — it is an O(n²) problem.

Neo4j’s recommended value for negativeSamplingRatio is the true class ratio of the graph. However, the higher the negative sample ratio is, the larger your dataset for training and testing becomes, which translates to longer evaluation time. So for larger graphs, a tradeoff exists between evaluation accuracy and speed, and I have personally found that setting the negative sampling rate to the true sample ratio is not always feasible as the graph grows in size.

In this example, the true class would be calculated as:

true_class_ratio = 
(q - Number_Of_SAME_AS_Rels) / Number_Of_SAME_AS_Rels 

where q = Number_Of_Users * (Number_Of_Users-1)/2

Which comes to almost 115,000. Here I chose negativeSamplingRatio=10 as it evaluates relatively quickly for purposes of running through an example and demonstrating how things work.

Hyper-parameters and training configuration. These cover many of the ordinary things you would expect for ML training — i.e. bias features, penalties, batch sizes, tolerance, epochs etc.. Each set of parameters below will correspond to a model. All of the models will be trained and the best performing one will be automatically selected. For this I am just adjusting the penalty between the three models.

//configure model parameters
CALL gds.alpha.ml.pipeline.linkPrediction.configureParams('er-pipe', [
  {
      penalty: 0.0,
      patience: 3,
      maxEpochs: 1000,
      tolerance: 0.00001
  },
  {
      penalty: 0.01,
      patience: 3,
      maxEpochs: 1000,
      tolerance: 0.00001
  },
    {
      penalty: 1.0,
      patience: 3,
      maxEpochs: 1000,
      tolerance: 0.00001
  }
]);

Training the LP model

When we run the training step, we filter the graph down to just Users and SAME_AS relationships, as this is what we are trying to predict.

//train the model 
CALL gds.alpha.ml.pipeline.linkPrediction.train( 'er-projection', {
    modelName: 'entity-linkage-model',
    pipeline: 'er-pipe',
    randomSeed: 7474,
    concurrency: 4,
    nodeLabels: ['User'],
    relationshipTypes: ['SAME_AS'],
    negativeClassWeight: 1.0/10.0
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
  modelInfo.metrics.AUCPR.test AS testGraphScore;

╒═════════════════════════════════════════════════════════════════╕
│ "winningModel"            │ "trainGraphScore" │"testGraphScore" │
╞═════════════════════════════════════════════════════════════════╡
│ {..."penalty":1.0,...}    │ 0.937             │ 0.942           │
└─────────────────────────────────────────────────────────────────┘

The scores we get back are Area Under the Precision Recall Curve (AUCPR) scores. AUCPR is a metric bound between [0,1] which is particularly well-suited for scoring models on imbalanced data sets. In this case, we want to identify as many of the underlying entity links as possible while keeping the ratio of false positives low, and this is exactly what Recall and Precision measure respectively. Generally, the more entity linkages we want to identify, the wider net we need to cast, and the higher the ratio of false positives will become. This tradeoff tends to be expensive by default in severely imbalanced datasets where positive observations rarely occur. One way to think about AUCPR is that as it gets closer to 1, the tradeoff gets less expensive, i.e. we can identify more of the entity links with less of an increase in false positive rate.

It is critical to note that the way we set negativeSampleWeight has a big effect on the AUCPR score and it’s interpretation. Generally speaking, there are two approaches to setting the negativeSampleWeight

Traditional Evaluation Approach: Set the negativeSampleWeight to the reciprocal of the negativeSamplingRatio so that negativeSamplingRatio * negativeSampleWeight = 1 and the total probability mass in the test set is balanced 1-to-1. This is what we did above. In this case, AUCPR reflects how the model performs on a balanced dataset.
True Class Ratio Approach: Set negativeSampleWeight such that the total probability mass reflects the true class ratio, i.e. negativeSamplingRatio * negativeSampleWeight = trueClassRatio. This AUCPR can more accurately reflect how the model predicts on new unseen data, but the score will almost always be drastically smaller, so temper expectations.

Here is what the AUCPR looks like in our example using the second approach:

CALL gds.alpha.ml.pipeline.linkPrediction.train( 'er-projection', {
    modelName: 'entity-linkage-model-imb',
    pipeline: 'er-pipe',
    randomSeed: 7474,
    concurrency: 4,
    nodeLabels: ['User'],
    relationshipTypes: ['SAME_AS'],
    negativeClassWeight:11499
}) YIELD modelInfo
RETURN
  modelInfo.bestParameters AS winningModel,
  modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
  modelInfo.metrics.AUCPR.test AS testGraphScore;

╒═════════════════════════════════════════════════════════════════╕
│ "winningModel"           │ "trainGraphScore" │ "testGraphScore" │
╞═════════════════════════════════════════════════════════════════╡
│ {..."penalty":0.01,...}  │ 0.045             │ 0.052            │
└─────────────────────────────────────────────────────────────────┘

This lower score suggests that our model may lack precision when predicting overall node pairs in the graph (i.e. bring back a lot of false positives). While I won’t be tackling that in-depth in this post, there are multiple things we could adjust to try and improve this, including bringing in more of the source data, increasing the negative sampling rate, or tuning the feature engineering and/or hyper-parameters. That being said, some of these issues are difficult to avoid in ER and ultimately you need to analyze the business cost for false positives and false negatives within the context of your specific problem — i.e. how many false positives can be tolerated to identity a true entity linkage? And conversely, what is the cost of missing a true entity linkage? These costs may vary greatly between different business applications — for example, the cost of false positives may be significantly higher in marketing applications than in, say, anti-fraud.

Additionally, it is not always straightforward which of the weighting approaches is best to optimize for an ER problem. Often in ER, datasets are only partially labeled, i.e. there may be many unidentified entity linkages present in the current data. In this context, some lack of precision may not be that bad if it helps identify more of the unknown entity linkages and uncover patterns not yet fully understood by analysts. Again, it really depends on the problem domain and the use case at hand.

Creating entity linkage predictions

Once the model is trained we can use it to try and predict new entity linkages in the graph. In the below command we scan the graph for the top 20 most probable missing entity links (a.k.a — highest predicted probabilities according to the model) and create the predicted links in the projected graph. (Note: As all potential user pairs in the graph are scanned, this step may take a while to complete. For this example, it takes me ~5–10 minutes).

//make link predictions
CALL gds.alpha.ml.pipeline.linkPrediction.predict.mutate('er-projection', {
  modelName: 'entity-linkage-model',
  mutateRelationshipType: 'SAME_AS_PREDICTED',
  nodeLabels: ['User'],
  relationshipTypes: ['SAME_AS'],
  topN: 20,
  threshold: 0.0
});

Once the above is complete we can easily write the links back to the database with the predicted probabilities.

//write predicted relationships back to DB
CALL gds.graph.writeRelationship('er-projection', 'SAME_AS_PREDICTED', 'probability');

When we write undirected relationships back to the database we will get 2 relationships between each node, one for each direction. In this application we do not care about direction so these are effectively duplicates. We can deduplicate them with the below:

MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) 
WHERE id(u1) < id(u2) DELETE r

Now we can visualize the predicted entity linkages:

MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) RETURN u1,r,u2;

We can also see the sorted predicted probabilities for each link going from one user node to the other. User nodes with multiple predicted links will be listed multiple time in the output.

//show predicted entity links ordered by probability
MATCH (u1:User)-[r:SAME_AS_PREDICTED]->(u2:User) 
RETURN u1.userId AS user1, u2.userId AS user2, r.probability as entityLinkageProbability
ORDER BY entityLinkageProbability DESC

Generating resolved person IDs and querying resolved person info

We can generate resolved person ids with both the SAME_AS and SAME_AS_PREDICTED relationships using the Weakly Connected Components (WCC) algorithm. we will use the GDS write API to write these ids directly back to the database.

CALL gds.wcc.write(
  'er-projection', {
  nodeLabels: ['User'],
  writeProperty: 'personId'
  }
) YIELD componentCount, nodePropertiesWritten, writeMillis, computeMillis;

╒════════════════╤══════════════════╤═════════════╤═══════════════╕
│"componentCount"│"nodePropsWritten"│"writeMillis"│"computeMillis"│
╞════════════════╪══════════════════╪═════════════╪═══════════════╡
│29760           │33941             │37           │33             │
└────────────────┴──────────────────┴─────────────┴───────────────┘

We can then easily query and aggregate information by these ids. Cypher is extremely performant when it comes to relationship traversals, so oftentimes these queries can be replicated using just the SAME_AS* relationship types, but the id is still convenient to have. For example, in the below query we return aggregated browsing activity for all persons that had a newly predicted entity link.

//get all resolved persons who have a newly predicted entity link
MATCH (n:User)-[:SAME_AS_PREDICTED]-() 
WITH DISTINCT n.personId AS personIdList

//get all browsing activity for those persons
MATCH (n:User)-[r:VISITED]->(w:Website) 
WHERE n.personId IN personIdList

//return each person with userIds and summary browsing activity
WITH n.personId as personId, collect({
  website:w.url,
  firstTimeStamp: r.timeStamps[0],
  lastTimeStamp: last(r.timeStamps)}) AS webActivity, collect(DISTINCT n.userId) AS userIds
RETURN personId, webActivity, userIds ORDER BY personId DESC

And that will give results like those below, which includes a resolved person id, their aggregated browsing activity, and all their userIds:

[
  {
    "personId": 10644,
    "webActivity": [
      {
        "firstTimeStamp": "2016-06-06T22:10:53.779000000Z",
        "lastTimeStamp": "2016-06-06T22:11:23.779000000Z",
        "website": "bddd393cb02101a/203b292b93c374e5"
      },...
    ],
    "userIds": [
      "3005d862d7f6a60f2a19fa10a70dd24d",
      "3e4217f722986af322333746f5f132c0",
      "50e9121a2f9f30c83b071e9adf9951c3"
    ]
  },...
]

These sorts of queries can be incredibly useful for downstream processes that need comprehensive resolved views for persons, whether that be for further analytics, reporting, or a frontend application.

What’s next?

In this post, we demonstrated how we can rapidly develop supervised machine learning pipelines for entity linking, evaluate performance, make new entity linkage predictions, and quickly create resolved entity ids that allow downstream processes to pull resolved person information.

Here is a GitHub repository that should have everything needed to reproduce the example we just walked through, including a notebook for subsampling and formatting the source data, and a cypher script for ingesting into Neo4j.

If you are new to Neo4j, I recommend using Neo4j Desktop with GDS for this example. Please use GDS version 1.7.2 or higher for compatibility.

As stated above, if you want to get serious with this, it is probably a good idea to ingest more of the data and fine-tune the feature engineering, sampling, and hyper-parameter configuration. Please reach out and let me know how it goes for you or if you have questions related to Entity Resolution in Neo4j!

Exploring Supervised Entity Resolution in Neo4j was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.