Using graph structure record linkage on Irish census data with Neo4j

October 2, 2015

10 min read

See How We Used Graph Structure Record Linkage to Extract Insights on Irish Census Data with Neo4j

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents.

Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.

Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

The data model of the imported data is very linear:

In this post, however, I’m going to be focusing on Houses and Residents and creating relationships between them based on their properties.

Relations to the head

To view an example of what a census record from 1911 Ireland looks like you can have a look at the McCarthys of 1901 and 1911. Charles is the head of the family with his wife Hannah, mother Ellen, children (two in 1901 and seven in 1911), and a servant (Timothy Walsh in 1901 and William Regan in 1911).

The McCarthy family of Barnagowlane, Cloghdowell, Cork, 1901

The McCarthy family of Barnagowlane, Cloghdonnell, Cork, 1911

Surname	Forename	Age	Sex	Relation to Head	Religion	Surname	Forename	Age	Sex	Relation to Head	Religion
McCarthy	Charles	37	Male	Head of Family	Roman Catholic	McCarthy	Charles	47	Male	Head of Family	Roman Catholic
McCarthy	Hannah	25	Female	Wife	Roman Catholic	McCarthy	Hannah	35	Female	Wife	Roman Catholic
McCarthy	William	1	Male	Son	Roman Catholic	McCarthy	William	11	Male	Son	Roman Catholic
McCarthy	Bridget		Female	Daughter	Roman Catholic	McCarthy	Bridget	10	Female	Daughter	Roman Catholic
						McCarthy	Ellen	8	Female	Daughter	Roman Catholic
						McCarthy	Kate	6	Female	Daughter	Roman Catholic
						McCarthy	Florence	4	Male	Son	Roman Catholic
						McCarthy	Charles Peter	2	Male	Son	Roman Catholic
						McCarthy	Annie		Female	Daughter	Roman Catholic
McCarthy	Ellen	65	Female	Mother	Roman Catholic	McCarthy	? Ellen	75	Female	Mother	Roman Catholic
Walsh	Timothy	25	Male	Servant	Roman Catholic
						Regan	William	24	Male	Servant	Roman Catholic

The McCarthys are an almost exact match between two census records between 1901 and 1911. The names, ages, occupations, and relationships all match perfectly.

Unfortunately the story for other records is not so simple. Many times, houses – which to the human eye seem to be the same house – can have wildly varying details. For example, Hannah might go be listed as Hana or Anne in a different census.

Likewise, ages vary a lot more than you might think. In examining the records I regularly found ages varying by a year or two and have even found a few houses with ages off by as much as 10-15 years.

In both censuses, there is a field for residents to fill out called “Relation to Head.” This gives us information about how each resident is related to the head of the house. In the case of the McCarthys, Charles is listed as “Head of Family” in both years. The rest of the family has a nice representation of things that we often see in the data: “Wife,” “Son,” “Daughter,” and “Servant.”

We might be tempted to say, “This person was the head in 1901, so they must be the same person who was the head in 1911.” Often, however, the head of the family can die or retire leaving the role of head of the family to their wife or child.

So, can the “Relation to Head” values still be useful to us to match any given resident from 1901 to another resident in 1911?

First, let’s cover the general process of record linkage I have been using. To find a match for a resident, I start by using an Elasticsearch server (which contains a duplicate of my Neo4j census data) to quickly find a list of other residents with a match on very rough criteria:

Is the resident in the other census?
Does the sex match (or is it NULL)?
Is the resident’s age within 15 years of what it would be expected to be in the other census?
Does the name match, roughly (within an edit distance of 4)?

This comes back with anywhere from zero to hundreds of results. I call these “similarity candidates” and for each one, I create a relationship between the original record and the candidate.

With this list, I can compare the attributes of the two records (using the record_linkage gem I created) to see how closely they match.

The closer their name, sex, age, etc. matches, the higher score they get. Ideally, the real match should have the highest score, but that isn’t always true and can take some tuning.

In addition to this simple comparison of attributes, I have now added a process to take advantage of the similarity candidate relationships to compare family relationships.

Let’s start with this example of a sub-graph pattern:

The relationship CHILD_OF is created whenever there is a “Son” or “Daughter” in the “Relation to Head” field. Likewise, we can create other gender-neutral relationships like MARRIED_TO, SIBLING_OF, NIECE_NEPHEW_OF, etc….

In this case, the resident in question is the 1901 record for William. When we are evaluating the 1911 record of William as a potential match we can explore other residents in the same house as evidence of similarity.

The diagram above shows that both records have a CHILD_OF relationship to the two “Charles” records which furthermore are linked via a SIMILARITY_CANDIDATE relationship. Because of this we can say that there is a greater chance that the two “William” records represent the same person.

This only gives us the ability to find these relationships between the head of the family and other residents. What about generically matching based on the relationship of any two residents of a house?

Let’s say that Charles died sometime between 1901 and 1911. If his wife Hannah takes over as the head of the family, we would have a sub-graph which looks like this:

We could say that when we have the paths -CHILD_OF-><-MARIED_TO- and -CHILD_OF-> on either side that we can build our case for a match a bit more. This kind of matching can be used on all of the other residents of the house with SIMILARITY_CANDIDATE relationships.

For example, -CHILD_OF-><-CHILD_OF- could be matched to -CHILD_OF-><-CHILD_OF- even in this case, where the wife becomes the head of the house. Or if a child becomes the head then it could be compared to a -SIBLING_OF- relationship.

The code

So how do we actually do this with Neo4j? First let’s take our sub-graph and turn our nodes into variables:

In this example let’s take resident h1 r1 (house 1, resident 1) as the resident in question and h2 r1 as the candidate that we want to compare it to. This is the sort of query that Neo4j is wonderful at both performing quickly and making easy to formulate.

Let’s look at part of the Ruby code:

def get_similarity_candidate_relationship_paths
  self.query_as(:h1_r1)
    .match('(h1:House), (h2:House)')
    .match('h1<-[:LIVES_IN]-h1_r1-[sc_1:similarity_candidate]-(h2_r1)-[:LIVES_IN]->h2')
    .match('h1<-[:LIVES_IN]-h1_r2-[sc_2:similarity_candidate]-(h2_r2)-[:LIVES_IN]->h2')
    .match('path1=h1_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h1_r2')
    .match('path2=h2_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h2_r2')
    .pluck(
      :h2_r1,
      'collect([path1, rels(path1), path2, rels(path2)])'
      ).each_with_object({}) do |(r2, data), result|

    result[r2] = data.inject(0) do |total, (path1, rels1, path2, rels2)|
      relations1 = relation_string_from_path_and_rels(path1, rels1)
      relations2 = relation_string_from_path_and_rels(path2, rels2)

      if relations1 == relations2
        1.0
      elsif score = (RELATION_EQUIVILENCE_SCORES[relations1] || {})[relations2]
        score
      else
        -2.0
      end + total
    end
  end
end

Here we start with a Cypher query using the Query API from neo4j.rb. The object upon which we’ve called get_similarity_candidate_relationship_paths is our h1_r1 anchor.

Note here that we match paths with a length of either one or two relationships long from between two residents of the same house. Then we return all residents found via the SIMILARITY_CANDIDATE relationship from our anchor and the family relationship paths aggregated into an Array.

Once the Cypher query returns data we call relation_string_from_path_and_rels which is a way of transforming the path into a string like -BORN_TO-><-BORN_TO. This string gives us a simple way to express the path between the two residents as a string.

We then can give a score based on the two paths. If the paths are the same then we say that the score is 1.0. If the pair of paths is something like -BORN_TO-><-BORN_TO and -SIBLING_OF-> then we can give a score based on a lookup.

We add these scores up to give us a total score comparing our anchor resident and each of it’s similarity candidates. All with just one query to the database.

Challenges

There are a couple of things that I needed to do to make this work:

Previously, I was simply grabbing one resident at a time, finding all of the similarity candidates, and then creating a set of relationships to link the resident with the candidates and to store the record linkage scores (both the individual scores for fields and the total score).

However, this approach requires all of the candidates in the house to have SIMILARITY_CANDIDATE relationships in order to compare family relationships. So now I first process all residents for a house to create the similarity candidate relationships and store the record linkage scores and then go through them again with the graph-based comparisons and store that score and update the total.

Beyond that, there is the conceptual problem of determining the scoring when comparing paths. For example, if somebody was BORN_TO the head one year but their spouse takes over as the head, could we say that they’re BORN_TO the spouse if they are are a step-child? Family relationships are complicated and don’t always fit neatly into our properties and algorithms.

Conclusion

Most record linkage focuses on the properties of an object, but we need to remember that relationships are data about our entities too. With Neo4j, we have a powerful tool for analyzing those data relationships naturally and quickly.

Additionally I have found that the ability to create relationships on the fly to aggregate calculations like the ones discussed above is a wonderful way to find the best solution quickly.

Want to learn more about how to use Neo4j for your project? Click below to get your free copy of the Learning Neo4j ebook and catch up to speed with the world’s leading graph database.

Learn Neo4j Today