Reposted with permission from Brian Underwood – brian-underwood.blogs
In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get to the really fun stuff: today I show you how I tied the data together!
Tying the data together
The most obvious entities to link between StackOverflow and GitHub are the user profiles. Unfortunately, users don’t always have profiles on both sites, don’t fill out both profiles, or just don’t have perfectly matching information between the two. To this end I built record linkage tools which helped me make links between users.
My goal is to create a label which ties together nodes from GitHub and StackOverflow (and potentially other sites), as seen below:
My first step to linking records is to standardize data as much as possible. To this end I do a number of things:
- For StackOverflow users:
  - Pull the Twitter username to a twitter_username property if there is a URL which points to Twitter
  - Pull the GitHub username to a github_username property if there is a URL which points to GitHub
  - Extract all URIs from the about_me property to a uris property as an array (no URI protocol)
  - Extract any e-mail addresses from the about_me property to an emails property
  - Further reduce emails to a domains property
  - Make a usernames property with the GitHub and Twitter usernames as well as the StackOverflow display_name
- For GitHub users:
  - Extract the Twitter username from the blog property if it points to Twitter
  - Standardize the blog property to a uris property as an array (no URL protocol)
  - Further reduce uris to a domains property (no paths)
  - Make a usernames property with the GitHub and Twitter usernames
All of the above generated properties get downcased, stripped of extra whitespace, and stored without duplicates.
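That cleanup step might be sketched roughly like this (the helper name is my own, not from the original code):

```ruby
# Hypothetical sketch of the property cleanup described above:
# downcase, strip extra whitespace, and store without duplicates.
def normalize_values(values)
  values.map { |v| v.to_s.downcase.strip }  # downcase and trim whitespace
        .reject(&:empty?)                   # drop blank entries
        .uniq                               # store without duplicates
end

normalize_values(["  GitHub.com", "github.com", "Twitter.com ", ""])
# => ["github.com", "twitter.com"]
```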
After collecting domains, I run the following query to find any domains which have been used by more than one user, so that I can eliminate overly common domains:

query.match(u: :User).unwind(domain: 'u.domains').
  with(:domain, count: 'count(domain)').where('count > 1').
  pluck(:domain)
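For illustration, the same duplicate-domain extraction can be expressed in plain Ruby (this is my own sketch, not the post's code):

```ruby
# Sketch: find domains used by more than one user so they can be
# excluded from matching (mirrors the Cypher query above).
def common_domains(users)
  users.flat_map { |u| u[:domains].uniq }  # one entry per user per domain
       .tally                              # count users per domain
       .select { |_domain, count| count > 1 }
       .keys
end

common_domains([
  { domains: ["a.com", "shared.com"] },
  { domains: ["shared.com"] }
])
# => ["shared.com"]
```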
With the data standardized we can now easily do record comparison and scoring. Here I go through the millions of combinations of user pairings and give each of them a score. For each pairing I want to create a relationship in the database so that I don’t need to calculate the score again. Since I don’t want to have millions of relationships connecting every single user to every other user, I only create relationships above a score threshold.
I created a class called ObjectIdentifier which allows me to create rules for comparing and scoring records:
identifier = ObjectIdentifier.new do |config|
  config.default_threshold = 0.8
  config.default_weight = 1.0
  config.add_matcher :name, :display_name, &fuzzy_string_matcher
  config.add_matcher :login, :display_name, &fuzzy_string_matcher
  config.add_matcher :blog, :website_url, &fuzzy_string_matcher
  config.add_matcher :location, :location, &fuzzy_string_matcher
  config.add_matcher :login, :github_username, weight: 2.0, &exact_string_matcher
  config.add_matcher :twitter_username, :twitter_username, weight: 2.0, &exact_string_matcher
  config.add_matcher :domains, :domains, threshold: 0.98, &array_fuzzy_string_matcher
  config.add_matcher :usernames, :usernames, threshold: 0.98, &array_exact_string_matcher
end
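The matcher blocks referenced in that config aren't shown in the post; here is a guess at their shape, using a Sørensen–Dice bigram similarity as a stand-in for whatever string distance the real code used:

```ruby
# Hypothetical matcher blocks. A fuzzy matcher returns a 0..1 similarity;
# an exact matcher returns 0.0 or 1.0.
def bigrams(s)
  s.downcase.chars.each_cons(2).to_a
end

fuzzy_string_matcher = lambda do |a, b|
  return 0.0 if a.nil? || b.nil? || a.empty? || b.empty?
  x, y = bigrams(a), bigrams(b)
  return 0.0 if x.empty? || y.empty?
  2.0 * (x & y).size / (x.size + y.size)  # Dice coefficient over bigrams
end

exact_string_matcher = lambda do |a, b|
  a && b && a.downcase == b.downcase ? 1.0 : 0.0
end

fuzzy_string_matcher.call("Brian Underwood", "brian underwood")  # => 1.0
exact_string_matcher.call("cheerfulstoic", "CheerfulStoic")      # => 1.0
```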
You can see that there are anonymous functions for both fuzzy and exact string matching. With the resulting identifier object I can compare a pair of records to get the following result:
{
  [:name, :display_name] => 0.9769319492502884,
  [:login, :display_name] => 0.9769319492502884,
  [:blog, :website_url] => 0.0,
  [:location, :location] => 1.0,
  [:login, :github_username] => 0.0,
  [:twitter_username, :twitter_username] => 0.0,
  [:domains, :domains] => 0.0,
  [:usernames, :usernames] => 0.0,
}
Using this identifier object, I then compare every combination of users between GitHub and StackOverflow. With 2,312 StackOverflow users and 6,255 GitHub users there are 14,461,560 potential pairs, which can take some time.
For each pairing my script gets the result hash and calculates a total score as the sum of the hash values. If that score is above a threshold, it creates a Neo4j relationship to represent a hit. The relationship stores the score as well as a property with a JSON representation of the result for debugging.
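That scoring pass could be sketched like this (the method names, threshold value, and relationship-creation call are my assumptions, not the post's actual API):

```ruby
require 'json'

# Sketch of the pairing/scoring pass: sum the per-rule scores and only
# record pairs above a cutoff (the value here is hypothetical).
THRESHOLD = 3.0

def score_pair(result_hash)
  result_hash.values.sum
end

result = {
  [:name, :display_name] => 0.98,
  [:location, :location] => 1.0,
  [:blog, :website_url]  => 0.0
}

total = score_pair(result)
if total >= THRESHOLD
  # In the real code a Neo4j relationship is created here, storing the
  # score plus a JSON dump of the per-rule results for debugging, e.g.:
  # so_user.identities.create(gh_user, score: total, details: result.to_json)
end
```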
While comparing strings to strings I realized there was another aspect of the profile that I was missing: the profile photo. If I could do fuzzy comparisons against the user images I could narrow down the results even further. Fortunately there is a Ruby gem which provides an implementation of the pHash (perceptual hash) algorithm to do just that. So I wrote a quick script to download the profile photos locally and then another to compare the millions of combinations. With all of that downloading and pHashing, I found that I also needed a gem for parallel processing to complete in a reasonable amount of time.
So that I didn’t need to compare photos during my matching process, I simply created a relationship between users to store the result of the pHash comparison (a value between 1 and 15, where 1 is the closest perceived match).
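A 1–15 scale like that suggests a bit-distance measure; perceptual-hash comparison typically boils down to the Hamming distance between two hash integers, which might look like this (the hash values below are made up for illustration):

```ruby
# Sketch: Hamming distance between two perceptual hashes -- the number
# of differing bits. Smaller means the images are perceived as closer.
def hamming_distance(hash_a, hash_b)
  (hash_a ^ hash_b).to_s(2).count('1')  # XOR, then count the 1 bits
end

a = 0b1011_0010_1110_0001  # hypothetical pHash of photo A
b = 0b1011_0110_1110_1001  # hypothetical pHash of photo B
hamming_distance(a, b)  # => 2
```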
Using Neo4j I don’t need to worry about table structure or if I should create a collection or embed in existing objects. I can create whatever links between users that I need as I play with algorithms that will work for my purpose.
Man vs. Machine
To help guide my algorithms I used the only thing that I had available to me that could do a reasonable comparison of profiles: me. I turned my project into a Rails app which could show my two profiles side-by-side:
This view would show me a randomly chosen pair of users. 80% of the time those users were chosen because I had algorithmically determined some linkage between them. 20% of the time it would just show me a randomly chosen pair of users which hadn’t already been matched, to act as a control group. I created five links along the top of the page (“Definitely not”, “Probably not”, “Don’t know”, “Probably”, and “Definitely”) and threw in a bit of CoffeeScript to let me choose them via the arrow keys.
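The 80/20 selection could be sketched as follows (method and variable names are my own invention):

```ruby
# Sketch: 80% of the time show a pair the algorithm flagged as a likely
# match; 20% of the time show a random unmatched pair as a control group.
def next_pair(candidate_pairs, control_pairs, rng: Random.new)
  if rng.rand < 0.8 && !candidate_pairs.empty?
    candidate_pairs.sample(random: rng)  # 80%: a suspected match
  else
    control_pairs.sample(random: rng)    # 20%: a random control pair
  end
end
```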
In making decisions on hundreds of pairs of profiles, a number of questions occurred to me:
- How do I choose between the five levels?
- I made some rules to describe how I evaluate the evidence (by evidence I mean comparing name, location, photo, etc.). While I placed “Don’t know” in the middle, that was a special case which means that I don’t have enough evidence:
- two pieces of matching evidence mean “Probably”
- three pieces of matching evidence mean “Definitely”
- two pieces of contradictory evidence mean “Probably not”
- three pieces of contradictory evidence mean “Definitely not”
- Should I follow links and otherwise research the users?
- No. If I want to determine how well my matching algorithms work, I should use only the same data that they have.
- What about pictures?
- I happen to be a human and humans are pretty good at comparing photos, particularly photos of people. I was torn about this, but decided that if two photos were obviously not the same person, I should classify the profiles as different.
Analyzing the result
To figure out how well my classification algorithm worked I turned, of course, to R. Using RNeo4j I run a query to get the human score (from -2 for “Definitely not” to 2 for “Definitely”) and the computer score (for the rules I specified this runs from 0 to about 6):
query <- "MATCH ...
          RETURN ci.score, toInt(i.index)"
result <- cypher(graph, query)
plot(result[, 2], result[, 1], xlab = "Human Score", ylab = "ObjectIdentifier score")
…and plot the results…
This shows me that I should be able to relatively safely classify an ObjectIdentifier score of around 3 and up as the same user. A score of 2.5 and above looks promising, though looking in that range there are some false positives among users with a lower human classification. I suspect I can improve the correlation between the two by adjusting the algorithm’s scoring/weights.
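Those cutoffs could be applied with something like the following (the labels and exact thresholds are mine, drawn loosely from the discussion above):

```ruby
# Sketch: bucket a computed pair score using the cutoffs discussed above
# (3.0 and up = safe match, 2.5-3.0 = promising but needs review).
def classify(score)
  if score >= 3.0
    :same_user
  elsif score >= 2.5
    :needs_review
  else
    :unmatched
  end
end

classify(3.4)  # => :same_user
classify(2.7)  # => :needs_review
classify(1.0)  # => :unmatched
```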
That’s plenty to digest for now, so next time I’ll show the sorts of queries that can be run on the linked database! You can see all the code in the GitHub repository. If you want to talk more in person, feel free to drop by one of my upcoming speaking engagements!