(March Madness) <-[:MADE_SANE_WITH]- (Neo4j)


March GRAPHness


Download all the code needed to try it out for yourself HERE, or check out the GraphGist HERE.

March Madness is a rare concord of well-documented data and pop culture. Warren Buffett’s billion-dollar bet grabbed the interest of everyone from Wall St. quants to Silicon Valley engineers to armchair Moneyballers everywhere, and suddenly it paid off to be a big data geek.

It’s All Relative


To me, basketball is all about relationships. There are of course teams that are unambiguously better than others. However, there is nearly always some sort of relative performance bias, where a team performs better or worse than its average performance would project due to some confluence of factors: a team with an infamously brutal crowd of fans, a point guard who dissects your league-leading zone, or a decades-long rivalry that motivates your players to dig just a little more.

Performance is relative. These statistics are difficult to track across a single season and often incredibly difficult to track across time.

Secondly, iterating on that model is taxing, both in terms of writing the queries and in maintaining any reasonable performance on commodity hardware. I had a mountain of data from the past four seasons, including points scored, location, date and so on.

We could easily add more granular information or more historic data, but for no particular statistical reason and only because it made my life easier, I decided that in my model these relationships should churn almost entirely every four years (as current players graduate and move on).

Finally, we’re going to build our “win power” relationship between teams as a function of the Pythagorean Expectation model (more on that later).

STEP 1: Idea —> Graph Model


I am not a clever boy. However, I have several clever tools at my disposal, chief among them Neo4j. So, I started as I do all of my graphy projects: with the questions I planned to ask most frequently and a whiteboard (or a piece of paper in this case).

Which became…

[Figure: the March Madness graph model]


Which is a totally reasonable graph model for me to import data against.

STEP 2: Time


Before I loaded any data into Neo4j, I first needed to create the time-tree seen in the above model. One of Neo4j’s brilliant engineers (Thanks Mark!) did the heavy lifting for me and wrote a short Cypher snippet to generate the time-model I needed.

[Screenshot: Cypher snippet that generates the time tree]


The result is something like this:

[Screenshot: the resulting time tree]
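Since the snippet itself lives in a screenshot, here is a rough Python sketch of the shape a time tree takes: year nodes pointing at month nodes, month nodes pointing at day nodes, and each day linked to the next. The relationship names (HAS_MONTH, HAS_DAY, NEXT) and the function `build_time_tree` are assumptions made for illustration, not the names used in the original Cypher.

```python
from datetime import date, timedelta


def build_time_tree(start, end):
    """Return (nodes, edges) for a Year -> Month -> Day tree covering start..end inclusive.

    Nodes are tuples such as ("Day", 2015, 3, 30); edges are (from, type, to) triples.
    Relationship type names here are illustrative assumptions.
    """
    nodes = set()
    edges = set()
    previous_day = None
    current = start
    while current <= end:
        year = ("Year", current.year)
        month = ("Month", current.year, current.month)
        day = ("Day", current.year, current.month, current.day)
        nodes.update([year, month, day])
        edges.add((year, "HAS_MONTH", month))
        edges.add((month, "HAS_DAY", day))
        # Chain consecutive days so games can be walked in date order.
        if previous_day is not None:
            edges.add((previous_day, "NEXT", day))
        previous_day = day
        current += timedelta(days=1)
    return nodes, edges


nodes, edges = build_time_tree(date(2015, 3, 1), date(2015, 4, 30))
```

Each game can then hang off the day it was played, which makes "everything in the last four seasons" a walk through the tree rather than a scan over every game.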


STEP 3: my.csv —> graph.db


Neo4j ships with a very powerful ETL tool: Cypher’s “LOAD CSV” clause. We’re going to use that.

I downloaded a mess of NCAA scores, then surreptitiously converted them from Excel spreadsheets into CSV format. I’ve hosted the files in a public Dropbox found in the repo link above.

We’re bringing in several CSV files, each one representing a given season, and then sewing them all together based on team names.

[Screenshot: LOAD CSV import query]
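To make the "sewing together" idea concrete, here is a small Python sketch of the same merge done outside the database. It is not the LOAD CSV query from the post: the column names (`date`, `home`, `away`, `home_score`, `away_score`), the team names and the scores are all made up for illustration, and the real spreadsheets may be laid out differently.

```python
import csv
import io

# Made-up sample data standing in for one CSV file per season.
SEASON_CSVS = {
    2014: "date,home,away,home_score,away_score\n2014-01-04,Team A,Team B,62,58\n",
    2015: "date,home,away,home_score,away_score\n2015-01-10,Team A,Team C,71,75\n",
}


def load_games(season_csvs):
    """Merge per-season CSV text into one set of teams and one list of games.

    Teams are keyed by name, so the same name in two seasons is the same team.
    """
    teams = {}
    games = []
    for season, text in season_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            home = row["home"].strip()
            away = row["away"].strip()
            for name in (home, away):
                team = teams.setdefault(name, {"name": name, "seasons": set()})
                team["seasons"].add(season)
            games.append({
                "season": season,
                "date": row["date"],
                "home": home,
                "away": away,
                "home_score": int(row["home_score"]),
                "away_score": int(row["away_score"]),
            })
    return teams, games


teams, games = load_games(SEASON_CSVS)
```

Keying on the team name is the whole trick, and also the weak spot: a name spelled two ways across seasons would show up as two teams.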


STEP 4: History, Victory and a Little Math


I’ve decided to create a relationship between each team called :WINPOWER based on a concept from baseball called Pythagorean Expectation.

:WINPOWER essentially assigns a win probability based on points scored vs. points allowed. I added in a decay factor to weight more recent games more heavily than those played long ago.

[Screenshot: Cypher query that creates the :WINPOWER relationships]
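For anyone who wants the arithmetic without the query, here is a minimal Python sketch of a decayed Pythagorean Expectation. The classic baseball form is points_for^x / (points_for^x + points_against^x) with x = 2. The exponent default, the decay of 0.5 per season and the function name `win_power` are assumptions chosen for illustration, not the values used in the original Cypher. Basketball analysts generally use a much larger exponent than 2.

```python
def win_power(games, exponent=2.0, decay=0.5, current_season=2015):
    """Estimate the probability that team A beats team B.

    games: iterable of (season, points_for, points_against) from team A's side.
    Older seasons are down-weighted by decay ** (seasons ago); both the decay
    rate and the exponent are illustrative assumptions.
    """
    weighted_for = 0.0
    weighted_against = 0.0
    for season, points_for, points_against in games:
        weight = decay ** (current_season - season)
        weighted_for += weight * points_for
        weighted_against += weight * points_against
    if weighted_for == 0 and weighted_against == 0:
        return None  # no scoring data to compare
    pf = weighted_for ** exponent
    pa = weighted_against ** exponent
    return pf / (pf + pa)


# One recent 80-60 win: 6400 / (6400 + 3600)
recent_only = win_power([(2015, 80, 60)])

# Add a 60-80 loss from three seasons ago; decay keeps the recent win dominant.
with_history = win_power([(2015, 80, 60), (2012, 60, 80)])
```

With no decay (`decay=1.0`) the two games in the second example cancel out to an even 0.5, which is exactly the behavior the decay factor is there to avoid.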


STEP 5: The Big Payout


Who should win between Navy and Michigan St.?

[Screenshot: Cypher query comparing Navy and Michigan St.]


We see that our algorithm predicts (correctly!) that Michigan St. will defeat Navy:

[Screenshot: query result for Navy vs. Michigan St.]


Well…but what if they’ve never played each other? We can use the other teams they both played in common to determine a winPower:

[Screenshot: Cypher query computing winPower through common opponents]
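One simple way to read "teams they both played in common" is sketched below in Python. It averages, over every shared opponent, how much better one team did against that opponent than the other did, then maps the result back onto a 0 to 1 scale. This averaging rule and the function name `infer_win_power` are assumptions for illustration; the query in the post may combine the paths differently.

```python
def infer_win_power(win_power, team_a, team_b):
    """Estimate P(team_a beats team_b) from opponents both teams have faced.

    win_power[x][y] is the stored probability that x beats y.
    Returns None when the two teams share no opponents.
    """
    common = set(win_power.get(team_a, {})) & set(win_power.get(team_b, {}))
    common -= {team_a, team_b}
    if not common:
        return None
    # Average edge of team_a over team_b across shared opponents, in [-1, 1].
    margin = sum(win_power[team_a][c] - win_power[team_b][c] for c in common) / len(common)
    # Map the margin onto [0, 1]; 0.5 means no edge either way.
    return 0.5 + margin / 2


# Made-up win powers: A and B never met, but both played C and D.
sample = {
    "A": {"C": 0.9, "D": 0.7},
    "B": {"C": 0.5, "D": 0.3, "E": 0.6},
}
estimate = infer_win_power(sample, "A", "B")
```

In the graph this is a two-hop traversal from one team through each shared opponent to the other, which is the kind of query that stays cheap as more seasons are added.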


We see that Kentucky should (and did) beat Hampton!

[Screenshot: query result for Kentucky vs. Hampton]


// kvg

