
Analyzing Roland Garros and US Open Tennis Tournaments Via Neo4j

Ali Emre Varol

Data Scientist and Certified Neo4j Professional

July 5, 2022


Neo4j: Simple Joyful Traversals

Tennis is, in my opinion, one of the most graceful and outstanding sports. I love watching buttery slices, tweeners, marriage proposals, imitations, and impersonations; my favorite impersonator is Djokovic. One unforgettable funny moment in a tennis tournament was the marriage proposal made to Steffi Graf and her response.

Outline

Introduction and Motivation
Dataset
The Graph
Analysis
– Finals of the tournaments
– Champions and runners-up of the tournaments
– Players who were the runners-up in the previous year and the winners the following year
– Players who lost before QF in the previous year but won the tournament the following year
– Players who have been champions at least twice
– Tournament winning streaks
– Players who have been runners-up at least twice and streaks
– Sweepers in the finals
– Route to trophy
Conclusion
References

Introduction and Motivation

Four major tennis tournaments, called the Grand Slams, are held each year, and both ATP and WTA players compete in them. The Grand Slam tournaments and their planned dates are as follows:

Australian Open (January)
Roland Garros (French Open) (May — June)
Wimbledon (June — July)
U.S. Open (August — September)

Grand Slam tournaments last two weeks, and in the second week — when the fourth round matches, quarterfinals, semifinals, and finals occur — the quality of play is generally better.

In 2022, Roland Garros ran from May 16 to June 5, and the U.S. Open is scheduled for August 29 to September 11. In this blog post, I will review the Roland Garros and U.S. Open tournaments from 2000 to 2021 with the help of Neo4j. I wanted to analyze all four Grand Slams, but I will explore only these two because of the limits of the free AuraDB tier.

Learning tennis terms is almost as tricky as learning to play tennis. Many of the terms are words we rarely hear in daily life and are difficult to understand at first. Because watching tennis tournaments is more enjoyable once you’ve mastered the terminology, I will share as many terms as we need in this blog post. Without further ado, let’s hit the ball 🥎.

A game in tennis is won by the first player to win at least four points with a lead of at least two. The points are called as follows:

No points scored: Love
1 point scored: 15
2 points scored: 30
3 points scored: 40
4 points scored: Game (game over, provided the player leads by two)

To win a game, a player must lead by at least two points. If the score is tied at 40–40 (deuce), play continues until one player wins two points in a row: the first gives that player the advantage, and the second wins the game. If the player with the advantage loses the next point, the score returns to deuce.
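The game-scoring rules above can be sketched as a small function. This is only an illustration of the rules; the function name and the "server"/"receiver" wording are my own choices and are not part of the dataset or the notebook.

```python
POINT_NAMES = ["Love", "15", "30", "40"]


def call_score(server_points, receiver_points):
    """Return the spoken score of a game from the raw points won by each side."""
    lead = server_points - receiver_points
    # The game is over once someone has at least 4 points and leads by 2.
    if max(server_points, receiver_points) >= 4 and abs(lead) >= 2:
        return "Game server" if lead > 0 else "Game receiver"
    # From 40-40 onward, only deuce and advantage are called.
    if server_points >= 3 and receiver_points >= 3:
        if lead == 0:
            return "Deuce"
        return "Advantage server" if lead > 0 else "Advantage receiver"
    return f"{POINT_NAMES[server_points]}-{POINT_NAMES[receiver_points]}"


print(call_score(2, 1))  # 30-15
print(call_score(3, 3))  # Deuce
print(call_score(4, 3))  # Advantage server
print(call_score(5, 3))  # Game server
```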

A set is won when a player has won at least six games with a two-game advantage over the opponent. For example, a set can end 6–0, 6–1, or 6–4, but not 6–5. When the score is tied at 5–5, a player must win two consecutive games to take the set 7–5; otherwise the score reaches 6–6, where the set is decided either by a tiebreak (7–6) or, in formats without one, by playing on until someone leads by two games (for example, 8–6).
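As a rough check of the set rule, the sketch below decides whether a games score ends a set. The `tiebreak_at_six_all` flag is a hypothetical parameter I introduce to cover both formats (tiebreak at 6–6 or playing on until a two-game lead); it is not taken from the dataset.

```python
def set_winner(games_a, games_b, tiebreak_at_six_all=True):
    """Return "A" or "B" if the games score ends the set, else None."""
    high, low = max(games_a, games_b), min(games_a, games_b)
    leader = "A" if games_a > games_b else "B"
    # At least six games and a two-game lead always ends the set.
    if high >= 6 and high - low >= 2:
        return leader
    # A 7-6 score only ends the set when a tiebreak was played at 6-6.
    if tiebreak_at_six_all and high == 7 and low == 6:
        return leader
    return None


print(set_winner(6, 4))  # A
print(set_winner(6, 5))  # None
print(set_winner(5, 7))  # B
print(set_winner(7, 6, tiebreak_at_six_all=False))  # None
```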

In Grand Slams, winning the men’s and women’s singles events requires going through seven rounds and matches. Men have to win three sets of a possible five to win a match, and women have to win two sets of a possible three.

The singles draw at a Grand Slam starts with 128 players (R128), and the doubles draw with 64 teams (R64). The number of players remaining is halved after each round. In men’s singles, for example, 64 players remain after the first round (R64), 32 after the second (R32), and 16 after the third (R16). The eight winners of the fourth round play the quarterfinals (QF), followed by the semifinals (SF) and the final (F).
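The halving can be written out in a few lines. The sketch below derives the round labels used in this post from a draw size; the helper name is my own, and it assumes the draw size is a power of two.

```python
def round_labels(draw_size):
    """List the round labels of a knockout draw, from the first round to the final."""
    if draw_size < 2 or draw_size & (draw_size - 1):
        raise ValueError("draw size must be a power of two")
    named = {8: "QF", 4: "SF", 2: "F"}  # the last three rounds have their own names
    labels = []
    remaining = draw_size
    while remaining >= 2:
        labels.append(named.get(remaining, f"R{remaining}"))
        remaining //= 2  # half of the players are eliminated each round
    return labels


print(round_labels(128))  # ['R128', 'R64', 'R32', 'R16', 'QF', 'SF', 'F']
```

A 128-player draw therefore takes seven rounds, which matches the seven matches a singles champion has to win.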

One of my primary motivations for writing this article is that I love watching tennis matches, particularly Grand Slams. The other is to showcase Neo4j’s abilities in analyzing sports competitions and tournaments.

Dataset

For graph generation, we will use the singles datasets curated by Jeff Sackmann in the tennis_wta and tennis_atp repositories. The repositories include CSV files containing all the matches in the women’s WTA tournaments between 1920 and 2022 and the men’s ATP tournaments from 1968 to 2022, and Jeff keeps them up to date. Many thanks to Jeff Sackmann for curating the datasets.

JeffSackmann – Overview

As I mentioned above, we will use the WTA and ATP datasets between 2000 and 2021. I merged and filtered them using Pandas and saved the result to my repository for simplicity.
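The filtering step might look roughly like the sketch below. I did this with Pandas; the sketch uses only the standard csv module so that it runs without dependencies. The column names (tourney_name, tourney_date, round, score, and so on) follow the Sackmann CSV layout, the three sample rows are typed in by hand for illustration (their dates are placeholders), and the case-insensitive name match is an assumption to guard against variations such as "Us Open".

```python
import csv
import io

# Hand-typed sample rows for illustration only; dates are placeholders.
SAMPLE_CSV = """tourney_name,tourney_date,winner_name,loser_name,score,round
Roland Garros,20210531,Novak Djokovic,Stefanos Tsitsipas,6-7(6) 2-6 6-3 6-2 6-4,F
Us Open,20210830,Daniil Medvedev,Novak Djokovic,6-4 6-4 6-4,F
Wimbledon,20210628,Novak Djokovic,Matteo Berrettini,6-7(4) 6-4 6-4 6-3,F
"""

WANTED = {"roland garros", "us open"}


def filter_matches(csv_text, first_year=2000, last_year=2021):
    """Keep Roland Garros and U.S. Open matches played in the given year range."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["tourney_date"][:4])  # tourney_date is formatted YYYYMMDD
        name = row["tourney_name"].strip().lower()
        if name in WANTED and first_year <= year <= last_year:
            kept.append({**row, "year": year})
    return kept


for match in filter_matches(SAMPLE_CSV):
    print(match["tourney_name"], match["year"], match["winner_name"])
```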

blogposts/medium/tennis at main · iamvarol/blogposts

The Graph

First off, if you’re a developer and are not familiar with Neo4j, you should start here to acclimate yourself. In short, Neo4j is a widely used graph database, and its product family includes Neo4j Desktop, AuraDB, AuraDS, Bloom, and Graph Data Science.

The graph data model is shown below. Generally speaking, it tells us that a player can win or lose a match, and that the matches in a tournament are linked according to rounds (from R128 to F). The annual tournaments are likewise linked in order of their years.

The node labels for the graph include Player (id, name, gender, hand, ioc), Match (id, year, round, score), Set (id, score, number), and Tournament (id, name, year, type). The relationships for the graph include MATCH_WINNER, MATCH_LOSER, IN_TOURNAMENT, IN_MATCH, NEXT_TOURNAMENT, and NEXT_MATCH.

It is time to set up constraints that comply with the data model. We will create a uniqueness constraint on the id property for Player, Match, Set, and Tournament nodes. These constraints prevent the creation of duplicate nodes during graph generation.

In addition, when we define a uniqueness constraint in Neo4j, an index is created implicitly. This index on the label and property speeds up lookups and so reduces the time spent in the graph creation phase.
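A minimal sketch of how such constraint statements could be generated is shown here. It assumes the `FOR ... REQUIRE` syntax of Neo4j 4.4 and later (older versions use `ON ... ASSERT`), and the constraint names such as player_id are hypothetical choices of mine.

```python
LABELS = ["Player", "Match", "Set", "Tournament"]


def constraint_statements(labels=LABELS):
    """Build one uniqueness-constraint statement per node label."""
    return [
        # Constraint names are hypothetical; any unique name works.
        f"CREATE CONSTRAINT {label.lower()}_id IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
        for label in labels
    ]


for statement in constraint_statements():
    print(statement)
```

Each statement would then be run once against the database, for example through a session of the official Python driver.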

I used separate code snippets for the WTA and ATP tournaments to build the graph, but both follow the same logic and differ only in the relevant parameters.

In my opinion, the critical part of creating a graph is designing a data model and logic that can be easily queried, that is, traversed. To illustrate this with our example, we have players, sets, matches, and tournaments.

Each tournament consists of matches in different rounds. Each match consists of sets played by the players. A player wins the match by winning enough sets. The player who wins their match in every round becomes the tournament’s champion.

To load the CSV file and create the nodes, we will use the apoc.periodic.iterate procedure from the APOC library, which is great for processing large amounts of data because it splits the work into batches and commits each batch in its own transaction. APOC stands for Awesome Procedures On Cypher, an add-on library for Neo4j. It provides many practical procedures and functions that simplify and speed up common tasks.
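The call might be shaped roughly as follows. This is a sketch, not the exact statement from the notebook: it only creates Player nodes for match winners, the CSV column names (winner_id, winner_name, winner_hand, winner_ioc) are assumed from the Sackmann layout, and the `$url` parameter is a placeholder for wherever the merged file is hosted.

```python
def build_player_load_query(batch_size=1000):
    """Return a Cypher statement that loads winners from a CSV file in batches."""
    return f"""
CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM $url AS row RETURN row',
  'MERGE (p:Player {{id: row.winner_id}})
   ON CREATE SET p.name = row.winner_name,
                 p.hand = row.winner_hand,
                 p.ioc = row.winner_ioc',
  {{batchSize: {batch_size}, parallel: false, params: {{url: $url}}}}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
"""


print(build_player_load_query(batch_size=500))
```

The first statement streams the rows, the second runs once per row, and batchSize controls how many rows are committed per transaction. Keeping parallel set to false avoids lock contention when several rows MERGE the same player.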

We establish a NEXT_TOURNAMENT relationship separately for Roland Garros and U.S. Open by year.

We create a NEXT_MATCH relationship for the matches in each tournament based on the rounds.

Thanks to the NEXT_TOURNAMENT and NEXT_MATCH relationships, we can now run queries across consecutive tournaments and matches.
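One way the yearly chain could be built is sketched below: collect the editions of a tournament in year order and connect each one to its successor. The query assumes that the name and type properties together identify one series (for example, the men's Roland Garros); that reading of type is my assumption. The small Python helper shows the same pairing logic on plain years.

```python
# Assumption: `name` plus `type` selects one tournament series.
NEXT_TOURNAMENT_QUERY = """
MATCH (t:Tournament {name: $name, type: $type})
WITH t ORDER BY t.year
WITH collect(t) AS editions
UNWIND range(0, size(editions) - 2) AS i
WITH editions[i] AS earlier, editions[i + 1] AS later
MERGE (earlier)-[:NEXT_TOURNAMENT]->(later)
"""


def consecutive_pairs(years):
    """Pair each year with the next one, mirroring the UNWIND in the query."""
    ordered = sorted(years)
    return list(zip(ordered, ordered[1:]))


print(consecutive_pairs([2002, 2000, 2001]))  # [(2000, 2001), (2001, 2002)]
```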

After running all the graph-related code snippets, we will have more than 45K nodes and more than 75K relationships. The visualization below shows only 20 percent of the nodes and relationships. When we run the analysis code snippets, we will see that none of the queries takes more than a few seconds to return. AuraDB therefore handled storing and querying this graph comfortably, even on the free plan.

Analysis

The chart below shows the 20 countries with the most participating players. The U.S., France, and Spain are the top three, in that order.

Finals of the tournaments

If we set the match round to F, as in Match {round:"F"}, we can traverse to the finals of the tournaments.
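A query along these lines could list the finals. The relationship directions (players pointing to matches, matches pointing to tournaments) are assumptions based on the data model described earlier, and `fetch_finals` expects a session object from the official neo4j Python driver.

```python
# Relationship directions are assumed; adjust them to the actual graph.
FINALS_QUERY = """
MATCH (winner:Player)-[:MATCH_WINNER]->(final:Match {round: "F"})<-[:MATCH_LOSER]-(runner_up:Player)
MATCH (final)-[:IN_TOURNAMENT]->(t:Tournament)
RETURN t.name AS tournament, t.year AS year, t.type AS type,
       winner.name AS champion, runner_up.name AS runner_up, final.score AS score
ORDER BY tournament, year, type
"""


def fetch_finals(session):
    """Run the finals query and return one plain dict per final."""
    return [record.data() for record in session.run(FINALS_QUERY)]
```

With the official driver, a session could be opened with `GraphDatabase.driver(uri, auth=(user, password)).session()`, and the resulting list of dicts can be handed to Pandas for pivoting.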

Champions and runners-up of the tournaments

After pivoting the above table, we can list the winners of the tournaments by year in the men’s and women’s categories, as seen below. The kings are Rafael Nadal (13) at Roland Garros and Roger Federer (5) at the U.S. Open. The queens are Justine Henin (4) at Roland Garros and Serena Williams (5) at the U.S. Open.

Among the runners-up, Novak Djokovic and Roger Federer are tied at Roland Garros with four lost finals each. Kim Clijsters, Dinara Safina, and Simona Halep are tied in the women’s singles at Roland Garros with two each. Novak Djokovic (6) and Serena Williams (4) are the most frequent runners-up at the U.S. Open.

When we evaluate Roland Garros and U.S. Open together, Rafael Nadal is clearly ahead in men’s singles, while Novak Djokovic is the leading runner-up.

Interestingly, Serena Williams leads women’s singles in both championships and runner-up finishes. This dominance is due to her performance at the U.S. Open.

Players who were the runners-up in the previous year and the winners the following year

When we examine runners-up in the previous year who became champions the following year, Novak Djokovic (3) in men’s singles and Serena Williams (2) in women’s singles come to the fore.
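The logic behind this question is a single hop from one edition to the next: take the runner-up of year N and check whether the same player is the champion of year N+1. The sketch below mirrors that hop in plain Python on the Roland Garros men's finals from 2014 to 2021, typed in by hand; in the graph, the hop would follow a NEXT_TOURNAMENT relationship instead of a dictionary lookup.

```python
# Roland Garros men's singles finals as (champion, runner-up), typed in by hand.
RG_MEN_FINALS = {
    2014: ("Rafael Nadal", "Novak Djokovic"),
    2015: ("Stan Wawrinka", "Novak Djokovic"),
    2016: ("Novak Djokovic", "Andy Murray"),
    2017: ("Rafael Nadal", "Stan Wawrinka"),
    2018: ("Rafael Nadal", "Dominic Thiem"),
    2019: ("Rafael Nadal", "Dominic Thiem"),
    2020: ("Rafael Nadal", "Novak Djokovic"),
    2021: ("Novak Djokovic", "Stefanos Tsitsipas"),
}


def runner_up_then_champion(finals):
    """Find champions who were the runner-up of the previous year's edition."""
    hits = []
    for year in sorted(finals):
        previous = finals.get(year - 1)  # the edition one NEXT_TOURNAMENT hop back
        champion = finals[year][0]
        if previous is not None and previous[1] == champion:
            hits.append((year, champion))
    return hits


print(runner_up_then_champion(RG_MEN_FINALS))
# [(2016, 'Novak Djokovic'), (2021, 'Novak Djokovic')]
```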

Players who lost before QF in the previous year but won the tournament the following year

The table below shows the players who did not reach the quarterfinals the previous year and who became the champions the following year.

Particularly noteworthy here are the players who were eliminated in the first round (R128) one year and became champions the following year:

Dominic Thiem
Jelena Ostapenko
Stan Wawrinka
Serena Williams
Francesca Schiavone
Justine Henin
Albert Costa
Jennifer Capriati

Players who have been champions at least twice

With the help of the query below, we find players who have been champions at least twice in a tournament.

Tournament winning streaks

Tournament winning streaks are essential in evaluating the players. After applying a function to the above dataframe, we can find out the winning streaks in Roland Garros and U.S. Open between 2000 and 2021.
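A streak function of this kind can be sketched with itertools.groupby. This is not the exact function from the notebook; it works on a plain list rather than a dataframe and assumes one champion per year with no missing years. The sample data is the list of Roland Garros men's champions from 2005 to 2021, typed in by hand.

```python
from itertools import groupby

# Roland Garros men's singles champions, typed in by hand.
RG_MEN_CHAMPIONS = (
    [(year, "Rafael Nadal") for year in range(2005, 2009)]
    + [(2009, "Roger Federer")]
    + [(year, "Rafael Nadal") for year in range(2010, 2015)]
    + [(2015, "Stan Wawrinka"), (2016, "Novak Djokovic")]
    + [(year, "Rafael Nadal") for year in range(2017, 2021)]
    + [(2021, "Novak Djokovic")]
)


def winning_streaks(champions, min_length=2):
    """Return (player, first_year, last_year, length) for each run of titles."""
    streaks = []
    # groupby merges neighbouring entries with the same champion,
    # so the input must be in year order without gaps.
    for name, run in groupby(sorted(champions), key=lambda item: item[1]):
        years = [year for year, _ in run]
        if len(years) >= min_length:
            streaks.append((name, years[0], years[-1], len(years)))
    return streaks


for streak in winning_streaks(RG_MEN_CHAMPIONS):
    print(streak)
```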

Players who have been runners-up at least twice and streaks

With the help of the query below, we find players who have been runners-up at least twice in a tournament.

After tweaking the results, we can find the runner-up streaks in Roland Garros and U.S. Open between 2000 and 2021. As listed below, Roger Federer was the runner-up three years in a row, in 2006, 2007, and 2008.

Sweepers in the finals

There are different analyses we can do by considering the sets. Arguably the most important is to find the sweepers, that is, the champions who did not lose a single set throughout the tournament. The list below shows that women champions achieve this more often 👏.
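Whether a champion swept the tournament can also be checked from the match score strings alone. The sketch below assumes the Sackmann score format, where each set is written from the winner's point of view (for example "7-6(5) 6-4") and incomplete matches carry markers such as RET or W/O. Treating incomplete matches as disqualifying is my own choice, and the helper names are hypothetical.

```python
import re

# One set such as "6-4" or "7-6(5)"; the bracket holds the tiebreak points.
SET_PATTERN = re.compile(r"^(\d+)-(\d+)(?:\(\d+\))?$")


def sets_lost_by_winner(score):
    """Count the sets the match winner lost, or None for an incomplete match."""
    lost = 0
    for token in score.split():
        parsed = SET_PATTERN.match(token)
        if parsed is None:
            return None  # markers such as RET or W/O
        winner_games, loser_games = int(parsed.group(1)), int(parsed.group(2))
        if loser_games > winner_games:
            lost += 1
    return lost


def is_sweeper(match_scores):
    """True if every match was completed without the winner dropping a set."""
    return all(sets_lost_by_winner(score) == 0 for score in match_scores)


print(sets_lost_by_winner("6-7(6) 2-6 6-3 6-2 6-4"))  # 2
print(is_sweeper(["6-1 6-2", "7-6(3) 6-4"]))  # True
```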

Route to trophy

I stated that the NEXT_MATCH and NEXT_TOURNAMENT relations would help us a lot in graph queries. With the help of these relationships, we answered the above questions very easily and quickly. Finally, we’ll use these relationships to look at the opponents the champions face in each round and their match scores on their journey to the trophy.

As the champion of Roland Garros 2021, Novak Djokovic started his run to the final with a first-round match against Tennys Sandgren.

He then beat, in order, Pablo Cuevas, Ricardas Berankis, Lorenzo Musetti, Matteo Berrettini, and Rafael Nadal to advance to the final. He became the 2021 Roland Garros champion by defeating Stefanos Tsitsipas in the final.

The champion of the Roland Garros 2021, Novak Djokovic’s Journey

Conclusion

If I had to describe Neo4j in three words, they would be “simple joyful traversals.”

I’m sure the above analysis can also be done with SQL queries, since SQL is a powerful language that has been used for a long time. At the same time, it is clear that doing so would take many JOINs to follow the relationships between tables.

Each additional join makes the query more complex and can make it slower to run. With the Cypher query language, by contrast, the same questions are expressed as short traversals, as we have seen.

For a thorough comparison, I highly recommend this article recently published by Michael Hunger. You can also check out my previous Neo4j-related blog posts.

The notebook we’ve worked through can be found here. I hope you fork it and modify it to meet your needs. Pull requests are always welcome!

Thank you for reading! You can reach out to me on LinkedIn, GitHub, and Twitter!

References

Analyzing Roland Garros and US Open Tennis Tournaments via Neo4j was originally published in Neo4j Developer Blog on Medium.