Discover Aura Free: Week 39 – Nobel Prize


The Nobel Prizes in Physics, Chemistry, Medicine, and Literature — and the Nobel Peace Prize — were awarded over the last few weeks, and we thought it’d make a nice dataset to import and query as a graph. There are a few dimensions to it, and you can use it to create a knowledge graph if you connect it to papers, authors, publications, and research data.

If you missed our livestream, here is the recording:

First, I found a Kaggle Dataset with some CSV and JSON data, but that ended in 2019 — fortunately, I spotted a link to the original data source with API links to https://nobelprize.org.

Alfred Nobel Prize Medal

So I went there and looked at this year’s Prizes and some Lesser Known Facts, which we can also use for our queries:

  • People that were awarded more than one prize
  • Age of laureates
  • Most common affiliations with institutions and their impact
  • Breakdown by country
  • Years without awards
  • Sadly, the low percentage (10%) of non-white-male recipients of Nobel Prizes

Data Source

In the footer of the page, there’s actually a link to a developer page, which is great.

They have a new v2.1 REST API with an OpenAPI specification, so you can grab prizes (664) and laureates (981) with a lot of detail and pagination. The developer page also links to a Linked Data (RDF) API with a SPARQL endpoint.

There is also an older v1 API that provides both JSON and CSV outputs based on the same data (I checked the IDs). That’s also what the code example on the developer page uses.

To make it easy to start, I just used the option of getting the v1 API responses as CSV for prizes and laureates.
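If you would rather script the download than save the files by hand, a minimal sketch could look like the following. The base URL and the resource names `prize` and `laureate` are assumptions about the v1 API, so check them against the developer page before relying on them. The helper names are ours.

```python
import csv
import io
import urllib.request

# Assumption: base URL of the older v1 API; verify on the developer page.
V1_BASE = "https://api.nobelprize.org/v1"


def v1_csv_url(resource: str) -> str:
    """Build the CSV URL for a v1 resource such as 'prize' or 'laureate'."""
    return f"{V1_BASE}/{resource}.csv"


def parse_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(resource: str) -> list[dict[str, str]]:
    """Download one v1 resource as CSV and return its rows (needs network access)."""
    with urllib.request.urlopen(v1_csv_url(resource)) as response:
        # utf-8-sig also strips a byte-order mark if the server sends one.
        return parse_csv(response.read().decode("utf-8-sig"))
```

Calling `fetch_rows("prize")` and `fetch_rows("laureate")` would then give you the rows of the two files used in the rest of this post.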

We can later add to the data in the graph by querying select parts of the data as JSON from the new v2.1 API and merging them into our graph.

Data Model

We developed the data model incrementally based on the data in the CSV, our understanding, and the questions we wanted to ask.

Nobel Prize Data Model

So we ended up with these nodes:

  • Prize (Nobel Prize)
  • Person who received the Prize
  • Year for the Prize
  • Category for the Prize
  • Institution the person is affiliated with
  • Country of the person’s birth and death, and of the institution

Create a Neo4j AuraDB Free Instance

Go to https://dev.neo4j.com/neo4j-aura to register or log into the service (you might need to verify your email address).

After clicking Create Database you can create a new Neo4j AuraDB Free instance.

Create Instance

Choose the “Empty Instance” option as we want to import our data ourselves.

On the Credentials popup, make sure to save the password somewhere safe. It’s best to download the credentials file, which you can also use for your app development.

AuraDB Credentials Download

The default username is always neo4j.

Then wait two to three minutes for your instance to be created.

Afterwards, you can connect to the instance with Workspace via the “Open” button (you’ll need the password). Workspace offers the “Import” (Data Importer), “Explore” (Neo4j Bloom), and “Query” (Neo4j Browser) tabs to work with your data.

Connect Dialog Workspace

On the database tile, you can also find the connection URL: neo4j+s://xxx.databases.neo4j.io (it is also contained in your credentials env file).

If you want to see examples of connecting to the database programmatically, go to the “Connect” tab of your instance and pick the language of your choice.

Instance Details

Data Import

Unfortunately, the prize itself has no real unique id; in places where they refer to it, they use a combination of category and year.

So I used xsv select 1-8,category,year prize.csv > prize2.csv to duplicate the two columns at the end. Then I ran a regular-expression replacement in VS Code to replace the comma between the last two fields with a dash. The result is our id column, called categoryYear, with entries like chemistry-2022.
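The same id column can also be built in a single scripted step. This is a sketch of an alternative, not what we did on the stream. It assumes the prize CSV has columns named `category` and `year`, and the function name is made up for illustration.

```python
import csv
import io


def add_category_year(csv_text: str) -> str:
    """Append a categoryYear column such as 'chemistry-2022' to prize CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep the original columns and add the new id column at the end.
    fieldnames = list(reader.fieldnames or []) + ["categoryYear"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Assumption: the columns are called 'category' and 'year'.
        row["categoryYear"] = f"{row['category']}-{row['year']}"
        writer.writerow(row)
    return out.getvalue()
```

Read prize.csv into a string, pass it through `add_category_year`, and write the result to prize2.csv to get a file that Data Importer can use with categoryYear as the id.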

The other fix was in the laureates CSV. We replaced dates like 0000-00-00 with nothing (i.e. a null value) and stripped the -00-00 suffix from other dates so that just the year remained. Cypher’s date functions don’t like zero-valued months and days, but a column with a bare year can still be imported as datetime.
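The two replacements described above boil down to a small rule per date value. Here is a sketch of that rule, with a function name of our own choosing:

```python
import re


def clean_date(value: str) -> str:
    """Normalize a date string from the laureates CSV.

    '0000-00-00' becomes an empty string (imported as null), and a
    '-00-00' suffix is dropped so that only the year remains.
    """
    if value == "0000-00-00":
        return ""
    return re.sub(r"-00-00$", "", value)
```

The all-zero case has to be checked first, otherwise the suffix rule would turn it into the year 0000 instead of an empty value.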

Then we mapped the different fields to nodes and relationships. Thankfully, Data Importer often pre-filled the mapping of relationships and ids for us.

Neo4j Data Importer with CSV and Mapping

One particular aspect where we changed the mapping in the model was to extract three Country meta-nodes to represent the countries coming from the three different sources (born, died, institution) and create the right entries and relationships. Each of those three Country mappings uses the same property name, and the three different source columns are mapped to it.
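To see what sharing one property name achieves, it helps to check which distinct countries the three columns contain together. The sketch below does that on parsed CSV rows. The column names `bornCountry`, `diedCountry`, and `country` are assumptions about the laureates CSV, so adjust them to your file.

```python
def country_names(rows, columns=("bornCountry", "diedCountry", "country")):
    """Collect the distinct, non-empty country names across several columns."""
    names = set()
    for row in rows:
        for column in columns:
            # Missing or empty cells do not produce a country.
            value = (row.get(column) or "").strip()
            if value:
                names.add(value)
    return names
```

A country that appears as a birthplace in one row and as an institution’s country in another shows up only once here, which is the effect we want from the three mappings.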

Data Importer Preview

After finishing the mapping we could run the preview, see that we mapped our data correctly, and then click “Import.”

Explore

After import, we’re sent directly to the Explore tab, which gives us an initial view of our data with “Show me a graph.” We can now style our data by picking the right captions and icons in the right-side legend.

Explore: Show me a Graph

We can also explore our data starting from a node, here Harvard Medical School, and then expand the pattern to people, their prizes, and years. After getting the results, we can select all (Cmd+A/Ctrl+A) and choose “Expand All” from the context menu, so we get a more complete picture of the context of that institution.

Explore: Context of Institution

Query

To answer some of the initial questions from the facts section, we moved to the “Query” tab.

First, looking at laureates with more than one prize, we can express that as a pattern of a recipient having received two prizes. The id(p1) < id(p2) condition keeps each pair of prizes from showing up twice.

Most that show up here are organizations. It gets interesting when looking at people who won prizes in different categories by adding WHERE p1.category <> p2.category, which leaves just two: Marie Curie and Linus Pauling.

match (p1:Prize)<-[:RECEIVED]-(p)-[:RECEIVED]->(p2:Prize)
where id(p1) < id(p2)
// add a space between the name parts; organizations have no surname
return p1.category, p1.year, trim(p.firstname + ' ' + coalesce(p.surname, '')) as name, p2.category, p2.year
order by p1.category asc

Query: Multiple Prizes

Alternatively, you can also query for the base pattern and then aggregate per person how many and which prizes they got and filter for recipients that had more than one.

match (p:Person)-[:RECEIVED]->(pr:Prize)
with p, collect(pr) as prizes
where size(prizes) > 1
return p.surname, p.firstname, size(prizes) as count,
[pr in prizes | pr {.category, .year}] as prizes

Affiliations with institutions have a big impact on the Nobel prize, as you can see in the following query, with institutions from the US being over-indexed.

match (i:Institution)<-[:AFFILIATED_WITH]-()-[:RECEIVED]->(pr:Prize)
return i.name, i.country, i.city, count(*) as count
order by count desc
limit 20

Query: Institutions

To compute the age, we turn the year of the prize into a date (date({year: pr.year})) and the born and died datetimes from Data Importer into dates (e.g. date(p.born)). Then we can compute the age difference by using duration.between(date1, date2).years and sort accordingly.

match (p:Person)-[:RECEIVED]->(pr:Prize)
// convert both values to dates before computing the difference
return p.firstname, p.surname, p.born, pr.year,
       duration.between(date(p.born), date({year: pr.year})).years as years
order by years asc
limit 10

Query: Age – Youngest
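One detail worth knowing: a date built from just a year falls on January 1 of that year, so the computed age is the age at the start of the prize year. The sketch below mirrors that calculation outside the database. The function name and the sample values are purely illustrative.

```python
from datetime import date


def age_in_prize_year(born: date, prize_year: int) -> int:
    """Full years between the birth date and January 1 of the prize year."""
    prize_date = date(prize_year, 1, 1)
    years = prize_date.year - born.year
    # Subtract one if the birthday has not yet happened by January 1.
    if (prize_date.month, prize_date.day) < (born.month, born.day):
        years -= 1
    return years


print(age_in_prize_year(date(1997, 7, 12), 2014))
```

For anyone born after January 1, the result is one year lower than their age at the ceremony at the end of the year, which is fine for ranking but worth keeping in mind when quoting ages.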

Nominations

There is also data on nominations available — actually quite a lot with 20,424 nominations.

Unfortunately, you cannot access it through the API, just through a crude PHP search interface with HTML output. So to get that data you’d have to scrape it from the web.

There is also a visualization page available, which is perhaps an easier way to get to the data: it seems to load it from a URL that returns JSON.

We also learned that nomination data is kept secret for 50 years, so the latest data available is from 1971. That is probably meant to hold off feuds, bribery, and similar research vengeance until after the laureates and nominators are dead.

Conclusion

As mentioned in the introduction, this dataset can be nicely combined with citation datasets and perhaps research grants and projects in general. So you could see how the influence of Nobel laureates spreads across the research networks and which institutions are perhaps more privileged than others.

Definitely a good starting point for a research knowledge graph. Let us know in the comments if you have more ideas or found this useful.





Discover Aura Free: Week 39 – Nobel Prize was originally published in Neo4j Developer Blog on Medium.