Discover Neo4j AuraDB Free – Week 24: NYTimes Article Knowledge Graph

Head of Product Innovation & Developer Strategy, Neo4j

April 11, 2022

12 min read

The role of journalists in current times can’t be overstated. We all rely on critical and honest journalism to chronicle and alert us to the perils of humankind.

But news is much more than just the articles you see on the front pages. A lot of the detail is hidden behind metadata that’s not immediately visible.

If you look at the front page of the New York Times you see a lot of moving articles and images.

The New York Times – Breaking News, US News, World News and Videos

Today we want to look behind the scenes, using the NYTimes API to access some of the article metadata.

If you’d rather watch the livestream video, have fun! Otherwise, follow along with this article.

Our colleague Will Lyon prepared a repository with a data model and import script for the API of popular articles, as well as a Neo4j Browser guide. He originally created the data exploration for an investigative journalism conference (NICAR).

You can go there to follow along.

GitHub – johnymontana/news-graph: New York Times articles as a knowledge graph

Let’s first create our AuraDB Free instance to get you going.

Please pick the “blank database.”

connect-aura.adoc

Developer API

First, sign up to the NYTimes developer API, and add an app to get your API Key.

They provide a lot of different APIs, including book and movie reviews, semantic categories, article archive metadata, and popular articles.

To keep things simple we’re looking at the straightforward “popular articles” API, which is just a JSON endpoint that provides the most popular articles for the last 7 or 30 days (just change the number in the URL to adjust your time frame).

https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key=<key>;

Here you can see what part of the response looks like.

{
	"status": "OK",
	"copyright": "Copyright (c) 2022 The New York Times Company.  All Rights Reserved.",
	"num_results": 20,
	"results": [{
			"uri": "nyt://article/8428defe-c56e-5177-8798-ea2fbc3ef715",
			"url": "https://www.nytimes.com/2022/04/06/us/politics/us-russia-malware-cyberattacks.html",
			"id": 100000008282002,
			"asset_id": 100000008282002,
			"source": "New York Times",
			"published_date": "2022-04-06",
			"updated": "2022-04-07 10:45:47",
			"section": "U.S.",
			"subsection": "Politics",
			"nytdsection": "u.s.",
			"adx_keywords": "Russian Invasion of Ukraine (2022);Cyberwarfare and Defense;United States International Relations;Espionage and Intelligence Services;Biden, Joseph R Jr;Justice Department;Federal Bureau of Investigation;GRU (Russia);Russia;Ukraine",
			"column": null,
			"byline": "By Kate Conger and David E. Sanger",
			"type": "Article",
			"title": "U.S. Says It Secretly Removed Malware Worldwide, Pre-empting Russian Cyberattacks",
			"abstract": "The operation is the latest effort by the Biden administration to thwart actions by Russia by making them public before Moscow can strike.",
			"des_facet": [
				"Russian Invasion of Ukraine (2022)",
				"Cyberwarfare and Defense",
				"United States International Relations",
				"Espionage and Intelligence Services"
			],
			"org_facet": [
				"Justice Department",
				"Federal Bureau of Investigation",
				"GRU (Russia)"
			],
			"per_facet": [
				"Biden, Joseph R Jr"
			],
			"geo_facet": [
				"Russia",
				"Ukraine"
			],
			"media": [{
				"type": "image",
				"subtype": "photo",
				"caption": "Some American officials fear that President Vladimir V. Putin of Russia may be biding his time in launching a major cyberoperation that could strike a blow at the American economy.",
				"copyright": "Mikhail Klimentyev/Sputnik",
				"approved_for_syndication": 0,
				"media-metadata": [
					{
						"url": "https://static01.nyt.com/images/2022/04/06/us/politics/06dc-russia-hacks-1/merlin_204742779_ca6a0b7b-3630-426c-9ee7-77628e11521b-mediumThreeByTwo440.jpg",
						"format": "mediumThreeByTwo440",
						"height": 293,
						"width": 440
					}
				]
			}],
			"eta_id": 0
		},

This is the article data and metadata that we will import/turn into a graph and explore.

Data Model

The data model is based on the data we get back from the API.

We have the main Article node with properties like:

title
id
url
byline (Authors)
source
published_date
abstract

There are also others, but we’ll ignore those sections for now.

Additionally, we get metadata for each article:

Topics (des_facet)
Organizations (org_facet)
People (per_facet)
Locations (geo_facet)
Photos (media)

Those metadata entries can be turned into their own nodes and connected to the article via relationships (see the model diagram). Via those nodes, we can then correlate and relate articles.

Import

We’ve seen the direct response from the JSON API in our browser. To load the data into Neo4j we can utilize apoc.load.json, a custom procedure that provides the response of the API as Cypher data structures.

It is one of the procedures that’s also available in AuraDB.

https://neo4j.com/docs/aura/current/getting-started/apoc/#_apoc_load

Let’s start by just fetching the data.

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key=<api-key>")

In Neo4j Browser, we can actually set the API key as a parameter with :param key⇒'<api-key>'. Then we can use $key in our query and don’t have to expose our key.

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key="+$key)

You’ll see it looks the same as before. What we need to do now is take the value result and iterate over the results array.

Let’s do that with UNWIND (that turns lists into rows) and instead of the full entry return its individual parts.

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key="+$key)
yield value
unwind value.results as article
return article.id, article.title, article.published_date, article.byline, article.geo_facet, article.des_facet;

This is now the data we can use to create our article nodes.

╒════════════╤════════════╤════════════╤════════════╤════════════╕
│"title"     │"article.pub│"article.byl│"article.geo│"article.des│
│            │lished_date"│ine"        │_facet"     │_facet"     │
╞════════════╪════════════╪════════════╪════════════╪════════════╡
│"Satellite i│"2022-04-04"│"By Malachy │["Ukraine","│["Russian In│
│mage"       │            │Browne, Davi│Russia","Buc│vasion of Uk│
│            │            │d Botti and │ha (Ukraine)│raine (2022)│
│            │            │Haley Willis│","Kyiv (Ukr│","Civilian │
│            │            │"           │aine)"]     │Casualties"]│
├────────────┼────────────┼────────────┼────────────┼────────────┤
│"Grammys 202│"2022-04-03"│"By Shivani │[]          │["Grammy Awa│
│2 Wi"       │            │Gonzalez"   │            │rds","Gospel│
│            │            │            │            │ Music","Fol│
│            │            │            │            │k Music","Cl│
│            │            │            │            │assical Musi│
│            │            │            │            │c","Pop and │
│            │            │            │            │Rock Music",│
│            │            │            │            │"Blues Music│
│            │            │            │            │","Jazz"]   │

We will use MERGE, which is a get-or-create (or upsert), so if the data already exists in our graph it’s not added again — it’s just provided to our statement.

Our key to merging the article is its URL, which should be globally unique, and then SET the remaining attributes.

(For larger data volumes and consistency we’d also create a constraint, but we’re skipping that here.)

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key="+$key) yield value

unwind value.results as article 
// uniquely create an article
  MERGE (a:Article {url: article.url})
    SET a.title     = article.title,
        a.abstract  = article.abstract,
        a.published = datetime(article.published_date),
        a.byline    = article.byline,
        a.id = article.id
        a.source = article.source;

So if we now go to the left sidebar and click on the Article label, we can see a number of lonely article nodes floating in space. You can push them around and query them by attributes, but not yet follow their relationships.

// find some articles
MATCH (n:Article) RETURN n LIMIT 25;

// find articles by date
MATCH (n:Article)
WHERE n.published = datetime("2022-04-08")
RETURN n LIMIT 25;

// find articles with matching (case sensitive) title contents
MATCH (n:Article)
WHERE n.title contains 'Bucha'
RETURN n.title, n.byline
LIMIT 25;

In the next step, we’re starting to add relationships, first for the Topic nodes that are contained in the des_facet array of the article response.

A FOREACH iterates over that array creates the nodes, and connects them to the current article.

We also change the SET to ON CREATE SET so the properties are only added the first time a node is created.

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key="+$key) yield value
unwind value.results as article
  MERGE (a:Article {url: article.url})
  ON CREATE SET a.title     = article.title,
        a.abstract  = article.abstract,
        a.published = datetime(article.published_date),
        a.byline    = article.byline,
        a.id = article.id,
        a.source = article.source

  FOREACH (desc IN article.des_facet |
    MERGE (d:Topic {name: desc})
    MERGE (a)-[:HAS_TOPIC]->(d)
  )

Topic Overlap and Recommendation

This already opens up some interesting aspects — for one, we already see clusters appearing of articles that share one or more topics but are totally unrelated to other sets of articles.

If you click on the HAS_TOPIC relationship in the left sidebar and remove the limit you should see something like this. You can go into full-screen mode and zoom out for a good overview.

This also gives us our first way of not just correlating articles for similarity but also using that “content based” similarity for recommendations.

If we draw out the pattern of (a1:Article)-[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(a2:Article)that represents our shared topics between articles.

We can use that to either find clusters visually like seen before or to compute how many topics a pair of articles shares.

We just prefix the pattern with MATCH and then count the number of topics per pair of articles.

MATCH (a1:Article)-[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(a2:Article)
// exclude opposite pair
WHERE id(a1)<id(a2)
RETURN a1.title, a2.title, count(topic) as overlap
ORDER BY overlap DESC limit 5;

╒══════════════════════╤══════════════════════╤═════════╕
│"title1"              │"title2"              │"overlap"│
╞══════════════════════╪══════════════════════╪═════════╡
│"A New Wave of Covid-"│"A New Covid Mystery" │4        │
├──────────────────────┼──────────────────────┼─────────┤
│"Russia Asked China f"│"U.S. Officials Say S"│3        │
├──────────────────────┼──────────────────────┼─────────┤
│"Satellite images sho"│"A makeup artist reco"│2        │
├──────────────────────┼──────────────────────┼─────────┤
│"Satellite images sho"│"Russian soldiers ope"│2        │
├──────────────────────┼──────────────────────┼─────────┤
│"New York Judge Dies "│"Sarah Lawrence Cult "│2        │
└──────────────────────┴──────────────────────┴─────────┘

If you anchor the starting article — e.g. as the one currently being read — you can recommend (similar) articles based on that topic overlap.

If you want to see which topics the articles overlap on, you can collect the names (aggregation function) into a list too.

MATCH (a1:Article)-[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(a2:Article)
// exclude opposite pair
WHERE id(a1)<id(a2)
RETURN a1.title, a2.title, count(topic) as score, collect(topic.name) as topics
ORDER BY overlap DESC limit 5;


╒══════════════════╤══════════════════╤═════════╤══════════════════╕
│"title1"          │"title2"          │"overlap"│"topics"          │
╞══════════════════╪══════════════════╪═════════╪══════════════════╡
│"A New Wave of Cov│"A New Covid Myste│4        │["Coronavirus (201│
│id-"              │ry"               │         │9-nCoV)","Disease │
│                  │                  │         │Rates","Tests (Med│
│                  │                  │         │ical)","Coronaviru│
│                  │                  │         │s Omicron Variant"│
│                  │                  │         │]                 │
├──────────────────┼──────────────────┼─────────┼──────────────────┤
│"Russia Asked Chin│"U.S. Officials Sa│3        │["Embargoes and Sa│
│a f"              │y S"              │         │nctions","Russian │
│                  │                  │         │Invasion of Ukrain│
│                  │                  │         │e (2022)","United │
│                  │                  │         │States Internation│
│                  │                  │         │al Relations"]    │

Other Metadata

Similarly, the other metadata turns into Geo, Person, Organization nodes with their appropriate relationships.

  call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json?api-key="+$key) yield value
unwind value.results as article
  MERGE (a:Article {url: article.url})
  ON CREATE SET a.title     = article.title,
        a.abstract  = article.abstract,
        a.published = datetime(article.published_date),
        a.byline    = article.byline,
        a.id = article.id,
        a.source = article.source

  FOREACH (desc IN article.des_facet |
    MERGE (d:Topic {name: desc})
    MERGE (a)-[:HAS_TOPIC]->(d)
  )
  FOREACH (per IN article.per_facet |
    MERGE (p:Person {name: per})
    MERGE (a)-[:ABOUT_PERSON]->(p)
  )

  FOREACH (org IN article.org_facet |
    MERGE (o:Organization {name: org})
    MERGE (a)-[:ABOUT_ORGANIZATION]->(o)
  )

  FOREACH (geo IN article.geo_facet |
    MERGE (g:Geo {name: geo})
    MERGE (a)-[:ABOUT_GEO]->(g)
  )

With MERGE our statement becomes idempotent — we can run it as often as we want and only new data gets added.

Our graph is also now more colorful with the new metadata nodes leading to new correlations.

Again, we can use this to improve our recommendations.

You can think of the overlap count as a score that can be weighted but also computed per metadata item, depending on reader preferences or importance.

Here is an example:

MATCH (a1:Article)-[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(a2:Article)
WHERE id(a1)<id(a2)
// score for topic overlap
WITH a1,a2, count(topic) as topicScore

// score for geo overlap
OPTIONAL MATCH (a1:Article)-[:ABOUT_GEO]->(geo)<-[:ABOUT_GEO]-(a2:Article)
WITH a1,a2, topicScore, count(geo) as geoScore

// score for people overlap
OPTIONAL MATCH (a1:Article)-[:ABOUT_PERSON]->(person)<-[:ABOUT_PERSON]-(a2:Article)
WITH a1,a2, topicScore, geoScore, count(person) as personScore

// compute total score with weights
RETURN a1.title, a2.title, topicScore*5+geoScore*2+personScore*1.5 as score
ORDER BY score DESC limit 5;

╒══════════════════════╤══════════════════════╤═══════╕
│"title1"              │"title2"              │"score"│
╞══════════════════════╪══════════════════════╪═══════╡
│"A New Wave of Covid-"│"A New Covid Mystery" │22.0   │
├──────────────────────┼──────────────────────┼───────┤
│"Russia Asked China f"│"U.S. Officials Say S"│18.5   │
├──────────────────────┼──────────────────────┼───────┤
│"Satellite images sho"│"A makeup artist reco"│16.0   │
├──────────────────────┼──────────────────────┼───────┤
│"Russian Blunders in "│"They Died by a Bridg"│16.0   │
├──────────────────────┼──────────────────────┼───────┤
│"Satellite images sho"│"They Died by a Bridg"│16.0   │
└──────────────────────┴──────────────────────┴───────┘

Imagine again that article 1 is tied to an article id or URL MATCH (a1:Article {id:$articleId})the user is currently reading. The list of articles returned would be recommendations.

Photos and Media

The Photos are in a bit of a nested structure, and different resolutions of the same picture, so we can pull them in by iterating over the media list and grabbing the caption and the third image URL.

In Neo4j Browser’s property pane for the nodes you can click the URLs for the pictures to see them.

call apoc.load.json("https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json?api-key="+$key) yield value
unwind value.results as article
  MERGE (a:Article {url: article.url})
  ON CREATE SET a.title     = article.title,
        a.abstract  = article.abstract,
        a.published = datetime(article.published_date),
        a.byline    = article.byline,
        a.id = article.id,
        a.source = article.source

...

  FOREACH (media in article.media |
    MERGE (p:Photo {url: media.`media-metadata`[2].url})
      ON CREATE SET p.caption = media.caption
    MERGE (a)-[:HAS_PHOTO]->(p)
  )

Those photos could now be further analyzed with image recognition APIs to add more metadata.

Authors

The authors are unfortunately not available as metadata, but just in the byline attribute, which is a string like, "By Malachy Browne, David Botti and Haley Willis"`.

Fortunately in Cypher we have a number of string operations. We skip the first three letters, replace and with comma, and split the string by comma to get a list of authors.

Then we create a node for each of them and connect the articles. This doesn’t save us from duplicate Author names, but it’s better than not having them at all.

We can either do that as part of the import or as a post-processing step, which is shown here.

MATCH (a:Article)
// string operation to turn string into array of names
WITH a, split(replace(substring(a.byline, 3), " and ", ","), ",") AS authors
UNWIND authors AS author

// uniquely create author node
MERGE (auth:Author {name: trim(author)})
MERGE (a)-[:BYLINE]->(auth);

To import more data we can change the time range from 7 to 30 days to get all the popular articles for the last month.

Geodistance

We use another APOC procedure call for geocoding (lat, lon) of our Location metadata. This uses public geocoding APIs to resolve names to locations (openstreetmap).

MATCH (g:Geo)
CALL apoc.spatial.geocodeOnce(g.name) YIELD location
SET g.location = 
point({latitude: location.latitude, longitude: location.longitude})

Then we can determine how close they are to each other, or how close the reported articles are to our current location, which could also be used for ranking them for a viewer.

Here we use Berlin’s location as a starting point to compute distances in kilometers.

WITH point({latitude:52.5065133, longitude:13.1445557}) as berlin
MATCH (g:Geo)
RETURN g.name, round(point.distance(berlin, g.location)/1000) as dist
ORDER BY dist ASC

╒═══════════════════════════════╤═══════╕
│"g.name"                       │"dist" │
╞═══════════════════════════════╪═══════╡
│"Italy"                        │1099.0 │
├───────────────────────────────┼───────┤
│"Ukraine"                      │1310.0 │
├───────────────────────────────┼───────┤
│"Russia"                       │4689.0 │
├───────────────────────────────┼───────┤
│"China"                        │7120.0 │

Next Steps

Other extension points for the data would be additional entity extraction from the abstract, or pulling the actual article text and analyzing that.

Furthermore, we could add comments and data from other publications and sources like opencorporates, wikidata, gdelt, and more.

If you follow Will’s repository there are three other angles:

He’s building a News GraphQL API on top of the data
Using Cloudflare Edge Workers to serve location aware data
Extending GitHub’s FlatData approach with a FlatGraph that gathers source data from websites and APIs and writes them to Neo4j using GitHub Actions

Hope you enjoyed this expedition in the world of news metadata and it gives you a few ideas on how to start your own Graph Journey on AuraDB Free.

This is already week 24 of our exploration. If you want to see the other videos and datasets check out the page for our livestream and our GitHub repository.

Discover Neo4j AuraDB Free — Week 24 — NYTimes Article Knowledge Graph was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.