Machine Learning, Graphs and the Fake News Epidemic (Part 2)

In last week’s post, we discussed why designing a fully automated fake news detector is currently infeasible and introduced a semi-automated, graph-based solution which would use machine learning to work alongside human fact checkers to scalably flag and quarantine fake news.

This post provides an overview of such a solution: how to build the news graph, and how to use it to leverage the relationships that exist in the news sphere.

Why Use Graph Technology?

Graph databases are excellent at leveraging connected data. Neo4j’s native graph database architecture uses index-free adjacency, pattern-matching searches and graph traversals to scalably perform powerful graph analytics in real time.

To understand how graph technology will serve as the framework of this solution, let’s begin by examining some potential user stories we’d like to implement, and how a graph will empower us to do so.

Say, for example, a user would like to gauge the overall controversiality of an article. To assess an article’s controversiality, we must cross-examine its contents against other articles and find the relative opinions of those articles towards the specific claims it makes.

Essentially, we extend the stance detection task introduced in the Fake News Challenge (FNC), expanding it from merely detecting the stance of an article towards a single claim to the larger task of detecting the stances of a cluster of related articles towards the body of a single article.

To do this, we need to recognize potentially related articles and then identify and compare their stances towards each of the individual claims made by the target article. This may sound like a long leap from the FNC’s single-stance detection problem, but with the introduction of graph technology and storing the right data, it is well within reach.

The Power of Pattern Matching

In the case of our user story, let us define similar articles as any articles which mention an arbitrary number z of entities or topics in common.

By storing the entities and topics mentioned by articles in Neo4j as nodes with connections to their respective articles, we can use Cypher, Neo4j’s graph query language, to search through millions of articles in real time and return all articles similar to a specified article. Where an equivalent SQL SELECT statement can involve multiple JOINs and WHERE clauses, Cypher expresses the same search as a single readable pattern, free of painful JOINs.

The following Cypher query uses pattern matching to return all articles which mention at least two of the same topics or entities as a specified article, a1, with the title The Fake News Epidemic:

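// Paths from a1 to other articles (a2) through a shared topic or entity (n)
MATCH p = (a1:Article {title: 'The Fake News Epidemic'})-[:MENTIONS]->(n)<-[:MENTIONS]-(a2:Article)
WITH a2, count(p) AS commonality
WHERE commonality >= 2
RETURN a2.article_id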

The MATCH statement searches the graph for paths from a1 to other articles, denoted a2, through a mutually mentioned topic or entity, n. Watch the video at the top of this post for a sense of what this pattern might look like in our graph.

This result is then passed to the WITH statement which counts the number of matching paths from each node, a2, and denotes it as the commonality. The final lines return the article_id of all nodes with a commonality of at least two.
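For readers without a running Neo4j instance, the same counting logic can be sketched in plain Python. This is only an illustration of the pattern described above, not how Neo4j executes it: the `mentions` dictionary, the article IDs and the `similar_articles` helper are all hypothetical stand-ins for the graph.

```python
from collections import Counter, defaultdict


def similar_articles(mentions, target_id, min_common=2):
    """Return {article_id: commonality} for articles that share at least
    min_common mentioned topics or entities with the target article."""
    # Invert article -> mentions so each topic/entity points to its articles.
    mentioned_by = defaultdict(set)
    for article_id, items in mentions.items():
        for item in items:
            mentioned_by[item].add(article_id)

    # Each shared item corresponds to one a1 -> n <- a2 path.
    commonality = Counter()
    for item in mentions[target_id]:
        for other_id in mentioned_by[item]:
            if other_id != target_id:
                commonality[other_id] += 1

    # Keep only articles that meet the threshold.
    return {aid: c for aid, c in commonality.items() if c >= min_common}


# Hypothetical toy data standing in for MENTIONS relationships.
mentions = {
    "a1": {"fake news", "Facebook", "elections"},
    "a2": {"fake news", "Facebook", "advertising"},
    "a3": {"elections", "polling"},
    "a4": {"sports"},
}
print(similar_articles(mentions, "a1"))  # {'a2': 2}
```

Lowering `min_common` to 1 would also return `a3`, which shares only the "elections" topic with `a1`.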

Similarly, by extracting important claims from each article and adding them to our graph, we can use Cypher to return all claims made by the article. From there, our problem is again reduced to the simple task of comparing each claim from the article against the body of each article in our first result set.
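The claim-versus-article comparison can be sketched as follows. The stance model itself remains a black box here: `stance_model` is any callable you supply, and the canned lookup below is purely a placeholder. The labels follow the FNC's four stance classes (agree, disagree, discuss, unrelated). The `controversiality` formula is an assumption made for illustration, since this post does not define one.

```python
from itertools import product


def stance_matrix(claims, related_bodies, stance_model):
    """Apply stance_model to every (claim, related article) pair."""
    return {
        (claim, article_id): stance_model(claim, body)
        for claim, (article_id, body) in product(claims, related_bodies.items())
    }


def controversiality(stances):
    """Assumed score: share of 'disagree' among stances that take a side."""
    sided = [s for s in stances.values() if s in ("agree", "disagree")]
    if not sided:
        return 0.0
    return sum(s == "disagree" for s in sided) / len(sided)


# Hypothetical claims and related articles (the first result set).
claims = ["Claim A", "Claim B"]
related_bodies = {"a2": "body of a2", "a3": "body of a3"}

# Placeholder for a trained stance classifier: canned answers only.
canned = {
    ("Claim A", "body of a2"): "agree",
    ("Claim A", "body of a3"): "disagree",
    ("Claim B", "body of a2"): "discuss",
    ("Claim B", "body of a3"): "disagree",
}
stances = stance_matrix(claims, related_bodies, lambda c, b: canned[(c, b)])
score = controversiality(stances)  # 2 of the 3 sided stances disagree
```

Ignoring "discuss" and "unrelated" stances is one possible design choice. A different weighting may suit a real system better.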

An implementation for the stance-detection model – as well as the graph algorithms used for topic, entity and claim extraction – will be discussed in later blog posts. For now, we will think of them as black box operations and continue on to an overview of how the news graph is assembled.

The News Graph

To fully utilize our news graph, we need to structure it to focus on important relationships in our dataset. After adding more nodes to store the authors and sources of the articles in our graph, as well as some useful properties for each of our nodes and relationships, we arrive at the following graph schema.

A fake news detection graph data model

These additional author and source nodes will allow us to extend our measurement for controversiality to those nodes as well.

By traversing out one level from authors and sources on the WROTE and PUBLISHED relationships, respectively, we can average the controversiality of the articles they are connected to in order to gauge their own overall controversiality. We can also use graph clustering methods on these nodes to identify communities which tend to consistently agree with one another.
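The averaging step can be sketched in a few lines of Python. The edge list below stands in for WROTE (or PUBLISHED) relationships, and the per-article scores are made-up numbers; `average_by_neighbor` is a hypothetical helper, not part of any Neo4j API.

```python
from collections import defaultdict
from statistics import mean


def average_by_neighbor(edges, article_scores):
    """Average article scores per author or source.

    edges is an iterable of (author_or_source, article_id) pairs."""
    grouped = defaultdict(list)
    for neighbor, article_id in edges:
        # Skip articles that have not been scored yet.
        if article_id in article_scores:
            grouped[neighbor].append(article_scores[article_id])
    return {neighbor: mean(scores) for neighbor, scores in grouped.items()}


# Hypothetical WROTE relationships and article controversiality scores.
wrote = [("Alice", "a1"), ("Alice", "a2"), ("Bob", "a3"), ("Bob", "a9")]
scores = {"a1": 0.25, "a2": 0.75, "a3": 1.0}

author_scores = average_by_neighbor(wrote, scores)
print(author_scores)  # {'Alice': 0.5, 'Bob': 1.0}
```

The same function works for sources by passing (source, article_id) pairs taken from the PUBLISHED relationships.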

While this is only one potential implementation of a fake news detection graph – with room for modification and improvement – its advantages over a relational model are clear.

Building the News Graph

Getting from a disjoint set of articles to this tightly woven graph, however, requires some additional processing through our “black box” algorithms. Later blog posts will discuss these algorithms in detail, and even include some sample code and results, but for now a general understanding of their purpose in our system will suffice.

The diagram below models the way data flows between our graph and various algorithms in order to construct a database matching the schema we specified earlier. Note that the dotted lines indicate directed data flow rather than graph edges, and that each colored diamond represents an algorithm used to assemble part of our graph, not a node in our database.

Data flow diagram for fake news detection using Neo4j

From start to finish, the construction process takes articles from our database (previously scraped from the web) and runs them through an algorithm, such as topic modeling, to extract the topics in each article. The extracted topics are then fed back into the graph and connected to the original article with the MENTIONS relationship. A similar set of steps is repeated for entity extraction, summary extraction, and each of the other algorithms shown in the data flow diagram.
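One pass of this loop can be sketched as follows. The `extract_topics` function is a deliberately naive, hypothetical stand-in for a real topic model (it only looks for known vocabulary terms), and the resulting triples represent the MENTIONS relationships that would be written back to the graph.

```python
def extract_topics(text, vocabulary):
    """Naive placeholder for topic modeling: keep vocabulary terms
    that appear verbatim in the text."""
    lowered = text.lower()
    return {term for term in vocabulary if term in lowered}


def build_mentions(articles, vocabulary):
    """Return (article_id, 'MENTIONS', topic) triples for each article."""
    edges = []
    for article_id, text in articles.items():
        # Sort topics so the output order is deterministic.
        for topic in sorted(extract_topics(text, vocabulary)):
            edges.append((article_id, "MENTIONS", topic))
    return edges


# Hypothetical scraped articles and topic vocabulary.
articles = {
    "a1": "Fake news spread quickly during the elections.",
    "a2": "New polling data on the elections was released.",
}
vocabulary = {"fake news", "elections", "polling"}

edges = build_mentions(articles, vocabulary)
for edge in edges:
    print(edge)
```

Swapping `extract_topics` for an entity or claim extractor would produce the other relationship sets in the same way.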

What’s Coming Next Week

The next post in the series will show how we can use Cypher to load data into Neo4j and preprocess it to create inputs to our various algorithms.
