
News with words in common

Introduction

This graph was created for the master's thesis project of a Brazilian student. She is researching similar news stories published on different web sites in Brazil. At this initial stage, the news items are compared word by word.

The next step will be to apply Natural Language Processing techniques such as stemming, semantic search, and bag of words with distance vectors for similarity, among others.

The Project

Table 1. Basic structure
News HAS_SAME_TOKEN News

{"date": "2015-01-09 00:00:00", "journal": "r7", "title": "Polícia mata dois terroristas suspeitos de atentado à revista em Paris", "url": "https://noticias.r7.com/jornal-da-record/videos/policia-mata-dois-terroristas-suspeitos-de-atentado-a-revista-em-paris-13042015" }

{token:"polícia"}

{"date": "2015-01-09 00:00:00", "journal": "jn", "title": "Polícia mata irmãos terroristas Kouachi após caçada na França", "url": "https://g1.globo.com/jornal-nacional/noticia/2015/01/policia-mata-irmaos-terroristas-kouachi-apos-cacada-na-franca.html"}

The graph has a single relationship type between News nodes, HAS_SAME_TOKEN. Each relationship carries a token property, which holds the word that the two news items have in common.
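As an illustration of what a HAS_SAME_TOKEN relationship represents, the sketch below finds the words shared by the two example titles from Table 1. It assumes tokens are taken from titles only and split on word boundaries; the function names `tokenize` and `common_tokens` are hypothetical and are not taken from the project, which used Lucene for this step.

```python
import re
import unicodedata


def tokenize(title):
    """Normalize to NFC, lowercase, and return the set of word tokens."""
    normalized = unicodedata.normalize("NFC", title).lower()
    return set(re.findall(r"\w+", normalized))


def common_tokens(title_a, title_b):
    """Return the sorted words that appear in both titles."""
    return sorted(tokenize(title_a) & tokenize(title_b))


# The two example titles from Table 1.
r7 = "Polícia mata dois terroristas suspeitos de atentado à revista em Paris"
jn = "Polícia mata irmãos terroristas Kouachi após caçada na França"

# Each shared word would become one HAS_SAME_TOKEN relationship
# between the two News nodes, with the word stored as `token`.
shared = common_tokens(r7, jn)
print(shared)
```

Under these assumptions the two titles share three words, so the pair of News nodes would be connected by three relationships, one of them with token "polícia".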

Crawlers written in Python with Scrapy collected news from a set of Brazilian news web sites for the period between 2014-01-01 and 2015-02-28, saving the items as JSON files. A list of stop words was used to filter undesired words from the analysis (e.g. é, são, ser, algum).
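The stop-word filtering can be sketched as follows. This is a minimal example: the set contains only the four example words quoted above (the project's actual list is not shown here and is presumably longer), and `remove_stop_words` is a hypothetical helper name.

```python
# Only the examples quoted in the text; the real list is assumed to be longer.
STOP_WORDS = {"é", "são", "ser", "algum"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["Polícia", "é", "algum", "atentado", "são", "Paris"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Filtering before comparison keeps very common words from linking almost every pair of news items.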

The data was then indexed with Lucene. The Lucene index was queried and the results were stored as CSV. Finally, the CSV files were loaded into the Neo4j graph.
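To give an idea of what the intermediate CSV might look like, the sketch below writes one row per pair of news items and shared token. It deliberately replaces the Lucene query step with a plain set intersection, and the column names (`source_url`, `target_url`, `token`), the example URLs, and the function `edges_to_csv` are assumptions made for illustration rather than the project's actual format.

```python
import csv
import io
from itertools import combinations


def edges_to_csv(news_tokens):
    """Build CSV text with one row per (news pair, shared token).

    `news_tokens` maps a news URL to its set of tokens (hypothetical layout).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_url", "target_url", "token"])
    # Visit every unordered pair of news items exactly once.
    for url_a, url_b in combinations(sorted(news_tokens), 2):
        for token in sorted(news_tokens[url_a] & news_tokens[url_b]):
            writer.writerow([url_a, url_b, token])
    return buffer.getvalue()


# Made-up sample data, not taken from the dataset.
sample = {
    "https://example.org/a": {"mata", "paris"},
    "https://example.org/b": {"mata", "frança"},
    "https://example.org/c": {"paris"},
}
csv_text = edges_to_csv(sample)
print(csv_text)
```

A file in this shape could then be imported with Cypher's LOAD CSV clause, creating one HAS_SAME_TOKEN relationship per row.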

Creating database

The dataset in this GraphGist contains data for a single day only, 2015-01-09. The complete dataset, with 7398 news items (nodes) and 22998 edges, can be found at https://github.com/kinow/crawlers-noticias.

Finding news with words in common

The query below displays news items that share the word 'polícia' ("police" in Portuguese). News items were already aggregated per day when the data was first collected. Because the full result would be too large to visualize easily, the query also filters by the web site SBT (journal 'sbt').

MATCH (n1:News)-[r1:HAS_SAME_TOKEN]->(n2:News) WHERE r1.token = 'polícia' AND n1.journal = 'sbt' RETURN *

The query below returns the same result as a table, this time with all the web sites included.

MATCH (n1:News)-[r1:HAS_SAME_TOKEN]->(n2:News) WHERE r1.token = 'polícia' RETURN n1.date AS DATE, n1.journal, n1.title, r1.token AS COMMON_WORD, n2.journal, n2.title

Neo4j proved well suited to quickly modeling the data collected by the crawlers and displaying it on a web interface. Initial tests with a relational database showed that it would require a more complex model and further tuning in order to serve the same data.