Non-Text Discovery with ConceptNet as a Neo4j Database [Community Post]


[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

The Problem of Discovery


Discovery, especially non-text discovery, is hard.

When looking for a cool T-shirt, for example, I might not know exactly what I want, only that it’s a gift, that it should be a little mathy, and that it should reflect my friend’s love of nature.

As a retailer, I might notice that geometric nature products are quite popular and want to capitalize on that, both by marketing the more general “math/nature” theme to buyers who have shown an affinity for mathy animal shirts and by improving the browsing experience for new visitors to my site.

Many retail sites with user-generated content rely on user-generated tags to classify image-driven products. However, the quality and number of tags on each item vary widely, and the site depends on each item’s creator and on its own administrators to curate the tags and sort items into browsable categories.

On Threadless, for example, this awesome item has a rich set of tags:

lim heng swee, ilovedoodle, cats, lol, funny, humor, food, foodies, food with faces, pets, meow, ice cream, desserts, awww, puns, punny, wordplay, v-necks, vnecks, tanks, tank tops, crew sweatshirts, Cute

In contrast, this beautiful item has only a handful:

jimena salas, jimenasalas, funded, birds, animals, geometric shapes, abstract, Patterns

Furthermore, although a human could easily classify an image with the tags [ants, anthill, abstract, goofy] as probably belonging to the “funny animals” category, an automated system would have to know that ants are animals and that goofy is a synonym for funny.
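
To make that concrete, here is a minimal sketch of the kind of lookup an automated system would need. The three assertions are hand-written and hypothetical (they only imitate the shape of ConceptNet edges), and the function names are made up for this illustration:

```python
from collections import defaultdict

# Hypothetical, hand-written assertions in the shape (start, relation, end).
# In a real system these would come from the ConceptNet data.
ASSERTIONS = [
    ("ants", "IsA", "animals"),
    ("anthill", "RelatedTo", "ants"),
    ("goofy", "Synonym", "funny"),
]

def expand_tags(tags, assertions=ASSERTIONS):
    """Return the tags plus every concept reachable over IsA or Synonym edges."""
    neighbours = defaultdict(set)
    for start, rel, end in assertions:
        if rel in ("IsA", "Synonym"):
            neighbours[start].add(end)
        if rel == "Synonym":
            neighbours[end].add(start)  # treat synonyms as symmetric
    seen = set(tags)
    frontier = list(tags)
    while frontier:
        for nxt in neighbours[frontier.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def matches_category(tags, category_terms):
    """True when the expanded tag set covers every term of the category."""
    return set(category_terms) <= expand_tags(tags)

print(matches_category(["ants", "anthill", "abstract", "goofy"], ["funny", "animals"]))  # True
```

Without the IsA and Synonym edges, the same tag list would never reach “funny” or “animals”, which is exactly the gap a semantic network fills.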

Knowing this, how would a retail site quickly and cheaply implement intelligent categorization and tag curation? With ConceptNet5 and (of course) Neo4j.


ConceptNet5


This article introduces the ConceptNet dataset and describes how to import the data into a Neo4j database.

To paraphrase the ConceptNet5 website, ConceptNet5 is a semantic network built from nodes representing words or short phrases of natural language (“terms” or “concepts”), and the relationships (“associations”) between them.

Armed with this information, a system can take human words as input and use them to better search for information, answer questions and understand user goals.

For example, take a look at toast in the ConceptNet5 web demo:

[Screenshot: the ConceptNet5 web demo entry for “toast”]


This looks remarkably similar to a graph model. The dataset is incredibly rich, including (in the JSON) the “sense” of toast as a bread and also as a drink one has in tribute.

Let’s take a look at the JSON response for one ConceptNet edge (the association between two concepts) and import some data into a Neo4j database for exploration:

{
     "edges":
     [
          {
               "context": "/ctx/all",
               "dataset": "/d/globalmind",
               "end": "/c/en/bread",
               "features":
               [
                    "/c/en/toast /r/IsA -",
                    "/c/en/toast - /c/en/bread",
                    "- /r/IsA /c/en/bread"
               ],
               "id": "/e/ff9b268e050d62255f236f35ba104300551b8a3b",
               "license": "/l/CC/By-SA",
               "rel": "/r/IsA",
               "source_uri": "/or/[/and/[/s/activity/globalmind/assert/,/s/contributor/omcs/bugmenot/]/,/s/umbel/2013/]",
               "sources":
               [
                    "/s/activity/globalmind/assert",
                    "/s/contributor/omcs/bugmenot",
                    "/s/umbel/2013"
               ],
               "start": "/c/en/toast",
               "surfaceText": "Kinds of [[bread]] : [[toast]]",
               "uri": "/a/[/r/IsA/,/c/en/toast/,/c/en/bread/]",
               "weight": 3
          }
     ]
}

Modeling the Database


For the purposes of this example, let’s model the database to have the following properties:

Term Nodes:
    • concept
    • language
    • partOfSpeech
    • sense
Association Relationships:
    • type
    • weight
    • surfaceText
An alternate model could use the ConceptNet relation (IsA, Synonym and so on) as the Neo4j relationship type instead of storing it in a “type” property, but for the sake of this blog post let’s keep types as properties. This allows us to explore the ConceptNet database without making assumptions about the types of relationships in the dataset.
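
As a rough illustration of this model, the sketch below maps one ConceptNet edge onto the Term and association properties listed above. It assumes concept URIs follow the layout /c/language/concept/partOfSpeech/sense, with the last two segments optional, which is the layout the import query in the next section relies on. The helper names parse_term and edge_to_model are hypothetical:

```python
def parse_term(uri):
    """Split a concept URI such as /c/en/toast into the Term node properties."""
    parts = uri.split("/")  # "/c/en/toast" -> ["", "c", "en", "toast"]

    def part(index):
        # Optional segments (part of speech, sense) default to an empty string.
        return parts[index] if len(parts) > index else ""

    return {
        "concept": part(3),
        "language": part(2),
        "partOfSpeech": part(4),
        "sense": part(5),
    }

def edge_to_model(edge):
    """Map one ConceptNet edge dict onto two Term nodes and one association."""
    return {
        "start": parse_term(edge["start"]),
        "end": parse_term(edge["end"]),
        "association": {
            "type": edge["rel"].split("/")[2],  # "/r/IsA" -> "IsA"
            "weight": edge["weight"],
            "surfaceText": edge.get("surfaceText") or "",
        },
    }

# Fields taken from the sample edge shown earlier.
sample_edge = {
    "start": "/c/en/toast",
    "end": "/c/en/bread",
    "rel": "/r/IsA",
    "weight": 3,
    "surfaceText": "Kinds of [[bread]] : [[toast]]",
}
model = edge_to_model(sample_edge)
print(model["start"])
print(model["association"]["type"], model["association"]["weight"])
```

Keeping the relation in a plain “type” property means this mapping never has to know the set of relation names in advance.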

Loading the Data into the Database


Let’s use the following Python script to upload some sample data:

import requests
# authenticate() and graph.cypher.execute() belong to the py2neo 2.x API
from py2neo import authenticate, Graph

USERNAME = "neo4j"  # use your actual username
PASSWORD = "12345678"  # use your actual password
authenticate("localhost:7474", USERNAME, PASSWORD)
graph = Graph()

# Sample tags to look up in ConceptNet.
sample_tags = [
    'fruit', 'orange', 'bikes', 'cream', 'nature', 'toast', 'electronic',
    'techno', 'house', 'dubstep', 'drum_and_bass', 'space_rock',
    'psychedelic_rock', 'psytrance', 'garage', 'progressive', 'Cologne',
    'North_Rhine-Westphalia', 'gothic_rock', 'darkwave', 'goth', 'geometric',
    'nature', 'skylines', 'landscapes', 'mountains', 'trees', 'silhouettes',
    'back_in_stock', 'Patterns', 'raglans', 'giraffes', 'animals', 'nature',
    'tangled', 'funny', 'cute', 'krautrock',
]

# Build query. Note: {json} is the legacy Cypher parameter syntax; Neo4j 4.0 and later use $json.
query = """
WITH {json} AS document
UNWIND document.edges AS edges
WITH 
SPLIT(edges.start,"/")[3] AS startConcept,
SPLIT(edges.start,"/")[2] AS startLanguage,
CASE WHEN SPLIT(edges.start,"/")[4] <> "" THEN SPLIT(edges.start,"/")[4] ELSE "" END AS startPartOfSpeech,
CASE WHEN SPLIT(edges.start,"/")[5] <> "" THEN SPLIT(edges.start,"/")[5] ELSE "" END AS startSense,
SPLIT(edges.rel,"/")[2] AS relType,
CASE WHEN edges.surfaceText <> "" THEN edges.surfaceText ELSE "" END AS surfaceText,
edges.weight AS weight,
SPLIT(edges.end,"/")[3] AS endConcept,
SPLIT(edges.end,"/")[2] AS endLanguage,
CASE WHEN SPLIT(edges.end,"/")[4] <> "" THEN SPLIT(edges.end,"/")[4] ELSE "" END AS endPartOfSpeech,
CASE WHEN SPLIT(edges.end,"/")[5] <> "" THEN SPLIT(edges.end,"/")[5] ELSE "" END AS endSense
MERGE (start:Term {concept:startConcept, language:startLanguage, partOfSpeech:startPartOfSpeech, sense:startSense})
MERGE (end:Term  {concept:endConcept, language:endLanguage, partOfSpeech:endPartOfSpeech, sense:endSense})
MERGE (start)-[r:ASSERTION {type:relType, weight:weight, surfaceText:surfaceText}]->(end)
"""

# Use the concept lookup endpoint to load data into the graph
for tag in sample_tags:
    searchURL = "https://conceptnet5.media.mit.edu/data/5.4/c/en/" + tag + "?limit=500"
    searchJSON = requests.get(searchURL, headers={"accept": "application/json"}).json()
    graph.cypher.execute(query, json=searchJSON)
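
Real-world tags such as “food with faces” or “ice cream” contain spaces and mixed case, so they cannot be dropped into the URL as-is. The helper below is a minimal sketch of one way to prepare them. It assumes that ConceptNet terms are lowercase with words joined by underscores (as sample tags like drum_and_bass suggest), and tag_to_lookup_url is a hypothetical name, not part of any library:

```python
from urllib.parse import quote

# Same base URL as the loading script above.
BASE_URL = "https://conceptnet5.media.mit.edu/data/5.4/c/en/"

def tag_to_lookup_url(tag, limit=500):
    """Build a lookup URL from a free-form tag.

    Assumption: terms are lowercase and use underscores instead of spaces.
    """
    term = "_".join(tag.strip().lower().split())
    return BASE_URL + quote(term, safe="") + "?limit=" + str(limit)

print(tag_to_lookup_url("food with faces"))
# https://conceptnet5.media.mit.edu/data/5.4/c/en/food_with_faces?limit=500
```

Percent-encoding the term also keeps stray characters in user-generated tags from breaking the request URL.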

Exploring the Data


Use the following Cypher query to explore the data:

MATCH (n:Term {language:'en'})-[r:ASSERTION]->(m:Term {language:'en'})
WHERE 
NOT r.type = 'dbpedia' AND
NOT r.surfaceText = '' AND
NOT n.partOfSpeech = '' AND
NOT n.sense = ''
RETURN n.concept AS `Start Concept`, n.sense AS `in the sense of`, r.type, m.concept AS `End Concept`, m.sense AS `End Sense`
ORDER BY r.weight DESC, n.sense ASC
LIMIT 10

The ConceptNet dataset is incredibly rich: it captures the various “senses” in which someone might mean a term such as “orange” and offers a wide variety of relationship types to choose from.

    | Start Concept | in the sense of                                         | r.type     | End Concept     | End Sense
----+---------------+---------------------------------------------------------+------------+-----------------+-----------
  1 | orange        | colour                                                  | IsA        | color           |
  2 | orange        | film                                                    | InstanceOf | film            |
  3 | dynamic       | a_characteristic_or_manner_of_an_interaction_a_behavior | Synonym    | nature          |
  4 | garage        | a_petrol_filling_station                                | Synonym    | petrol_station  |
  5 | garage        | a_petrol_filling_station                                | Synonym    | fill_station    |
  6 | garage        | a_petrol_filling_station                                | Synonym    | gas_station     |
  7 | progressive   | advancing_in_severity                                   | Antonym    | non_progressive |
  8 | shop          | automobile_mechanic's_workplace                         | Synonym    | garage          |
  9 | electronic    | band                                                    | IsA        | band            |
 10 | cream         | band                                                    | IsA        | band            |
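
One simple thing to do with rows like these is to flag tags that carry more than one sense, since those need disambiguation before they can drive a category. The sketch below works on a few rows copied from the table above; the function names are hypothetical:

```python
from collections import defaultdict

# (start concept, sense, relation type, end concept), copied from the result table.
rows = [
    ("orange", "colour", "IsA", "color"),
    ("orange", "film", "InstanceOf", "film"),
    ("garage", "a_petrol_filling_station", "Synonym", "petrol_station"),
    ("garage", "a_petrol_filling_station", "Synonym", "gas_station"),
    ("cream", "band", "IsA", "band"),
]

def senses_by_concept(rows):
    """Group the distinct senses seen for each start concept."""
    senses = defaultdict(set)
    for concept, sense, _rel, _end in rows:
        senses[concept].add(sense)
    return {concept: sorted(found) for concept, found in senses.items()}

def ambiguous_concepts(rows):
    """Concepts that appear with more than one sense in the given rows."""
    return sorted(c for c, found in senses_by_concept(rows).items() if len(found) > 1)

print(senses_by_concept(rows)["orange"])  # ['colour', 'film']
print(ambiguous_concepts(rows))           # ['orange']
```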

Use Cases and Future Directions


Once loaded into a graph database, the ConceptNet5 data takes the agony out of tag-based recommendations and categorization.

Small retail and social startups can integrate a Neo4j microservice into their existing stack and use it to power recommendations and to gain insight into the most effective way to categorize products (should “funny cats” have their own first-level category, or should they go under “animals”?). That leaves more time and budget for richer innovations.


