Getting Started with the Microsoft Concept Graph in Neo4j


What does the study of concepts (or categories, depending on your field of study) tell us about the human mind?

A result of the Probase research project, the Microsoft Concept Graph harnesses billions of web pages and search logs to build a huge graph of relations between words (like “apple”) and their concepts (like “fruit” or “hardware company”). Using this data, the team at Microsoft hopes to build better search engines, spell-checkers, recommendation engines, taxonomies and more.

This blog post walks through how we can use Neo4j to explore the Single Instance Conceptualization dataset provided by the first release of the Microsoft Concept Graph in late 2016. Specifically, it covers importing the data into Neo4j with neo4j-import and using Cypher to determine when an “apple” means a dessert rather than a particular company. I encourage you to read more in the excellent papers by the Microsoft team, listed at the end of this post.

Concepts embody our knowledge of the kinds of things there are in the world. Tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events.
– Gregory Murphy, The Big Book of Concepts

The question is then: How do we pass human concepts to machines, and how do we enable machines to conceptualize?
– Microsoft Concept Graph

The Model

Concepts, Instances, and IS_A Relationships

The first release of the Microsoft Concept Graph can be summarized as a set of Instance vertices connected to a set of Concept vertices by weighted IS_A edges. Or, in Neo4j terms, Instance nodes connected to Concept nodes by IS_A relationships, each carrying a probability property that denotes the likelihood of the Instance belonging to the Concept. As a result, the relationships between an instance and its concepts show its distribution over the concept vector space. More scoring functions are included in the dataset’s API.

In the dataset, Instances are English noun phrases (NPs) and Concepts are the mental buckets or categories an NP may belong to. For example, instances of the concept snake include the words “boa,” “python,” and “viper,” which are also instances of the concepts artist (p=0.128), language (p=0.557), and car (p=0.107), respectively.

Download & Import: The V1 Release

Download link for the Microsoft Concept Graph:

This first release, called Single Instance Conceptualization, provides the core Is_A data mined from billions of web pages. It contains 5,376,526 unique concepts, 12,501,527 unique instances, and 85,101,174 Is_A relations.

The data is in a single tab-separated file, 330MB zipped and 1.2GB uncompressed, which we can import with neo4j-import (so make sure you’re using the .tar version of Neo4j).

The data in the file is organized by Concept, Instance and Probability, separated by tabs, like so:

state	california	18062
supplement	msm glucosamine sulfate	15942
Important: Note that the probability is out of 10^4.
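
To see why the tab delimiter matters, here is a minimal shell sketch that splits the two sample rows above into their three columns. The describe_rows function name and the output labels are made up for illustration; only the row contents come from the file.

```shell
# describe_rows is a hypothetical helper, not part of the dataset or of Neo4j.
# It reads tab-separated rows on stdin and labels the three columns.
describe_rows() {
  awk -F '\t' '{ printf "concept=%s | instance=%s | value=%s\n", $1, $2, $3 }'
}

# The two sample rows shown above, written with explicit tab separators.
printf 'state\tcalifornia\t18062\nsupplement\tmsm glucosamine sulfate\t15942\n' | describe_rows
```

Because the split happens on tabs only, a multi-word instance such as “msm glucosamine sulfate” stays in a single field, which is also why the import below is run with a TAB delimiter.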

This relatively simple graph can be represented like so:

(:Concept)<-[:IS_A]-(:Instance)-[:IS_A]->(:Concept)

# a quick peek at the data
head -n 10 data-concept-instance-relations.txt

factor	age	35167
free rich company datum	size	33222
free rich company datum	revenue	33185
state	california	18062
supplement	msm glucosamine sulfate	15942
factor	gender	14230
factor	temperature	13660
metal	copper	11142
issue	stress pain depression sickness	11110
variable	age	9375

# extract concepts (this can take a few seconds)
echo "name:ID(Concept)" > concepts.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 1 | sort | uniq >> concepts.txt
# extract instances (this can take a few seconds)
echo "name:ID(Instance)" > instances.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 2 | sort | uniq >> instances.txt

# create the header row for the relationships import
# (probability is typed as int so that it sorts numerically in Cypher)
echo $':END_ID(Concept)\t:START_ID(Instance)\tprobability:int' > is_a.hdr

# import into Neo4j
$NEO4J_HOME/bin/neo4j-import --into concepts.db --id-type string \
  --delimiter TAB --bad-tolerance 100000 --skip-duplicate-nodes true \
  --skip-bad-relationships true --nodes:Concept concepts.txt \
  --nodes:Instance instances.txt \
  --relationships:IS_A is_a.hdr,data-concept-instance-relations.txt


IMPORT DONE in 1m 27s 888ms.
  17878053 nodes
  33377320 relationships
  51255373 properties
Peak memory usage: 410.36 MB

# Add two Constraints/Indexes
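echo $'
CREATE CONSTRAINT ON (c:Concept) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (i:Instance) ASSERT i.name IS UNIQUE;
' | $NEO4J_HOME/bin/neo4j-shell -path concepts.db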
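
The node total reported by the importer can be cross-checked against the release figures, since every node is either a unique concept or a unique instance. The sketch below is only an illustration under that assumption: it repeats the column extraction on five rows from the peek at the data, inside a temporary directory (the sample.txt file name is hypothetical), and then adds up the counts quoted for the full release.

```shell
# Work in a throwaway directory so the real import files are untouched.
workdir=$(mktemp -d)

# Five sample rows taken from the peek at the data above.
printf 'factor\tage\t35167\nstate\tcalifornia\t18062\nfactor\tgender\t14230\nmetal\tcopper\t11142\nvariable\tage\t9375\n' > "$workdir/sample.txt"

# Column 1 holds concepts, column 2 holds instances (tab is the default delimiter of cut).
concepts=$(cut -f 1 "$workdir/sample.txt" | sort -u | wc -l | tr -d ' ')
instances=$(cut -f 2 "$workdir/sample.txt" | sort -u | wc -l | tr -d ' ')
echo "sample: $concepts concepts + $instances instances = $((concepts + instances)) nodes"

# The same arithmetic with the counts quoted for the full V1 release.
echo "full release: $((5376526 + 12501527)) nodes"

rm -rf "$workdir"
```

The second line prints 17878053, which matches the node count in the import output above.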

Now that you’ve created the concepts.db graph, you can move it to $NEO4J_HOME/data/databases and update $NEO4J_HOME/conf/neo4j.conf to mount concepts.db:

# The name of the database to mount
dbms.active_database=concepts.db

You should now be able to start the Neo4j Browser and see the Concept Graph.

Let’s Explore the Concept Graph

How is the word “apple” represented in the concept space?

MATCH (i:Instance {name:"apple"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, toFloat(r.probability)/10000 AS `is a(n)`, c.name AS Concept
ORDER BY `is a(n)` DESC
LIMIT 10

| Instance | is a(n) | Concept        |
| "apple"  | 0.6315  | "fruit"        |
| "apple"  | 0.4353  | "company"      |
| "apple"  | 0.1152  | "food"         |
| "apple"  | 0.0764  | "brand"        |
| "apple"  | 0.0750  | "fresh fruit"  |
| "apple"  | 0.0568  | "fruit tree"   |
| "apple"  | 0.0483  | "crop"         |
| "apple"  | 0.0280  | "corporation"  |
| "apple"  | 0.0279  | "manufacturer" |
| "apple"  | 0.0257  | "firm"         |

How is the word “pie” represented in the concept space?

MATCH (i:Instance {name:"pie"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, toFloat(r.probability)/10000 AS `is a(n)`, c.name AS Concept
ORDER BY `is a(n)` DESC
LIMIT 10

| Instance | is a(n) | Concept       |
| "pie"    | 0.0256  | "food"        |
| "pie"    | 0.0245  | "dessert"     |
| "pie"    | 0.0208  | "baked goods" |
| "pie"    | 0.018   | "bakery item" |
| "pie"    | 0.0105  | "baked good"  |
| "pie"    | 0.0097  | "item"        |
| "pie"    | 0.0087  | "product"     |
| "pie"    | 0.0054  | "food item"   |
| "pie"    | 0.0041  | "sweet"       |
| "pie"    | 0.0041  | "dish"        |
10 rows
9321 ms

Adding some context: What Concepts represent both an apple and a pie?

We want to be very sure we’re talking about apple in the sense of the food, not Apple in the sense of the company.

MATCH (a:Instance {name:"apple"})-[r1:IS_A]->(c:Concept)<-[r2:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
RETURN c.name AS Concept, toFloat(r1.probability)*toFloat(r2.probability)*10^-8 AS prob
ORDER BY prob DESC
LIMIT 10

| Concept     | prob                  |
| "food"      | 0.00294912            |
| "item"      | 2.4056000000000001E-4 |
| "product"   | 1.5747E-4             |
| "fruit"     | 6.315E-5              |
| "snack"     | 3.959E-5              |
| "food item" | 3.51E-5               |
| "dessert"   | 3.43E-5               |
| "name"      | 7.92E-6               |
| "dish"      | 4.92E-6               |
| "case"      | 3.92E-6               |

10 rows
15 ms
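
As a worked check of the first row, assume the raw IS_A values are integers out of 10^4, as noted earlier. The apple and pie tables list “food” at 0.1152 and 0.0256, which correspond to raw values of 1152 and 256. The joint_score function below is a hypothetical helper written for this sketch, not something provided by Neo4j or the dataset.

```shell
# joint_score is a hypothetical helper name used only for this sketch.
# It rescales two raw IS_A values (integers out of 10^4) and multiplies them.
joint_score() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.8f\n", (a / 10000) * (b / 10000) }'
}

# Raw values for apple/food (1152) and pie/food (256) from the tables above.
joint_score 1152 256
```

This prints 0.00294912, the prob value shown for “food” in the result table.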

Adding some context: What instances are similar to both apples and pies?

We can go even further and look at the other instances of those shared concepts, aggregating by instance instead of relying only on the probabilities stored on the IS_A relationships. This lets us deduce that things that are both apples and pies are bread-like, fruit-based cakes.

MATCH (a:Instance {name:"apple"})-[:IS_A]->(c:Concept)<-[:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
MATCH (c)<-[:IS_A]-(o:Instance) WHERE o <> a AND o <> b
WITH o, count(*) AS freq
RETURN o.name AS Instance, freq
ORDER BY freq DESC
LIMIT 10;

| Instance    | freq |
| "bread"     | 115  |
| "fruit"     | 113  |
| "cake"      | 110  |
| "cookie"    | 109  |
| "chocolate" | 102  |
| "cheese"    | 99   |
| "vegetable" | 93   |
| "egg"       | 93   |
| "banana"    | 91   |
| "fish"      | 91   |
10 rows
4900 ms
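
The aggregation in that query can be sketched outside the database on a handful of rows. The rows below are toy data invented for illustration, not taken from the dataset, and shared_neighbours is a hypothetical helper name. The sketch keeps only the concepts that contain both instances, then counts how many of those concepts each other instance belongs to.

```shell
# shared_neighbours A B: reads concept<TAB>instance rows on stdin and prints,
# for every other instance, the number of concepts it shares with both A and B.
shared_neighbours() {
  LC_ALL=C awk -F '\t' -v a="$1" -v b="$2" '
    {
      # Remember every instance per concept, and which concepts contain A or B.
      members[$1] = members[$1] "\t" $2
      if ($2 == a) { has_a[$1] = 1 }
      if ($2 == b) { has_b[$1] = 1 }
    }
    END {
      for (c in members) {
        # Keep only the concepts shared by A and B.
        if (!(c in has_a) || !(c in has_b)) { continue }
        n = split(members[c], names, "\t")
        # names[1] is empty because of the leading tab, so start at 2.
        for (i = 2; i <= n; i++) {
          if (names[i] != a && names[i] != b) { freq[names[i]]++ }
        }
      }
      for (o in freq) { printf "%s\t%d\n", o, freq[o] }
    }' | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Toy rows (illustrative only): "company" contains apple but not pie, so it is ignored.
printf 'food\tapple\nfood\tpie\nfood\tbread\ndessert\tapple\ndessert\tpie\ndessert\tcake\ndessert\tbread\ncompany\tapple\ncompany\tgoogle\n' | shared_neighbours apple pie
```

On the toy rows, bread is counted twice (food and dessert) and cake once, mirroring how freq is computed per instance in the Cypher query.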


Although the Microsoft Concept Graph is currently a bit sparser than other concept graphs online, the research that created it is a valuable addition to the study of taxonomy and language.


    • Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
    • Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.
