Neo4j Provides Natural Language Processing at Scale, Making Equipment Repair More Efficient
Challenge
Any time a Caterpillar machine is brought in for repair or maintenance, a technician creates a
warranty document that chronicles the complaint, an analysis of the problem, and the solution.
The result is a large-scale repository of technical documents, much of it of high quality by
labeling and computational linguistics standards. However, there was still a great deal of
disparate data to connect.
The company recognized there was valuable data housed in more than 27 million documents
and set about creating an NLP tool to uncover the unseen connections and trends within them.
For the past decade, the company had been exploring NLP for purposes such as vehicle
maintenance and supply chain management. But even though a large percentage of the data
could be mapped correctly in some domains, that alone did not let them represent this
knowledge and leverage it in a meaningful way.
“We wanted to create a system that would allow someone to ask any type of question as long as
it was in the domain,” said Ryan Chandler, Chief Data Scientist at Caterpillar. “This meant creating
a dialog system to test the use of a graph, demonstrating an open-ended user interface capable of
answering questions, and developing the capability for a spoken human-machine interface.”
Solution
Because a graph is the lowest level of structure and provides massive flexibility, graph databases
are a natural fit for language processing and machine learning.
Language processing output is often broken down into one of two representations: a
dependency structure, which starts from the verb and draws arcs relating each of the other
words back to it, or a constituency tree. Both of these structures are graphs.
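To make that concrete, here is a minimal sketch of a dependency parse rendered as graph
edges, using the open-source spaCy library as an illustrative stand-in (the case study itself
uses the Stanford Dependency Parser):

# Sketch only: spaCy is an assumption for illustration, not the parser
# named in the case study. Requires: pip install spacy, then
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The technician replaced the faulty fuel injector.")

# Each token points to its syntactic head, so the (head, relation,
# dependent) triples printed below are exactly the edges of a graph.
for token in doc:
    print(f"{token.head.text} -[{token.dep_}]-> {token.text}")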
Caterpillar employed Neo4j for graph data structures to create a logical form of knowledge.
This NoSQL alternative to relational databases allowed them to build ontologies and perform
deduction.
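As a hedged sketch of what that looks like in practice, parsed triples can be persisted as
nodes and relationships with the official Neo4j Python driver; the connection details, labels,
and property names below are hypothetical, not Caterpillar’s schema:

# Illustrative only: the URI, credentials, labels, and properties are
# invented for this sketch. Requires: pip install neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def add_dependency(tx, head, relation, dependent):
    # MERGE keeps each word node unique; the edge carries the
    # grammatical relation produced by the parser.
    tx.run(
        "MERGE (h:Word {text: $head}) "
        "MERGE (d:Word {text: $dep}) "
        "MERGE (h)-[:DEPENDS_ON {relation: $rel}]->(d)",
        head=head, rel=relation, dep=dependent,
    )

with driver.session() as session:
    session.execute_write(add_dependency, "replaced", "dobj", "injector")
driver.close()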
To get from natural language to graph query results, the team created a data architecture that
ingests text via an open-source NLP toolkit, which uses Python to combine sentences into
strings, correct sentence boundaries, and strip “garbage” from the text. Data is also imported
from SAP ERP systems, as well as non-SAP ERP systems.
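A minimal sketch of that kind of pre-processing, assuming NLTK’s sentence tokenizer (the
case study does not name the specific toolkit) and a deliberately simple garbage filter:

# Sketch only: the actual open-source toolkit is not named in the
# case study. Requires: pip install nltk, then nltk.download("punkt")
import re
from nltk.tokenize import sent_tokenize

def clean_document(raw_text: str) -> list[str]:
    # Normalize whitespace so line breaks inside a sentence do not
    # create false sentence boundaries.
    text = re.sub(r"\s+", " ", raw_text).strip()
    sentences = sent_tokenize(text)
    # Omit "garbage" fragments, e.g. strings with no real words.
    return [s for s in sentences if re.search(r"[A-Za-z]{2,}", s)]

print(clean_document("CODE 1923!!  Fuel pressure low.\nReplaced pump."))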
The Machine Learning Classification tool learns from the portion of the data already tagged
with terms such as cause or complaint, then applies those labels to the rest of the data.
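A minimal scikit-learn sketch of that workflow, training on a hand-tagged portion and then
labeling the remainder; the example sentences and training set here are invented for
illustration, and only the cause/complaint tags come from the case study:

# Hypothetical training data. Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tagged_sentences = [
    "Customer reports engine will not start",
    "Fuel pump relay found corroded",
    "Operator states hydraulic arm is slow",
    "Worn seal allowed a hydraulic fluid leak",
]
tags = ["complaint", "cause", "complaint", "cause"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(tagged_sentences, tags)

# Apply the learned tags to the untagged remainder of the corpus.
print(model.predict(["Machine stalls under load"]))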
It uses WordNet as a lexicographic dictionary to provide definitions for the words, the Stanford
Dependency Parser to parse the text, and Neo4j to find patterns and connections, build
hierarchies, and add ontologies.
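For instance, a WordNet lookup through NLTK returns the definitions and hypernym (“is-a”)
hierarchy that can be layered onto the graph as an ontology; this is a sketch, not the
production pipeline:

# Sketch of a WordNet lookup. Requires: pip install nltk, then
# nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("engine")[:2]:
    print(synset.name(), "-", synset.definition())
    # One hypernym step up the is-a hierarchy (e.g. engine -> motor).
    for hyper in synset.hypernyms():
        print("  is-a:", hyper.name())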
Once this is all put together, users can conduct meaningful searches with simple Cypher queries.
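For example, a query against the hypothetical schema sketched earlier might retrieve every
documented cause linked to a given complaint; the node labels and relationship type are
illustrative, not Caterpillar’s actual model:

# Illustrative query only; Complaint, Cause, and HAS_CAUSE are
# invented names. Requires: pip install neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
query = (
    "MATCH (c:Complaint {text: $text})-[:HAS_CAUSE]->(cause:Cause) "
    "RETURN cause.text AS cause"
)
with driver.session() as session:
    for record in session.run(query, text="engine will not start"):
        print(record["cause"])
driver.close()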