Neo4j Provides Natural Language Processing at Scale, Making Equipment Repair More Efficient
The Challenge
Any time a Caterpillar machine is brought in for repair or maintenance, a technician creates a warranty document that chronicles the complaint, an analysis of the problem and the solution.
Caterpillar has a large-scale repository of technical documents, much of it of high quality by labeling and computational-linguistics standards. However, there was a lot of disparate data to connect.
The company recognized there was valuable data housed in more than 27 million documents and set about creating an NLP tool to uncover these unseen connections and trends.
The company had already been exploring NLP for the last decade, for purposes such as vehicle maintenance and supply chain management. But even though a large percentage of the data in some domains could be mapped correctly, that alone did not allow the company to represent this knowledge and leverage it in a meaningful way.
“We wanted to create a system that would allow someone to ask any type of question as long as it was in the domain,” said Ryan Chandler, Chief Data Scientist at Caterpillar. “This meant creating a dialog system to test the use of a graph, to demonstrate an open-ended user interface capable of answering questions and to develop the capability to create a spoken human-machine interface.”
The Solution
Because a graph is the lowest level of structure and provides massive flexibility, graph databases are a natural fit for language processing and machine learning.
Language processing output is often represented either as a dependency structure, which takes the verb as the head and draws arcs from it to the other words according to their relation to the verb, or as a constituency tree. Both of these structures are graphs.
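To make the point concrete, here is a minimal sketch (plain Python, no parser dependency) of a dependency parse represented as a graph: words are nodes and labeled arcs run from the verb to its dependents. The sentence and relation labels are illustrative, not the output of any particular parser.

```python
# Each edge: (head, dependency relation, dependent).
# The verb "replaced" is the head of the sentence.
sentence = "technician replaced the hydraulic pump"

dependency_edges = [
    ("replaced", "nsubj", "technician"),   # who performed the action
    ("replaced", "obj",   "pump"),         # what was acted on
    ("pump",     "amod",  "hydraulic"),    # modifier of the object
    ("pump",     "det",   "the"),
]

def dependents_of(head, edges):
    """Return (relation, dependent) pairs for a given head word."""
    return [(rel, dep) for h, rel, dep in edges if h == head]

print(dependents_of("replaced", dependency_edges))
```

Because the parse is already nodes and edges, it maps directly onto a graph database with no intermediate translation.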
Caterpillar employed Neo4j's graph data structures to create a logical form of knowledge. This NoSQL alternative to relational databases allowed the team to build ontologies and perform deduction.
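What deduction over an ontology means in practice can be sketched in a few lines: IS-A edges form a hierarchy, and transitive traversal answers questions the raw data never states directly. The part names below are hypothetical examples, not Caterpillar's actual ontology.

```python
# A tiny IS-A hierarchy (child -> parent), as it might be stored
# as relationships in a graph database.
is_a = {
    "hydraulic pump": "pump",
    "pump": "component",
    "fuel injector": "component",
    "component": "part",
}

def ancestors(term):
    """Walk IS-A edges upward, collecting every broader category."""
    result = []
    while term in is_a:
        term = is_a[term]
        result.append(term)
    return result

def is_kind_of(term, category):
    """Deduce a membership fact not stated explicitly in the data."""
    return category in ancestors(term)

print(is_kind_of("hydraulic pump", "part"))  # deduced transitively
```

Nothing in the data says a hydraulic pump is a part; the graph traversal deduces it from the chain of IS-A edges.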
To get from natural language to graph query results, the team created a data architecture that ingests text via an open-source NLP toolkit, which uses Python to combine sentences into strings, correct sentence boundaries and omit “garbage” in the text. Data is also imported from SAP and non-SAP ERP systems.
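The cleanup step might look roughly like the following sketch: join fragments into one string, strip garbage characters, and fix a missing sentence boundary. The rules are illustrative assumptions, not the actual toolkit's logic.

```python
import re

# Stray non-text characters are treated as "garbage" to omit.
GARBAGE = re.compile(r"[^\w\s.,;:!?'-]")

def clean_text(fragments):
    """Combine text fragments into one cleaned sentence string."""
    text = " ".join(f.strip() for f in fragments if f.strip())
    text = GARBAGE.sub("", text)        # omit garbage characters
    text = re.sub(r"\s+", " ", text)    # collapse whitespace
    text = text.strip()
    if text and text[-1] not in ".!?":  # correct a missing boundary
        text += "."
    return text

print(clean_text(["Pump seal leaking", "@@@", "replaced seal"]))
```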
The Machine Learning Classification tool learns from the portion of the data already tagged with terms such as “cause” or “complaint,” and applies those labels to the rest of the data.
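The idea of learning labels from the tagged portion and applying them to untagged text can be sketched as follows. A real system would use a proper statistical classifier; this stand-in scores untagged sentences by word frequency per label, and the training examples are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical examples of the already-tagged portion of the data.
tagged = [
    ("engine will not start", "complaint"),
    ("machine loses power under load", "complaint"),
    ("fuel filter was clogged", "cause"),
    ("worn seal allowed leak", "cause"),
]

# Count how often each word appears under each label.
counts = defaultdict(Counter)
for text, label in tagged:
    counts[label].update(text.split())

def classify(text):
    """Apply the learned labels to an untagged sentence."""
    words = text.split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))

print(classify("engine loses power"))
```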
It uses WordNet as a lexical database to provide definitions for words, the Stanford Dependency Parser to parse the text, and Neo4j to find patterns and connections, build hierarchies and add ontologies.
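How these pieces fuse into one graph can be sketched as below: parser output supplies word-to-word edges, a WordNet-style lexicon supplies sense nodes, and both land in a single node/relationship structure of the kind a graph database stores. The data and the `HAS_SENSE` relationship name are illustrative assumptions.

```python
# Stand-in for WordNet glosses (word -> definition).
lexicon = {"pump": "a device that moves fluids by mechanical action"}

# Stand-in for dependency parser output: (head, relation, dependent).
parse_edges = [("leaks", "nsubj", "pump")]

nodes, edges = set(), []

# Word-to-word edges from the parse become graph relationships.
for head, rel, dep in parse_edges:
    nodes.update([head, dep])
    edges.append((head, rel.upper(), dep))

# Lexicon entries become sense nodes linked to their words.
for word in sorted(nodes):
    if word in lexicon:
        sense = f"sense:{word}"
        nodes.add(sense)
        edges.append((word, "HAS_SENSE", sense))

print(len(nodes), len(edges))
```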
Once this is all put together, users can conduct meaningful searches with simple Cypher queries.
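As an illustration, such a search might look like the following Cypher query. The `Complaint` and `Cause` labels, the `HAS_CAUSE` relationship and the `text` property are a hypothetical schema, not Caterpillar's actual data model.

```cypher
// Find the most common causes recorded for hydraulic complaints.
MATCH (c:Complaint)-[:HAS_CAUSE]->(cause:Cause)
WHERE c.text CONTAINS 'hydraulic'
RETURN cause.text, count(*) AS frequency
ORDER BY frequency DESC
```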