Novartis Captures the Latest Biological Knowledge for Drug Discovery
Novartis has amassed decades of data on how various compounds affect protein targets such as enzymes, about a billion data points in all. That historical data is critical, but it is sparse compared with the granularity of the data being collected today.
Today, Novartis uses an automated process that captures high-content image data showing how a particular compound has affected an entire cell culture. This generates terabytes of phenotypic data.
Novartis faced the challenge of combining its historical data stores with this burgeoning phenotypic data. The company also needed a way to place all of this data within the larger context of ongoing medical research from around the world.
The Novartis team wanted to combine its data with medical information from NIH’s PubMed. PubMed contains about 25 million abstracts from some 5,600 scientific journals.
The Novartis team sought a way to let researchers ask questions that connect all of this data in the context of the latest medical research.
As Stephan Reiling, Senior Scientist at Novartis, put it, “When we try to analyze this data, it becomes much more apparent that we need to have a way to store biological knowledge and then run queries against it.”
Ingesting and connecting data about diseases, genes and compounds – along with identifying the nature of the relationships between these elements – held the promise of accelerating drug discovery.
The Novartis team wanted to link genes, diseases and compounds in a triangular pattern. “For successful drug discovery, you need to be able to navigate this triangle,” explained Reiling. The Novartis team decided to create a knowledge graph stored in Neo4j, and devised a processing pipeline for ingesting the latest medical research.
Text mining is used at the beginning of the pipeline to extract relevant text data from PubMed. That data is then fed into Neo4j, along with Novartis’s own historical and image data. The data pipeline populates the 15 kinds of nodes that were devised to encode the data. The next phase fills in the relationship information that links the nodes together. The team identified more than 90 different relationships.
Novartis uses Neo4j graph algorithms to traverse the graph and identify the desired triangular node pattern linking the three classes of data. Graph analytics not only find the nodes that form this triangle, but also apply a metric the team designed to gauge the strength of association between the nodes in each triangle. Using this capability, the team devised queries that find data linked by the desired node pattern with a given association strength and then sort the triangles by this metric.
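As a rough illustration of the triangle search, the following Python sketch finds gene, disease and compound triples whose three pairwise links all exist, filters them by a minimum strength, and sorts them. The article does not describe the team's actual metric, so the score used here (the weakest of the three links) is purely an assumption, as are the entity names and strength values. In Neo4j itself this would be expressed as a graph query rather than Python loops.

```python
from itertools import product

# Hypothetical, undirected association strengths between pairs of entities.
associations = {
    ("gene_A", "disease_D"): 0.9,
    ("disease_D", "compound_X"): 0.6,
    ("compound_X", "gene_A"): 0.8,
    ("gene_B", "disease_D"): 0.4,
    ("disease_D", "compound_Y"): 0.7,
    ("compound_Y", "gene_B"): 0.5,
    ("gene_A", "disease_E"): 0.3,  # no compound link, so it closes no triangle
}

genes = {"gene_A", "gene_B"}
diseases = {"disease_D", "disease_E"}
compounds = {"compound_X", "compound_Y"}


def strength(a, b):
    """Return the association strength between a and b, or None if unlinked."""
    return associations.get((a, b), associations.get((b, a)))


def find_triangles(min_strength=0.0):
    """Return (score, gene, disease, compound) tuples, strongest first."""
    triangles = []
    for gene, disease, compound in product(
        sorted(genes), sorted(diseases), sorted(compounds)
    ):
        sides = [
            strength(gene, disease),
            strength(disease, compound),
            strength(compound, gene),
        ]
        if None in sides:
            continue  # at least one side is missing, so this is not a triangle
        score = min(sides)  # assumed metric: a triangle is as strong as its weakest link
        if score >= min_strength:
            triangles.append((score, gene, disease, compound))
    return sorted(triangles, reverse=True)


for score, gene, disease, compound in find_triangles(min_strength=0.5):
    print(f"{gene} - {disease} - {compound}: {score}")
```

Scoring by the weakest link is only one option; a product or average of the three strengths would rank triangles differently, and the choice depends on what the researchers want the ranking to emphasize.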
When researchers query the knowledge graph, the results show the strength of the association between elements. If a researcher already knows about a strong association, they might choose to investigate others, which could take their work in new directions.