Novartis Captures the Latest Biological Knowledge for Drug Discovery
Challenge
Novartis has amassed decades of data on how various compounds affect protein targets,
such as enzymes, with about a billion data points in all. That historical data is critical but
sparse compared with the fine granularity of the data currently being collected.
Today, Novartis uses an automated process that captures high-content image data showing
how a particular compound has affected an entire cell culture. This generates terabytes of
phenotypic data.
Novartis faced the challenge of combining its historical data stores with this burgeoning
phenotypic data. It also needed a way to place all this data within the larger context of
ongoing medical research from around the world.
The Novartis team wanted to combine its data with medical information from NIH’s PubMed.
PubMed contains about 25 million abstracts from some 5,600 scientific journals.
The team sought a way for researchers to ask questions that connect all of this data in the
context of the latest medical research.
As Stephan Reiling, Senior Scientist at Novartis, put it, “When we try to analyze this data, it
becomes much more apparent that we need to have a way to store biological knowledge and
then run queries against it.”
Solution
Ingesting and connecting data about diseases, genes and compounds – along with identifying
the nature of the relationships between these elements – held the promise of accelerating
drug discovery.
The Novartis team wanted to link genes, diseases and compounds in a triangular pattern.
“For successful drug discovery, you need to be able to navigate this triangle,” explained
Reiling. The Novartis team decided to create a knowledge graph stored in Neo4j, and devised
a processing pipeline for ingesting the latest medical research.
Text mining is used at the beginning of the pipeline to extract relevant text data from
PubMed. That data is then fed into Neo4j, along with Novartis’s own historical and image
data. The data pipeline populates the 15 kinds of nodes that were devised to encode the
data. The next phase fills in the relationship information that links the nodes together. The
team identified more than 90 different relationships.
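The two-phase load described above (nodes first, relationships second) can be sketched in a
few lines of Python. This is a minimal illustration under stated assumptions, not the Novartis
pipeline: the KnowledgeGraph class is a hypothetical in-memory stand-in for Neo4j, and the
identifiers and relationship names (TARGETS, ASSOCIATED_WITH, TREATS) are invented for the
example. Only the Gene, Disease and Compound categories come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Tiny in-memory stand-in for a graph database (illustration only)."""
    nodes: dict = field(default_factory=dict)  # node id -> label
    edges: list = field(default_factory=list)  # (source id, relationship type, target id)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, source, rel_type, target):
        # Phase 2 relies on phase 1: both endpoints must already exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError(f"unknown endpoint in {source} -[{rel_type}]-> {target}")
        self.edges.append((source, rel_type, target))


def load(graph, node_records, edge_records):
    # Phase 1: populate the nodes.
    for node_id, label in node_records:
        graph.add_node(node_id, label)
    # Phase 2: fill in the relationships that link the nodes together.
    for source, rel_type, target in edge_records:
        graph.add_edge(source, rel_type, target)
    return graph


# Hypothetical records; none of these identifiers come from Novartis data.
node_records = [
    ("gene:G1", "Gene"),
    ("disease:D1", "Disease"),
    ("compound:C1", "Compound"),
]
edge_records = [
    ("compound:C1", "TARGETS", "gene:G1"),
    ("gene:G1", "ASSOCIATED_WITH", "disease:D1"),
    ("compound:C1", "TREATS", "disease:D1"),
]
graph = load(KnowledgeGraph(), node_records, edge_records)
```

Loading nodes before relationships means every relationship can be checked against endpoints
that already exist, which is one plausible reason to order the phases this way.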
Novartis uses Neo4j graph algorithms to traverse the graph and identify a desired triangular
node pattern linking the three classes of data together. Graph analytics not only find relevant
nodes in the desired triangular relationship, but also employ a metric the team designed to
gauge the strength of association among the nodes in each triangle. Using this capability, the
team devised queries to find data linked by the desired node pattern, with a given association
strength, and then sort the triangles according to this metric.
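The shape of such a query (match the triangle, filter by association strength, sort by the
metric) can be sketched in plain Python. The metric Novartis designed is not described in the
source, so this example assumes a simple stand-in: a triangle is scored by its weakest edge.
The node names, edge strengths, and the functions triangle_score and find_triangles are all
hypothetical, and a graph database would match the pattern directly rather than trying every
combination as this sketch does.

```python
from itertools import product

# Hypothetical node labels; only the three categories come from the text.
labels = {
    "G1": "Gene", "G2": "Gene",
    "D1": "Disease",
    "C1": "Compound", "C2": "Compound",
}

# Hypothetical association strengths in [0, 1], keyed by unordered node pair.
strength = {
    frozenset(("G1", "D1")): 0.9,
    frozenset(("D1", "C1")): 0.7,
    frozenset(("C1", "G1")): 0.8,
    frozenset(("G2", "D1")): 0.4,
    frozenset(("D1", "C2")): 0.6,
    frozenset(("C2", "G2")): 0.5,
}


def triangle_score(gene, disease, compound):
    """Assumed metric: a triangle is only as strong as its weakest edge.

    Returns None when any of the three edges is missing.
    """
    pairs = [(gene, disease), (disease, compound), (compound, gene)]
    weights = [strength.get(frozenset(pair)) for pair in pairs]
    if any(weight is None for weight in weights):
        return None
    return min(weights)


def find_triangles(min_score=0.0):
    """Return (score, gene, disease, compound) tuples, strongest first."""
    by_label = {}
    for node, label in labels.items():
        by_label.setdefault(label, []).append(node)
    results = []
    # Brute force over every gene/disease/compound combination.
    for gene, disease, compound in product(
        by_label.get("Gene", []),
        by_label.get("Disease", []),
        by_label.get("Compound", []),
    ):
        score = triangle_score(gene, disease, compound)
        if score is not None and score >= min_score:
            results.append((score, gene, disease, compound))
    return sorted(results, reverse=True)


for score, gene, disease, compound in find_triangles(min_score=0.5):
    print(f"{gene} - {disease} - {compound}: {score}")
```

Raising or lowering min_score corresponds to asking for triangles with a given association
strength; the sort then puts the strongest candidates at the top of the list.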
When researchers query the knowledge graph, results show the strength of the correlation
between elements. If a researcher already knows about a strong correlation, they
might choose to investigate others, which could take their work in new directions.