A graph data model faithfully mirrors real-world structures, right down to the molecular level. A graph database like Neo4j enables data scientists and chemists to analyze molecules to support drug development.
In this week’s five-minute interview (conducted at GraphTour NYC 2019), we speak with Matthew Sellwood, Product Manager at IQVIA, about his research into molecular structure using Neo4j.
How do you use Neo4j?
So this mainly comes into my hobby area, which is chemistry. In chemistry, molecules are actually graphs themselves: the atoms are the nodes and the bonds are the edges. The best way of modeling them is to put them in a graph database, because you can then model the ways that the atoms and the bonds break up, and see how the molecules change as you change different parts and alter the graph of the molecule, which is sometimes called graph mutation.
When we mutate the graph, we look at which properties of the molecule change. That is best put into Neo4j because we can then start to explore those interactions. We can ask things like: when I take this atom off, what happens to this property of the molecule? This is particularly useful in areas like drug discovery, where you’re looking for ways to improve the solubility of your molecule, because when people take your drug, it should dissolve in their stomach fluid so it gets naturally distributed throughout their body. This means it needs to be soluble in water. If you want to improve solubility, you can ask: if I take this part of the graph away, what happens?
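The idea of mutating a molecular graph and re-checking a property can be sketched in a few lines of plain Python. This is only an illustration, not the schema or code used in the project: the `Molecule` class, the `without_atom` method and the `polar_atom_count` function are hypothetical names, and counting oxygen and nitrogen atoms is a crude placeholder, not a real solubility model.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class Molecule:
    """Toy molecular graph: atoms are nodes, bonds are undirected edges."""
    atoms: Dict[int, str] = field(default_factory=dict)      # atom id -> element symbol
    bonds: Set[FrozenSet[int]] = field(default_factory=set)  # each bond is a pair of atom ids

    def add_atom(self, atom_id: int, element: str) -> None:
        self.atoms[atom_id] = element

    def add_bond(self, a: int, b: int) -> None:
        self.bonds.add(frozenset((a, b)))

    def without_atom(self, atom_id: int) -> "Molecule":
        """Return a mutated copy with one atom and all of its bonds removed."""
        atoms = {i: e for i, e in self.atoms.items() if i != atom_id}
        bonds = {b for b in self.bonds if atom_id not in b}
        return Molecule(atoms, bonds)


def polar_atom_count(mol: Molecule) -> int:
    """Hypothetical stand-in property: number of O and N atoms.

    This is NOT a real solubility model; it only gives the mutation
    something measurable to change.
    """
    return sum(1 for element in mol.atoms.values() if element in {"O", "N"})


# Heavy-atom skeleton of ethanol, C-C-O (hydrogens omitted for brevity).
ethanol = Molecule()
ethanol.add_atom(1, "C")
ethanol.add_atom(2, "C")
ethanol.add_atom(3, "O")
ethanol.add_bond(1, 2)
ethanol.add_bond(2, 3)

# "Take this atom off" and compare the property before and after.
mutated = ethanol.without_atom(3)
change = polar_atom_count(mutated) - polar_atom_count(ethanol)
```

In a graph database the same question becomes a query over stored atoms and bonds rather than an in-memory copy, but the before-and-after comparison is the same.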
Because molecules have lots of similar atoms and subgroups, you have sub-graphs of molecules as well. You can actually start to model things like, if 10 molecules are very similar and they all have the same sub-group, what happens if I change the same sub-group of those 10 molecules?
And you can start to develop these really large networks of molecules and see that, even though the molecules are very similar, they differ in some of their properties. You can learn quite a lot just by breaking molecules down into their smaller sub-components.
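One way to picture molecules sharing sub-groups is to reduce each molecule to a set of fragment labels and index the fragments. The sketch below is an assumption-laden toy: the molecule names, the fragment labels and the functions `fragment_index` and `single_swap_pairs` are all made up for illustration, and real fragmentation of a molecule is far more involved than a set of strings.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each molecule is reduced to a set of fragment labels.
molecules = {
    "mol_a": {"phenyl", "hydroxyl"},
    "mol_b": {"phenyl", "methyl"},
    "mol_c": {"phenyl", "hydroxyl", "amine"},
}


def fragment_index(mols):
    """Map each fragment to the set of molecules that contain it."""
    index = defaultdict(set)
    for name, fragments in mols.items():
        for fragment in fragments:
            index[fragment].add(name)
    return dict(index)


def single_swap_pairs(mols):
    """Find pairs of molecules that differ by exactly one fragment on each side.

    Each result is (molecule_1, molecule_2, fragment_only_in_1, fragment_only_in_2).
    """
    pairs = []
    for (a, frags_a), (b, frags_b) in combinations(sorted(mols.items()), 2):
        only_a = frags_a - frags_b
        only_b = frags_b - frags_a
        if len(only_a) == 1 and len(only_b) == 1:
            pairs.append((a, b, next(iter(only_a)), next(iter(only_b))))
    return pairs


index = fragment_index(molecules)
pairs = single_swap_pairs(molecules)
```

Pairs found this way are the natural place to ask what happens to a property when one sub-group is exchanged for another while the rest of the molecule stays the same.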
How did people solve this problem before graphs?
Traditionally this data was stored in SQL databases and tables, but that makes it very difficult to see the relationships when you change one part of a molecule.
In chemistry there have been lots of different attempts to model these relationships over the years, and the most successful approaches put them into graph-like structures. But these efforts tended to be quite limited because they were smaller in scope. They were visualization approaches. People have created card views of molecules, where you have the properties on a card and then some pieces of software connect the cards together. You can see these are a bit like nodes: you just have cards connected with edges. But this doesn’t scale very well, because the data structures underneath were still based on SQL. Now, you can put all this data into your graph and you can start to build those visualizations up a lot quicker.
What have been some of the most surprising results you’ve experienced working with graphs?
It comes back to the properties of drug molecules and one of the properties I was looking at in the graph is permeability across membranes. You need your drug molecule to be able to get into your cells as well as to be able to be digested.
When you use the Graph Data Science library, you can actually look at which fragments of the molecule are most highly connected to other fragments and which give you the biggest change in that property. You find some really surprising things doing this. You learn some things that a chemist might tell you, but there are other things that the chemist might not otherwise have seen.
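Finding the "most highly connected" fragments amounts to a degree count over a fragment graph. The following is a plain-Python stand-in for that idea, not the Graph Data Science library's API: the edge list is invented for illustration and `degree_ranking` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical fragment-to-fragment links, for example fragments that
# appear together or that can be swapped for one another.
edges = [
    ("phenyl", "hydroxyl"),
    ("phenyl", "methyl"),
    ("phenyl", "amine"),
    ("hydroxyl", "amine"),
]


def degree_ranking(edge_list):
    """Count how many links touch each fragment, most connected first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    degree = Counter()
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


ranking = degree_ranking(edges)
most_connected = ranking[0][0]
```

The ranking alone only says which fragments are hubs; pairing it with the observed change in a property such as permeability is what points to the fragments worth a chemist's attention.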
What advice do you have for people who are getting started with graphs?
I would say just give it a go. Maybe have a think about the schema before you use it. The schema is how you represent the data in the graph for your use case. That was one of the things that took the longest time for me to work out. I went through five or six different iterations of a graph schema before I found one that actually made sense and was performant as well. In one schema I added far too many edges and then realized, “Okay, maybe this is not sensible.”
My advice would be to think carefully about the schema, try it out with small datasets and iterate until you find something that scales properly.