Tell us a little bit about your role and your current project with Neo4j at Monsanto.
I’m a data scientist at Monsanto. My role focuses on coming up with clever ways to help both myself and other people at the company get better inferences out of our genomic datasets.
Can you tell us a little bit about what you mean by genomic datasets?
Our company develops seed products that are sold in 160 different countries, and we back those products with extensive R&D pipelines designed to make sure that we develop the best possible seed to be grown in the best possible environment. And we do that by gathering a lot of data.
In that process, we gather a lot of genomic sequence data, and we crunch a lot of data, and we use that to try to make better selections on what products should move forward, to continually improve the genetic performance of our seed products.
Your talk today had the title, “Graphs are Feeding the World.” Tell us a little bit more about what you covered.
If you look at where we’re going to be a few decades from now, there are a lot of good estimates produced by the UN that suggest we’re going to have a world population of over 9.5 billion people by 2050. And those 9.5 billion people compete for a lot of resources. So, it’s more people to feed, to start.
As those people expand into cities and leave behind rural lifestyles, that expanding footprint comes at the expense of arable farmland, and that growing middle class has also acquired a taste for diets that are higher in animal-based protein. All of these things require significant intensification of agricultural production to sustain.
The one way that humanity has become really good at that is by continually developing better agricultural products. We know that over thousands of years we’ve been able to produce crops that have gone from tiny to much larger today – I used examples in my talk of really tiny ancestor plants that you would never think of from 10,000 years ago to the modern corn that we all know.
Inside of our pipeline we do the same thing at a much larger scale, so we’re continually looking at what are the best possible plant varieties out there in the world? What genetic traits do they have that make them, say, grow very well or need less water or need less fertilizer, or withstand really dry air conditions?
We cross those plants together and screen their progeny genetically to determine which ones we want to keep, and that’s how we determine what goes into the next bag of seeds that we sell a farmer. And every year that we do that, every product grows a little bit more and takes a little bit less to grow, and that’s how we move the needle.
Can you tell me about the problem that you were solving and how you decided on Neo4j?
There are a lot of really cool genetic problems that have been in existence for a while that rely on you being able to treat your dataset as an ancestral family tree. So our datasets are no different than the dataset that you might get if you signed up for ancestry.com and looked at your family tree. Same thing, just a lot more relatives.
But analyzing those family trees quickly became untenable.
We’ve been a plant breeding company for about 20 years, and we inherited datasets going back 60 years. For a while, we dealt with that dataset in a very classical Oracle relational database mindset, a very classical outlook for a company of our size.
But the problem is that every question we wanted to ask, as we tried to employ more modern data analysis techniques, needed lots of real-time analysis to be run. It would take us seconds to minutes to hours to perform one round of analysis, and that doesn’t scale.
We’re fond of saying that you can develop the coolest algorithm in the world, but if you can’t run it over the scale of an entire pipeline, it doesn’t move the needle for us at all.
But then we discovered that our dataset naturally matches a graph. It was really easy to model into a graph, and therefore, really easy to write these algorithms into graph queries.
Once we did, analysis that used to take minutes or hours took seconds. This was really cool because then we could do it across everything. So I could ask the same question of several million objects instead of one at a time. That freed us up to make some really cool abstractions around important genetic features.
For example, what are our most popular ancestors? What ancestors continually form the crosses that are the most productive? Those queries tell us something about which plant varieties are the most productive, and that means that we want to spend more time researching those and less time researching the duds.
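As a rough illustration of the "most popular ancestors" question (a sketch only, not Monsanto's actual code or data model), the pedigree can be treated as a graph and each ancestor counted by how many varieties descend from it. The `PARENTS` mapping, the variety names, and both function names are hypothetical, and plain Python dictionaries stand in for the graph database.

```python
from collections import defaultdict

# Hypothetical pedigree: each variety maps to its two parents.
PARENTS = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "F"),
    "G": ("D", "E"),
}


def ancestors(variety, parents=PARENTS):
    """Return the set of all ancestors of a variety (iterative graph walk)."""
    seen = set()
    stack = list(parents.get(variety, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def ancestor_popularity(parents=PARENTS):
    """Rank ancestors by how many varieties in the pedigree descend from them."""
    counts = defaultdict(int)
    for variety in parents:
        for ancestor in ancestors(variety, parents):
            counts[ancestor] += 1
    # Most descendants first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for name, descendants in ancestor_popularity():
        print(name, descendants)
```

In a graph database the same question becomes a single traversal over parent relationships, which is what makes it practical to ask across millions of varieties at once.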
So then how did you find Neo4j?
We found Neo4j through some simple mailing lists and Google searches. After we kicked the tires a bit, we ended up doing a more extensive evaluation of Neo4j against Titan.
We settled upon Neo4j for reasons related to two main things:
One, our analysis is heavily real-time dependent, so we need to be able to serve up these queries really, really quickly. This meant that horizontal scaling across identical copies of the data and guaranteed index-free adjacency were a big deal.
Second, our dataset is heavily factual-based, and the example that I use for that is if you consider the classical graph problem – the social networking problem – and you want to split your graph across a bunch of nodes and recommend new friends for yourself, you probably wouldn’t notice if some of your friends were missing from the network. You would never notice it.
But, in our case, if we’re trying to run a genetic analysis, and I give two different data scientists different answers for the parents, we’re going to have a problem. It was that guarantee (of returning all of the info) that we had on Neo4j that we didn’t have on another platform. Our ACID consistency model was such that we needed those guarantees.
What kind of results has Neo4j brought you?
What we ended up doing with Neo4j in this particular case is we built a platform around it, in the true sense of the word. We have this wonderful Neo4j cluster with all this genetic history data. We built a really rich API over it that allows our geneticists and our app developers to express and execute complex algorithms in really simple terms. And that has created an ecosystem at our company.
That system has been in production for a little over two years now; it launched right before GraphConnect 2014. There are now about 120 to 130 different applications and data scientists that have built stuff over it. So my team acts as a force multiplier: those are people who are taking our datasets and building their own applications and their own data science routines out of them.
That group of people has made about 700 million REST requests so far. While those numbers wouldn’t be particularly staggering if we were an external-facing web company, that’s all internal data science being done, and that’s been very transformative for us.
Is Neo4j standing alone in this project, or did you keep the legacy infrastructure Neo4j is working with? How did you deal with all that?
That’s a fantastic question because I think that one of our challenges is that ultimately we’re paid to keep the lights on. I like to say that we built a really sexy, awesome data store, and we gave people access to it, but the reality is that it would not be real time if the data was stale the minute we turned it on.
We’re in a situation in our company where we do have a lot of big data, as in large data monoliths with a lot of important but old applications writing to them, and we can’t simply go to those development teams and say, “You have five days to convert over to writing your data to us.” I don’t know if you’ve ever dealt with high-strung project managers, but that wouldn’t play very well. So the way that we worked around it is we actually let those apps live for a while.
We’ve slowly started moving all of our new apps over to read from Neo4j, and then to handle the write situation, we actually built a streaming pipeline from our Oracle Exadata monolith out to Neo4j. And we accomplished that using a little bit of Oracle technology, so we utilized their GoldenGate platform to broadcast a change stream, and we wrote a custom coded adaptor that feeds that change stream into an Apache Kafka cluster.
We used that Kafka cluster to feed our graph, and the net effect of that is within a second or two of a write from a legacy app coming into Exadata, with no work done from that team, we reflect that update in the graph database.
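The last step of that pipeline, applying a change stream to the graph, can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual adaptor: the event shape (`op`, `child`, `parents`) is hypothetical, a plain list stands in for the Kafka topic, and a dictionary stands in for the graph database.

```python
def apply_change(graph, event):
    """Apply one change event to an in-memory pedigree graph.

    The graph maps each child to the set of its parents. Upserts overwrite
    the whole parent set, so replaying the same event twice is harmless.
    """
    op = event["op"]
    child = event["child"]
    if op == "upsert":
        graph[child] = set(event["parents"])
    elif op == "delete":
        graph.pop(child, None)
    else:
        raise ValueError(f"unknown op: {op}")


def replay(events, graph=None):
    """Apply events in order and return the resulting graph."""
    graph = {} if graph is None else graph
    for event in events:
        apply_change(graph, event)
    return graph


if __name__ == "__main__":
    # Hypothetical change stream, in the order the legacy database emitted it.
    stream = [
        {"op": "upsert", "child": "C", "parents": ["A", "B"]},
        {"op": "upsert", "child": "D", "parents": ["A", "X"]},
        {"op": "upsert", "child": "D", "parents": ["A", "B"]},  # correction
        {"op": "delete", "child": "C"},
    ]
    print(replay(stream))
```

Making each update idempotent is one common design choice for this kind of stream, since it lets a consumer safely reprocess events after a restart.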
How have you found it to develop in Neo4j, particularly in a somewhat complicated environment?
Very well. One thing I like to say is that at least for our domain because it maps so well to a graph structure, there’s no impedance mismatch.
If you were to look at our code base and the algorithms that we’ve written, side by side with any genetics textbook you would find in a genetics Ph.D. program, all the patterns that we’ve coded algorithms for show up natively in those textbooks. They’re known patterns of thought, which means it’s really easy to express them in a language that matches both my mental model and the model of those who are going to consume them. It’s like having a well-fitting library for the job.
What lessons have you learned? What would you do differently if you could begin again?
I really, really wish that we hadn’t had to deal with the sync problem. It was harder to build up that streaming pipeline than it was to build the graph database itself. But that was a good lesson in and of itself that you can’t get around.
One thing that I think that I would do over, or rather that I think we’ve learned a lot from, is the way that we went about modeling and coding things. I think that we definitely learned that right away in our first trip to the whiteboard. We thought that we had it all figured out, but it wasn’t an efficient graph model.
The beautiful, yet complex, thing about graph data modeling is that there’s no right way to do it except when you pick the wrong way. So we went through a good month where we thought we had a good model, then we would benchmark it, realize it was not performing very well, and have to iterate over and over. Looking back, I would have allowed more time for that iteration, because it was very valuable for understanding our access patterns.
What else? The same thing for our traversals. I think we took a very iterative approach to developing our algorithms as far as what we did via Cypher, and what we did via the traversal framework. And I think I would do that all over again. But this time I would know what I know now about how to be very methodical about doing that.
Is there anything else you’d like to add or say?
I’m a scientist by training – a life scientist – and I’ve been fascinated by the number of neat use cases for graph analysis and graph databases in the life sciences that no one’s working on.
In one part of my presentation, I showed some neat examples from literature mining that I’ve done. Going all the way back to 1921, there was this beautiful textbook illustration of someone showing a family tree of chickens, actually, where he had mapped out a graph-based algorithm for saying, “If these are all my ancestors, this is how each one contributed to me genetically.”
That was cool because in the 1920s we had no idea what modern molecular biology was, and, more importantly, there was no graph database. But he was describing an algorithm that, it turns out, actually works really well. Right now, we’re trying to fork that over into our Neo4j instance. That’s really cool, I think.
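One standard way to express "how each ancestor contributed to me genetically" is to assume each parent passes on half of its genome on average and to sum those halves along every path in the family tree. The sketch below shows that idea only; it is not necessarily the method from the 1921 textbook, and the `PEDIGREE` data and names are hypothetical.

```python
from fractions import Fraction

# Hypothetical pedigree: "grandma" appears on both sides of the family.
PEDIGREE = {
    "child": ("mother", "father"),
    "mother": ("grandma", "grandpa"),
    "father": ("grandma", "other"),
}


def contributions(individual, pedigree=PEDIGREE):
    """Expected genetic contribution of every ancestor to an individual.

    Assumes each parent contributes exactly half on average, so an ancestor's
    share is the sum of (1/2) ** path_length over all paths to the individual.
    """
    result = {}

    def walk(node, share):
        for parent in pedigree.get(node, ()):
            parent_share = share / 2
            result[parent] = result.get(parent, Fraction(0)) + parent_share
            walk(parent, parent_share)

    walk(individual, Fraction(1))
    return result


if __name__ == "__main__":
    for ancestor, share in sorted(contributions("child").items()):
        print(ancestor, share)
```

Because `grandma` is reached through both parents, her two quarter shares add up to one half, which is the kind of multi-path effect that a graph traversal captures naturally.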
I would like to see more developers with this skill set get really excited about working in a problem space that they might have considered off limits to them. The reality is that it’s not at all. The barrier to entry is very low.