Graph Database Neo4j is the Root of the Tree of Life

A new initiative aims to build a comprehensive tree of life that brings together everything scientists know about how all species are related, from the tiniest bacteria to the tallest tree. Researchers are working to provide the infrastructure and computational tools to enable automatic updating of the tree of life, and to develop the analytical and visualization tools to study it. Scientists have been building evolutionary trees for more than 150 years, since Charles Darwin drew the first sketches in his notebook. Our group faces the challenge of gathering scattered research on some two million known species, placing them and their associated data on a single evolutionary tree of life, and then providing a way for researchers around the world to add new species (hence the “Open” in our name). Traditional data storage and software are simply not up to the task. With any number of scientists and researchers contributing their work to the Open Tree, standardization is key. The largest existing evolutionary trees contain around 100,000 species. Creating a system to include twenty times that number means rethinking how data is stored and what types of storage are used.

     This is no simple task. For instance, what criteria should be used to distinguish species? Is it genetic material? Is it variation in certain features? For species that are extinct, like early mammals or dinosaurs, we don’t have enough genetic material to say for certain where those species fit in the tree. Does morphology (the animal’s form or shape) then come into play? Terminology must be standardized, too. One research team might store its data under one scientific name, while another might use a completely different one.

     As we work to resolve these issues, a key factor in successfully assembling a comprehensive tree is finding unique and intuitive ways to illustrate relationships between the species. Rick Ree (Field Museum of Natural History, University of Chicago), Stephen Smith (University of Michigan), and Mark Holder (University of Kansas) from our group are using something called “graph database” technology to organize the big data associated with the tree.

     A graph database can better highlight how one species is connected to another. Facebook and Twitter use graph databases. Earlier forms of data storage were very limiting. For instance, if you wanted to store information about your very interesting friend “Linda,” you were restricted to the data fields provided, such as her email, phone number, address, interests, and the like. If you had another friend, “Dave,” his information would be limited to the same data fields and isolated from Linda’s.

     However, as we’ve seen in social media, there are new and exciting ways to store all sorts of data types, and to connect them with each other. So now, not only can you access Dave’s and Linda’s information, you can also see how your other friends are related to them, and they can see their shared relationships, too.
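     The difference between isolated records and connected ones can be sketched in a few lines of Python. This is only an illustration of the idea, not how Facebook, Twitter, or Neo4j store data; the class FriendGraph, the extra friends “Ana” and “Sam,” and the email addresses are all made up for the example.

```python
from collections import defaultdict


class FriendGraph:
    """A tiny in-memory graph: people are nodes, friendships are relationships."""

    def __init__(self):
        self.properties = {}             # person -> that person's data fields
        self.friends = defaultdict(set)  # person -> people they are connected to

    def add_person(self, name, **fields):
        # The data fields stay with the node, as in a traditional record.
        self.properties[name] = fields

    def add_friendship(self, a, b):
        # Friendship is mutual, so record the relationship in both directions.
        self.friends[a].add(b)
        self.friends[b].add(a)

    def shared_friends(self, a, b):
        # Friends that a and b have in common.
        return self.friends[a] & self.friends[b]


graph = FriendGraph()
graph.add_person("Linda", email="linda@example.com")  # hypothetical address
graph.add_person("Dave", email="dave@example.com")    # hypothetical address
graph.add_person("Ana")  # hypothetical extra friend
graph.add_person("Sam")  # hypothetical extra friend

graph.add_friendship("Linda", "Ana")
graph.add_friendship("Dave", "Ana")
graph.add_friendship("Linda", "Sam")

print(graph.shared_friends("Linda", "Dave"))  # {'Ana'}
```

     Linda’s and Dave’s records are still there, but the friendships now live in the data itself, so a question like “whom do they both know?” becomes a direct lookup.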

     This type of approach to data management is perfect for the Open Tree of Life. With some editing and specializing of the types of data stored, the nodes and relationships that worked so well in social media will now be able to store information about the different species. Looking up a species of bear will not just result in information about that bear, but also similar bear species, its most recent common ancestor with those other species, and even connections to the very first bear species.
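     As a rough sketch of that kind of lookup, the Python below stores a handful of bears as nodes, each with a single “child of” relationship, and walks those relationships to find a most recent common ancestor. It is a simplified illustration, not the Open Tree of Life’s actual data model or its Neo4j queries: the dictionary parent_of, the two function names, and the use of genus and family names as stand-ins for ancestors are all assumptions made for the example.

```python
# Each node points to its parent; the root of this small tree has no parent.
# Genus and family names stand in for ancestral nodes (a simplification).
parent_of = {
    "polar bear": "Ursus",
    "brown bear": "Ursus",
    "giant panda": "Ailuropoda",
    "Ursus": "Ursidae",
    "Ailuropoda": "Ursidae",
    "Ursidae": None,
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return path


def most_recent_common_ancestor(a, b):
    """Return the first node on b's lineage that also lies on a's lineage."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    return None  # only reached if the two nodes share no root


print(lineage("polar bear"))
# ['polar bear', 'Ursus', 'Ursidae']
print(most_recent_common_ancestor("polar bear", "brown bear"))
# Ursus
print(most_recent_common_ancestor("polar bear", "giant panda"))
# Ursidae
```

     Because every answer comes from following relationships rather than scanning separate records, the same walk works whether the tree holds six nodes or millions, although a real graph database would index and traverse the data far more efficiently than this dictionary does.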

     These big data methods also make it much easier for future species to be added to the tree. With over ten million species left to discover and identify, the ability to expand the Open Tree is critical. Graph database technology will allow researchers and scientists to make those changes easily, without compromising the rest of the tree.

Neo4j is the graph database that forms the back end of the Open Tree of Life.