Graph Databases in Life Sciences Workshop
As biotechnology is one of the hot topics of the century and graph databases are on the rise in this decade, we thought it would be a good idea to bring researchers and bioinformatics developers together for a workshop about the applicability of graph databases in biological research and application. Fortunately, Prof. Lennart Martens, a group leader in the Department of Medical Protein Research at VIB and Ghent University, offered to host the workshop. So Neo Technology’s Rik Van Bruggen and Lennart Martens organized the workshop and invited a host of attendees from a variety of backgrounds.
26 participants found their way to the picturesque meeting hall of the University of Ghent (a former monastery) to enjoy a full day packed with presentations, discussions and a hands-on workshop. We were greeted by a life-size poster of the metabolic interaction pathways in humans.
After the introduction by Lennart and Rik, I ran a quick intro to NoSQL, and graph databases in particular, covering their applicability in a wide range of fields, with some reference to existing biotech applications.
Thilo Muth, who is doing his PhD with Lennart, works in the area of metaproteomics, an interesting technique for mapping protein fragments to potential bacterial targets and creating meta-proteins from matching groups. He introduced the topic and showed how they used graph-oriented data models to reason about potential mappings.
Pablo Pareja of Oh no sequences! presented Bio4j, an open-source research database (and platform) that integrates many different sources of protein, genome and taxonomy information. Bio4j also runs on Neo4j and currently holds almost 1 billion relationships. (Slides 1, 2, 3)
In the time until lunch, I answered some questions about Neo4j, especially about the roadmap and scaling, and we highlighted some visualization approaches such as Gephi, Cytoscape and HivePlots.
During the breaks and over lunch we had lots of interesting discussions about life sciences in general, working with scientists, and particular data management problems.
After lunch, Anthony Liekens presented biograph.be, a knowledge discovery system for finding relevant information in the area of life science, e.g. proteins in reactions ranked by their publication relevance. The system employs a PageRank algorithm that is implemented using matrix multiplication on a parallel processing system.
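To give a feel for what "PageRank via matrix multiplication" means, here is a minimal single-machine sketch in plain Python. It is not the biograph.be implementation: the function name `pagerank`, the adjacency-list input format, the damping factor of 0.85 and the handling of nodes without outgoing links are all my own assumptions for illustration. The core idea is simply to multiply a transition matrix with the current rank vector over and over until the ranks stop changing.

```python
def pagerank(adjacency, damping=0.85, iterations=100, tol=1e-10):
    """Rank nodes by repeated matrix-vector multiplication (power iteration).

    adjacency[i] is the list of node indices that node i links to.
    All parameter names and defaults are illustrative assumptions.
    """
    n = len(adjacency)

    # Column-stochastic transition matrix: matrix[j][i] is the probability
    # of moving from node i to node j.
    matrix = [[0.0] * n for _ in range(n)]
    for i, targets in enumerate(adjacency):
        if targets:
            for j in targets:
                matrix[j][i] += 1.0 / len(targets)
        else:
            # Assumption: a node without outgoing links spreads its rank evenly.
            for j in range(n):
                matrix[j][i] = 1.0 / n

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # One multiplication step, plus the damping ("random jump") term.
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(matrix[j][i] * rank[i] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Tiny made-up graph: nodes 0 and 1 both link to node 2, node 2 links to 0.
    print(pagerank([[2], [2], [0]]))
```

Each row of the multiplication can be computed independently of the others, which is presumably what makes this formulation a good fit for a parallel processing system.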
Davy Suvee of Janssen Pharmaceutica and datablend.be presented different graph database use cases from his experience at a big pharmaceutical company. He closed the presentation with an intro to FluxGraph, a time-traveling graph implementation on top of Datomic.
Thilo then introduced the topic of the workshop “Graph Databases in Life Science” and the “Reactome” database of human protein interaction pathways. He discussed some Neo4j APIs and how they can be used to import the data from flat CSV files into a graph database. The attendees set up their development environment with the Neo4reactome project that we prepared upfront and ran the import successfully.
After importing the data we looked at some use cases, first visualizing pathways in the Neo4j Web-UI and then running several queries using Neo4j’s query language Cypher to find certain proteins (HBA and HBB) and their interaction pathways.
An example task looked like this:
Find the common pathways of HBA and HBB
Both proteins should be involved in particular pathways, which should be easy to find by querying. Now we want to retrieve only the pathways that both proteins have in common.

START proteinA=node:proteins(accession = "P69905"),
proteinB=node:proteins(accession = "P68871")
MATCH (proteinA)-[:INVOLVED_IN]->(pathway)<-[:INVOLVED_IN]-(proteinB)
RETURN pathway
Results
- Metabolism
- O2/CO2 exchange in erythrocytes
- Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes
- Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
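If the Cypher pattern looks unfamiliar: the MATCH clause keeps a pathway only when both proteins have an INVOLVED_IN relationship to it, which boils down to intersecting two sets. The following Python sketch mimics that on toy data. The two accessions and the four shared pathway names are taken from the query and results above; the two "example-only" pathway names and the helper `common_pathways` are hypothetical and exist only to show that non-shared pathways get filtered out.

```python
# Toy stand-in for the INVOLVED_IN relationships in the graph.
# The "example-only ..." entries are hypothetical placeholders, not Reactome data.
SHARED = {
    "Metabolism",
    "O2/CO2 exchange in erythrocytes",
    "Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes",
    "Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes",
}
INVOLVED_IN = {
    "P69905": SHARED | {"example-only pathway for proteinA"},
    "P68871": SHARED | {"example-only pathway for proteinB"},
}


def common_pathways(involved_in, accession_a, accession_b):
    """Return the pathways both proteins are involved in (set intersection)."""
    pathways_a = involved_in.get(accession_a, set())
    pathways_b = involved_in.get(accession_b, set())
    return pathways_a & pathways_b


if __name__ == "__main__":
    for name in sorted(common_pathways(INVOLVED_IN, "P69905", "P68871")):
        print("-", name)
```

The graph database does the same thing by traversing relationships from both start nodes, so there is no need to load whole sets into memory first.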