Graph databases in life sciences workshop

November 19, 2012

Update: We published the Proceedings of the last Life Sciences and Healthcare workshop in Berlin.

Biotechnology is one of the hot topics of the century, and graph databases are on the rise this decade, so we thought it would be a good idea to bring researchers and bioinformatics developers together for a workshop on the applicability of graph databases in biological research and applications.

Fortunately, Prof. Lennart Martens, a group leader in the Department of Medical Protein Research at VIB and Ghent University, offered to host the workshop. Neo Technology’s Rik Van Bruggen and Lennart Martens organized it and invited attendees from a variety of backgrounds.

Twenty-six participants found their way to the picturesque meeting hall of the University of Ghent (a former monastery) to enjoy a full day packed with presentations, discussions and a hands-on workshop. We were greeted by a life-size poster of the metabolic interaction pathways in humans.

After the introduction by Lennart and Rik, I gave a quick intro to NoSQL, and to graph databases in particular, covering their applicability in a wide range of fields with some references to existing biotech applications.

Thilo Muth, who is doing his PhD with Lennart, works in the area of metaproteomics, a technique that maps protein fragments to potential bacterial targets and groups matching proteins into meta-proteins. He introduced the topic and showed how they used graph-oriented data models to reason about potential mappings.

Pablo Pareja of Oh no sequences! presented Bio4j, an open-source research database (and platform) that integrates many different sources of protein, genome and taxonomy information. Bio4j runs on Neo4j and currently holds almost 1 billion relationships. (Slides 1, 2, 3)

In the time until lunch I answered questions about Neo4j, especially about the roadmap and scaling, and we highlighted some visualization approaches such as Gephi, Cytoscape and HivePlots.

During the breaks and over lunch we had lots of interesting discussions about life sciences in general, working with scientists and particular data management problems.

After lunch, Anthony Liekens presented biograph.be, a knowledge discovery system for finding relevant information in the life sciences, e.g. proteins in reactions ranked by their publication relevance. The system employs a PageRank algorithm that is implemented using matrix multiplication on a parallel processing system.
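
To make the idea of PageRank via matrix multiplication concrete, here is a minimal single-machine sketch in plain Python. It is not biograph.be's implementation, which was not shown in this detail; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the toy node names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes by repeated matrix-vector multiplication (power iteration).

    links maps each node to the list of nodes it points to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)

    # Column-stochastic matrix: matrix[i][j] is the probability of
    # stepping from node j to node i.
    matrix = [[0.0] * n for _ in range(n)]
    for source in nodes:
        targets = links.get(source, [])
        j = index[source]
        if targets:
            for target in targets:
                matrix[index[target]][j] += 1.0 / len(targets)
        else:
            # A node without outgoing links spreads its rank uniformly.
            for i in range(n):
                matrix[i][j] = 1.0 / n

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # One multiplication of the matrix with the current rank vector,
        # mixed with a uniform "random jump" term.
        rank = [
            (1.0 - damping) / n
            + damping * sum(matrix[i][j] * rank[j] for j in range(n))
            for i in range(n)
        ]
    return dict(zip(nodes, rank))


# Toy graph with made-up node names.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(toy_links)
```

Each iteration is one matrix-vector product, which is the part that lends itself to a parallel system: the rows of the product can be computed independently of each other.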

Davy Suvee of Janssen Pharmaceutica and datablend.be presented different graph database use cases from his experience at a big pharmaceutical company. He closed the presentation with an intro to FluxGraph, a time-traveling graph implementation on top of Datomic.

Thilo then introduced the topic of the workshop, “Graph Databases in Life Science”, and the “Reactome” database of human protein interaction pathways. He discussed some Neo4j APIs and how they can be used to import the data from flat CSV files into a graph database. The attendees set up their development environment with the Neo4reactome project that we had prepared up front and ran the import successfully.
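
As a rough illustration of what such an import does, the sketch below reads protein-to-pathway rows from CSV and builds an in-memory adjacency map. It does not use the Neo4j APIs or the Neo4reactome code from the workshop; the column names `accession` and `pathway`, the function name `load_graph` and the two sample rows are assumptions, and the real Reactome files are laid out differently.

```python
import csv
import io
from collections import defaultdict


def load_graph(csv_file):
    """Turn (accession, pathway) rows into protein -> set of pathways."""
    involved_in = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # Each row becomes one INVOLVED_IN edge from a protein to a pathway.
        involved_in[row["accession"].strip()].add(row["pathway"].strip())
    return involved_in


# Illustrative input only; a real import would open a file on disk.
SAMPLE_CSV = """accession,pathway
P69905,Metabolism
P68871,Metabolism
"""

graph = load_graph(io.StringIO(SAMPLE_CSV))
```

In a real graph database import each distinct accession and pathway would become a node and each row a relationship, but the row-by-row shape of the loop stays the same.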

After importing the data we looked at some use cases, first visualizing pathways in the Neo4j web UI and then running several queries in Cypher, Neo4j’s query language, to find certain proteins (HBA and HBB) and their interaction pathways.

An example task looked like this:

Find the common pathways of HBA and HBB

Both proteins are involved in a number of pathways, each of which is easy to find with a query. Here we want to retrieve only the pathways that the two proteins have in common.

    START proteinA=node:proteins(accession = "P69905"),
          proteinB=node:proteins(accession = "P68871")
    MATCH (proteinA)-[:INVOLVED_IN]->(pathway)<-[:INVOLVED_IN]-(proteinB)
    RETURN pathway

Results

Metabolism
O2/CO2 exchange in erythrocytes
Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes
Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
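
Conceptually, the MATCH pattern above keeps only those pathways that are reachable from both proteins, which amounts to a set intersection. The following sketch shows that idea in plain Python; it is not how Neo4j evaluates the query. The function name `common_pathways` is an assumption, and the pathway named "Hypothetical pathway only for HBA" is made up to show that non-shared pathways are filtered out.

```python
def common_pathways(involved_in, accession_a, accession_b):
    """Return the pathways that both proteins are involved in, sorted by name."""
    pathways_a = involved_in.get(accession_a, set())
    pathways_b = involved_in.get(accession_b, set())
    return sorted(pathways_a & pathways_b)


# Small excerpt for illustration; the last entry for P69905 is invented.
involved_in = {
    "P69905": {
        "Metabolism",
        "O2/CO2 exchange in erythrocytes",
        "Hypothetical pathway only for HBA",
    },
    "P68871": {
        "Metabolism",
        "O2/CO2 exchange in erythrocytes",
    },
}

shared = common_pathways(involved_in, "P69905", "P68871")
```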

After the workshop the discussions continued over a broad range of topics.

I want to thank Lennart Martens, Thilo Muth and Rik Van Bruggen again for organizing such a great workshop, and of course Pablo Pareja, Davy Suvee and Anthony Liekens for presenting.

We started a “neo4j-biotech” Google group some weeks ago and would like to invite everyone to join this forum to discuss biotech topics with colleagues who share the same background and vocabulary.

Cheers,

Michael Hunger, Neo4j Community Team