Neo4j life sciences & healthcare workshop – proceedings from Berlin

September 29, 2017


Life science researchers have been quietly embracing graph databases instead of traditional triple and relational stores, a shift often invisible to people outside the field.

On June 21, we invited a group of life science and healthcare researchers and practitioners to Berlin to share their experiences in a full-day workshop.


And a full day it was: 11 planned and 3 ad hoc presentations, plus two longer workshops in the afternoon. They covered everything from genome, proteome, pathway and systems biology model databases and interactions to actual drug development efforts and plans for improving healthcare, all with the help of graph databases.

We were stunned by the breadth of applications and by the presenters' enthusiasm for graph technology.

Thank you so much to everyone for presenting, discussing and attending. We will definitely run similar events elsewhere (North America), possibly co-located with industry and research conferences.

Daniela Butano from InterMine.org shared her observations here.

It took us a bit of time over the summer to collect the information from all presenters into a proper proceedings publication, which we are proud to publish today:

Proceedings: Neo4j Life Sciences & Healthcare Workshop, Berlin, 21 June 2017

To capture all the research and development activity in this space, we created a dedicated page listing projects and publications that use Neo4j. As part of our efforts, we also offer a “Life Sciences & Healthcare Accelerator Program” to help researchers and institutions get started with Neo4j and the appropriate licenses.

Please reach out to us and submit your papers, publications or interest for that resource collection, so others can contact you about your work.

Hoping to see you at the next workshop of this kind,

–Petra Selmer & Michael Hunger

Workshop program

To give you an impression of the breadth and depth of the presentations, here is the list of topics and presenters we enjoyed listening to and talking with during the workshop. You can find the full synopsis and slides of each talk in the proceedings.

Big Data in Genomics: How Neo4j enables personalized therapies
Martin Preusse (Knowing, Helmholtz Zentrum Munich)
Biomedical research generates vast amounts of data. New experimental technologies like DNA sequencing, metabolomics and proteomics drive the fast growth of available information and lead to a better understanding of the molecular organization of life.
Graph databases for biomedical ontologies
Simon Jupp (European Bioinformatics Institute)
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration, a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
Using Neo4j for the management of systems biology models
Dagmar Waltemath, Ron Henkel (Systems Biology & Bioinformatics, Univ. Rostock)
The characteristics of graph databases make them a natural choice for many applications in the life sciences, specifically in computational biology. Modeling and analysis of biological networks has become a necessary craft for biomedical researchers. Such networks contain a large number of biological entities (nodes) and their interactions (relations). Consequently, graph databases and graph query languages are prominent technologies to support life science research today. In our talk, we will provide examples for the use of graph databases in the field of systems biology, focusing on the storage, integration and retrieval of models, simulation experiments and pathway data. We will furthermore highlight challenges that we face in our own research when performing graph-similarity measures on XML-encoded simulation models.
Questions & Answers: Neo4j as a tool in Network Biology
Georg Summer (Maastricht University)
Biology is complex. Diseases complicate it further. The data collected and measured is complicated. If we were to understand this complexity in full, dissecting and, ultimately, modulating biological systems would be within our grasp. Yet we are only at the beginning of this journey, and all journeys need equipment. Systems and network biology is one of the tools for handling the information and data we have in front of us. We developed a set of software tools around Neo4j to build networks for biological problems as needed.
Proteomics & Graph Databases: the symbiosis of associations
Alejandro Brenes Murillo (Centre for Gene Regulation & Expression, Dundee, Scotland)
The proteome is the entire set of proteins that are produced or modified by an organism. It varies with time, stress, environmental conditions or the distinct requirements that a cell might have. This talk explores how graph databases can be useful for proteome analysis.
Extending the MPA graph database structure for comparing multiple metaproteomics samples using label-free quantification
Thilo Muth & Kay Schallert (Robert Koch Institute, Berlin)
Metaproteomics, the mass spectrometry-based analysis of multi-species proteins from microbial samples, faces enormous challenges concerning analysis and interpretation of the data. To overcome these issues, we have developed and published the scientific research software MetaProteomeAnalyzer (MPA) in the recent past. The software is an intuitive open source tool for metaproteomics data analysis and interpretation, which includes multiple database search engines for protein identification and the feature to decrease data redundancy by grouping protein hits to so-called meta-proteins.
Prioritizing SNPs using the Neo4j Galaxy Interactive Environment
Thoba Lose (South African National Bioinformatics Institute)
Graph database implementations are increasingly being used within the biomedical research space, e.g., a disease network underpinned by a protein and metabolic framework (diseaseknowledgebase). We previously developed a `neostore` datatype and a Neo4j interactive environment for storing and exploring Neo4j graph databases within the Galaxy scientific workflow system. Building on this work, we generate a M. tuberculosis (Mtb) genomic database from multiple sources of annotation.
Tabloid Proteome: web of associated protein pairs, derived from mass-spectrometry based proteomics experiments
Surya Gupta (Medical Protein Research at VIB / Univ. Ghent)
We have built a protein association database, using the Neo4j graph platform, which includes protein associations derived from our analysis method. Together with the analysis information, it also includes all possible biological relations for these elements derived from existing knowledge bases. We have used graph algorithms and queries to provide more information about the protein associations, which is otherwise not available in existing databases.
Graph databases for antibody development research
Pavel Yakolev (Biocad St. Petersburg)
At the beginning of 2017, there were about 120,000 protein structures in the Protein Data Bank. About 3,000 of them are antibody structures, and there are also a few thousand structures of other immune receptors and PDZ domains. This forms a significant amount of data for protein structure and function prediction. In this talk, we present our method to normalize these structures and their features using graph databases. We will also talk about useful universal interfaces for human and automated analytics of the data.
Substantially improving healthcare outcomes and costs using a graph paradigm
Pieter van den Berg (Medical Professional Services Nederland)
A very important part of the job of healthcare professionals is applying their professional judgement. It means taking decisions under conditions of uncertainty using incomplete and somewhat unreliable information on a case-by-case basis. Unfortunately, people are bad at this because of the way our brains work. At the core of our efforts is the development of a medical decision support system that helps professionals make decisions under conditions of uncertainty and based on incomplete and possibly unreliable information. This can (only) be done within a graph paradigm.
Graph Exploration: Taking the User into the Loop
Davide Mottin (Hasso Plattner Institute (HPI), Potsdam)
The increasing interest in social networks, knowledge graphs, protein-interaction networks and many other types of networks has raised the question of how users can explore such large and complex graph structures easily. Current tools focus on graph management, graph mining, or graph visualization but lack user-driven methods for graph exploration. In many cases graph methods try to scale to the size and complexity of a real network. However, these methods miss user requirements such as exploratory graph query processing, intuitive graph explanation, and interactivity in graph exploration. While there is consensus in the database and data mining communities on the definition of data exploration practices for relational and semi-structured data, graph exploration practices are still indeterminate.
The Reactome Graph Database: Efficient Access to Complex Data Structures
Antonio Fabregat Mundo (European Bioinformatics Institute)
Reactome is a free, open source, curated and peer-reviewed knowledge base of biomolecular pathways that provides infrastructure and intuitive bioinformatics tools for search, visualisation, interpretation and analysis of pathways. The benefit of storing these data in their natural form is that they do not need to be transformed into a flat table format and can instead be persisted as originally designed. Adopting Neo4j as the graph database management system helps reduce the complexity of the database and thus allows more straightforward access to the Reactome knowledge base via Neo4j's query language, Cypher.
Exploring graph databases for biological data models in InterMine
Daniela Butano (InterMine)
InterMine is an open source data warehouse built for the integration and analysis of large-scale biological datasets. Developed at the University of Cambridge in 2002, InterMine currently has dozens of instances around the world covering a broad range of biomedically relevant organisms, bacteria, and plant life. InterMine is based on the open source RDBMS PostgreSQL, which forces all data to be modelled in tables; graph databases seem better suited to naturally modelling the network shape of biological data.
Workshop: Data modeling for systems medicine with Neo4j
Martin Preusse (Knowing, Helmholtz Zentrum Munich)
In this workshop we discuss graph models for biological systems. We cover aspects such as multi-level data modeling, integration from genome to phenotype and how we can use Neo4j to enhance analyses from GO enrichment to statistical models.
Workshop: Integrating linked life science data sources into a graph model in Neo4j
Simon Jupp (European Bioinformatics Institute)
Biological data is distributed across many databases, some of which offer data access, APIs or dumps. In this session we will demonstrate how we can query a public linked data endpoint and rapidly integrate the results into a Neo4j property graph.
