By Thomas Kelder & Marijana Radonjic, EdgeLeap | July 21, 2015
[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]
Life sciences deal with complex, dynamic systems composed of interconnected elements driving health and disease (e.g. molecules, cells, organs, environmental factors, etc.). Graph databases such as Neo4j are very well suited to capture and model such complex relations, especially when it comes to entangled problems like heart failure. At EdgeLeap, we help our clients in the clinical, pharmaceutical and life sciences industries keep up with the quick pace of developments in the data science field and bridge the gap between data generation and data-driven decision-making. In this post, we highlight how we use Neo4j to organize patient information in the context of biomedical knowledge to help clinical researchers grasp the mechanisms driving heart failure, ultimately leading to improved patient care.
Data-Driven Healthcare
In healthcare today, vast amounts of data are being collected at levels ranging from genomics to electronic health records to wearable health-monitoring sensors. This presents both a challenge and an opportunity: to integrate these data into a coherent picture of a person’s health and to improve diagnoses and therapeutic strategies. In the EU FP7 HOMAGE consortium, clinical researchers focus on early detection and prevention of heart failure:
- How can we predict if a patient will develop heart failure?
- Why does the disease progress differently in some patients?
- Why do patients respond differently to treatment?
Ever-Changing Maps of Biology
Integrating data and knowledge in life sciences means working with an incomplete and ever-changing model of how our bodies work and what we know about them. This poses both practical and conceptual challenges.
One of the practical hurdles for computing biology is that biologists and clinicians describe things differently depending on context, and these “labels” are often highly ambiguous. For instance, biologists in different subdomains seem to speak different languages: a single protein can be described by many different names (e.g., the protein “Fas cell surface death receptor” is referred to as either APT1, CD95, FAS1, APO-1, FASTM, ALPS1A, or TNFRSF6) and a single name can refer to different proteins (e.g., the protein “Fatty acid synthase” can also be referred to as FAS). There are many great initiatives aiming to capture and structure this information with identifier systems and ontologies, but even these are often redundant and disconnected. There are at least 35 different ontologies describing heart failure and associated phenotypes, and these are partly overlapping and partly tuned to a specific, unique application. And as our knowledge of biology progresses, these models continuously need to change. For example, large parts of our DNA that were originally deemed junk turn out to be important players in the system.
On the conceptual side, being born out of an evolutionary process, biology is immensely more complex to model than any human-designed system. For example, imagine a road network connecting two cities. If we translated this to biology, describing the way two organs communicate, the map would be dense with junctions, crossroads and bridges without any apparent logic at first sight. To find your way, you would depend on a continuously changing map: within milliseconds, roads connect, disconnect, branch, close or open depending on the state of other roads and the location of traffic. Furthermore, it would be an incomplete map, since the system is so complex that decades of science have not yet charted all possible routes. A molecule can have a different functional meaning depending on whether it is measured in the blood or in an organ. Everything is dynamic: molecules are processed, cleaved and modified, changing their function along the way, while cells grow, die and move around. Everything is connected, and these connections change depending on context, time and environmental triggers.
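To make the naming ambiguity described above concrete, here is a minimal Python sketch of an alias index that maps each name to every protein it may refer to. The synonym lists come from the example in the text; listing FAS under the Fas receptor, the `ALIASES` table and the function names are our own illustrative assumptions, not part of the HOMAGE platform.

```python
from collections import defaultdict

# Synonyms from the example in the text. The dictionary keys are plain
# descriptive names used for illustration, not database identifiers.
ALIASES = {
    "Fas cell surface death receptor": [
        "FAS", "APT1", "CD95", "FAS1", "APO-1", "FASTM", "ALPS1A", "TNFRSF6",
    ],
    "Fatty acid synthase": ["FAS"],
}


def build_alias_index(aliases):
    """Map every upper-cased name or synonym to the set of proteins it may denote."""
    index = defaultdict(set)
    for protein, names in aliases.items():
        index[protein.upper()].add(protein)
        for name in names:
            index[name.upper()].add(protein)
    return index


def resolve(name, index):
    """Return all candidate proteins for a name, sorted; empty list if unknown."""
    return sorted(index.get(name.strip().upper(), set()))


ALIAS_INDEX = build_alias_index(ALIASES)

if __name__ == "__main__":
    for query in ("CD95", "FAS", "unknown"):
        print(query, "->", resolve(query, ALIAS_INDEX))
```

A lookup that returns more than one candidate, as it does for FAS here, is exactly the case where context (tissue, data source, ontology) is needed before the name can be mapped to a single node.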
Graph Databases to the Rescue
To be able to deal with these challenges, graph database technology is becoming an indispensable part of the toolbox for data scientists in life sciences. Graph databases help model relations between entities and allow for flexible and agile data models that can accommodate new insights and the changing needs of researchers. After relevant resources have been selected, converted and mapped to the same vocabularies, the graph database stores this information in a format that facilitates querying relations and paths in an effective and scalable way.
In our example, the Neo4j database for the HOMAGE heart failure network analysis platform contains over 130 thousand nodes and over 6.5 million relationships. Nodes cover different types of information: parameters measured in patients, diseases, phenotypes, biological processes, drugs, molecules, genes and miRNAs. The relationships describe how these entities are associated with each other at different levels. For example, relations describe patient parameters that correlate with each other based on measurements in a specific cohort, proteins that are known to be associated with a disease, enzymes that metabolize drugs, and transcription factors that regulate the expression of a gene. By integrating selected information from 20 different public databases, such as DisGeNET, DrugBank, WikiPathways and PubMed, the platform builds a comprehensive picture of knowledge relevant to heart failure.
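As a rough illustration of what typed nodes, typed relationships and path queries look like, the following Python sketch builds a tiny in-memory graph and finds the shortest path between two entities. It is not the HOMAGE data model: the node names (`drug_A`, `enzyme_X` and so on), the relationship type names and the breadth-first search are all placeholders we assume for the example, and a graph database such as Neo4j would answer this kind of question with a query rather than hand-written traversal code.

```python
from collections import deque

# Hypothetical nodes: name -> node type. Placeholders, not real biomedical facts.
NODES = {
    "drug_A": "Drug",
    "enzyme_X": "Protein",
    "tf_Y": "Protein",
    "gene_Z": "Gene",
    "heart_failure": "Disease",
    "param_P": "PatientParameter",
}

# Hypothetical relationships as (source, relationship type, target) triples.
EDGES = [
    ("enzyme_X", "METABOLIZES", "drug_A"),
    ("tf_Y", "REGULATES_EXPRESSION_OF", "gene_Z"),
    ("enzyme_X", "ASSOCIATED_WITH", "heart_failure"),
    ("gene_Z", "ASSOCIATED_WITH", "heart_failure"),
    ("param_P", "CORRELATES_WITH", "gene_Z"),
]


def build_adjacency(edges):
    """Index relationships by node, ignoring direction so paths can be walked both ways."""
    adjacency = {}
    for source, rel_type, target in edges:
        adjacency.setdefault(source, []).append((rel_type, target))
        adjacency.setdefault(target, []).append((rel_type, source))
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns [node, rel, node, ...] or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for rel_type, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [rel_type, neighbor])
    return None


ADJACENCY = build_adjacency(EDGES)

if __name__ == "__main__":
    print(shortest_path(ADJACENCY, "drug_A", "param_P"))
```

The returned path alternates node names and relationship types, so it reads as a chain of evidence linking, in this toy case, a drug to a patient parameter.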
Navigating the Heart Failure Network
With all relevant parameters and knowledge in a graph database, HOMAGE researchers can now query and mine this information effectively. Bioinformaticians use the platform to build data analysis scripts and applications on top of the knowledge base to generate predictive models, quantify network topology, calculate centrality, and overlay time-resolved patient data (Figure 1), in order to identify patterns and key players that may lead to new biomarkers or drug targets.
Figure 1: An example of the outcome of a bioinformatics analysis combining patient data with the network analysis platform. A network model reveals different molecules (nodes, scaled by centrality) and mechanisms (colored network clusters), relevant at different time points after a cardiac event.
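The post does not say which centrality measure was used to scale the nodes in Figure 1. As one simple possibility, the sketch below computes degree centrality (the fraction of other nodes a node is directly connected to) on a toy undirected network. The molecule labels and the edge list are invented for illustration.

```python
def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (node, node) pairs.

    Each node's score is its number of distinct neighbors divided by (n - 1),
    where n is the number of nodes that appear in the edge list.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Hypothetical toy network; labels are placeholders, not real molecules.
EXAMPLE_EDGES = [
    ("mol_A", "mol_B"),
    ("mol_A", "mol_C"),
    ("mol_A", "mol_D"),
    ("mol_C", "mol_D"),
]

CENTRALITY = degree_centrality(EXAMPLE_EDGES)

if __name__ == "__main__":
    # List nodes from most to least central.
    for node, score in sorted(CENTRALITY.items(), key=lambda item: -item[1]):
        print(f"{node}: {score:.2f}")
```

Other measures, such as betweenness or eigenvector centrality, weigh a node's position in the wider network rather than only its direct neighbors, and may highlight different key players.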
Jointly Advancing Heart Failure Research
Although originally built for bioinformaticians, the platform is rapidly gaining the attention of clinical researchers as a direct way to systematically and quickly query their favorite heart failure biomarkers against existing knowledge (Figure 2).
Figure 2: Example of a Cypher query on the network analysis platform. For a given class of drugs, this query finds the biomarkers that may be affected by intake of this drug and the mechanisms through which this link occurs. This helps clinicians decide which biomarkers to focus on in specific patient subpopulations.
Today, we even have medical doctors playing with Cypher queries, demonstrating another important implication of this project: a shift in mindset towards a truly interdisciplinary effort of clinicians and data scientists. In times of increasingly datafied healthcare, this synergy is becoming a prerequisite for effectively translating research innovation into improved patient care. As for the data science, bringing it closer to the end user in an intuitive and interactive manner will help us move from generating big data to seeing the big picture. Graph databases such as Neo4j provide an important part of the toolset to get us there.
About the Author
Thomas Kelder & Marijana Radonjic, EdgeLeap
Thomas is the co-founder and CSO of EdgeLeap. He has nearly ten years of experience in bioinformatics research, with a focus on network biology, data integration and data visualization. Thomas studied Biomedical Engineering at the Eindhoven University of Technology, followed by a PhD in bioinformatics at Maastricht University. Thomas contributes to several leading open source projects in the bioinformatics field and has participated in the annual Google Summer of Code projects, both as a student and as a mentor.
Marijana is a co-founder and CEO of EdgeLeap. Her expertise focuses on implementing network and systems biology concepts in the research of life science industries. At the time Marijana obtained her BSc and MSc in Molecular Biology and Physiology, the human genome had just been sequenced, giving rise to an expansion of genomics technologies and a promise of understanding life’s complexity at a whole new level.