Flexible Data Models Provide Life-Saving Insights into Complex Cancer Research Data
The IRCC team performs molecular and biological tests on cancer samples that have been collected from hospitals around Europe. They needed to develop a laboratory information management system to track the data — such as the biological and molecular properties of the cancer samples — and the subsequent scientific procedures performed on these samples. This would feed a database used to analyze data and generate high-level biological hypotheses.
However, structurally complex data of this kind tends to be hierarchical, with intricate and frequently changing relationships, which called for several integrated data models. Their initial tool, the relational database MySQL, required a large number of JOINs and resulted in sluggish queries, as well as challenges with data integration and coherence.
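To illustrate why hierarchical data strains a relational model, here is a minimal Python sketch of a sample lineage stored as a graph. The sample IDs, the `DERIVATIVES` mapping, and the `descendants` helper are all hypothetical and are not taken from the IRCC system. In a relational table with a parent column, reaching each additional level of the hierarchy typically costs another self-JOIN (or a recursive query), whereas a graph traversal simply follows relationships to whatever depth exists.

```python
from collections import deque

# Hypothetical lineage, invented for illustration: each key is a sample ID
# and each value lists the samples derived directly from it.
DERIVATIVES = {
    "tumor-1": ["xenograft-1", "aliquot-1"],
    "xenograft-1": ["xenograft-2"],
    "xenograft-2": ["dna-extract-1"],
    "aliquot-1": [],
    "dna-extract-1": [],
}


def descendants(graph, root):
    """Return every sample derived from `root`, at any depth, breadth-first."""
    seen = []
    queue = deque(graph.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # guard against visiting a sample twice
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


print(descendants(DERIVATIVES, "tumor-1"))
```

The traversal does not need to know in advance how deep the hierarchy goes, which is the property that a fixed chain of JOINs lacks.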
Whatever tool the researchers chose also needed to be available to two distinct audiences: collaborators who were sharing their data with the IRCC, and other groups performing similar research who needed access to the IRCC's software, all with the goal of working collectively to build cancer research knowledge.
This required a flexible, efficient tool that could organize and track cancer samples, as well as their molecular and biological features; serve as a data mining resource; and function as a database for tracking procedures.
“Our application relies on complex hierarchical data, which required a more flexible model than the one provided by the traditional relational database model,” said Andrea Bertotti, MD, the overall manager of the project.
IRCC has developed a production version of their database that relies on MySQL to store the legacy data and track entities, characteristics and laboratory procedures. This data is sent to Neo4j via scripts, and the database also continually imports data from publicly available resources. They use MongoDB to store the raw, complex data and rely on Neo4j for all the rest: finding complex relationships, analyzing their experimental procedures, and modeling the genomic domain and complex semantics for genomic knowledge.
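The source does not describe what the transfer scripts look like. As a rough sketch of the general idea only, the Python below turns relational-style rows into parameterized Cypher statements that a Neo4j driver could execute. The `Sample` label, the `DERIVED_INTO` relationship type, the row layout, and the `rows_to_cypher` function are assumptions made for illustration, not IRCC's actual schema or code.

```python
def rows_to_cypher(rows):
    """Turn (sample_id, parent_id) rows into (query, parameters) pairs.

    The label `Sample` and relationship type `DERIVED_INTO` are hypothetical.
    Nodes are emitted first so that every relationship can find both ends.
    """
    statements = []
    for sample_id, _parent_id in rows:
        # MERGE creates the node only if it does not already exist.
        statements.append(("MERGE (s:Sample {id: $id})", {"id": sample_id}))
    for sample_id, parent_id in rows:
        if parent_id is not None:
            statements.append((
                "MATCH (p:Sample {id: $parent}), (c:Sample {id: $child}) "
                "MERGE (p)-[:DERIVED_INTO]->(c)",
                {"parent": parent_id, "child": sample_id},
            ))
    return statements


# Hypothetical rows, as they might come out of a relational table.
rows = [("tumor-1", None), ("xenograft-1", "tumor-1")]
for query, params in rows_to_cypher(rows):
    print(query, params)
```

Using `MERGE` rather than `CREATE` keeps such a script safe to rerun, since repeated imports do not duplicate nodes or relationships.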
Although they initially tried to transpose the relational table models directly into the graph, they now plan to remodel their database, using Neo4j as a more abstract layer that generates data models for each instance, in order to integrate an abstract ontology that dictates relationships.