Building the Largest Knowledge Graph of Life on Earth
Basecamp Research is transforming biotech, using Neo4j’s graph database to map Earth’s biodiversity. The team at Basecamp has built partnerships with nature parks across 5 continents to collect biological and chemical data from more than 50% of global biomes, addressing a fundamental knowledge gap: we know less than 0.001% of our planet’s biodiversity.
Their groundbreaking knowledge graph pairs their proprietary protein and genome sequences with environmental and chemical data, revealing intricate biological networks. This graph, containing over 5 billion relationships, has already expanded known proteins by 50%.
As a result, Basecamp’s knowledge graph has become a superior resource for protein design applications across the bioeconomy and for generative AI models. The graph also traces all data back to its geographic origin, which enables the company to ensure that commercial successes in the Life Science industry are shared with the biodiversity stakeholders from whom the data originated.
Neo4j’s scalable graph solution has proved vital for managing complex, interconnected data, revealing previously unknown genomic relationships, and identifying new proteins.
With Neo4j powering its ambitious venture, Basecamp Research is fostering biotechnological breakthroughs that promise to reshape the production of drugs, food, and diagnostics. This game-changing approach underscores the transformative potential of graph technology in leveraging patterns and relationships in complex, interconnected data.
“Neo4j helps us operate at scale more efficiently and the Graph Data Science algorithms help us get our work done more efficiently. The work we’re doing today wouldn’t be possible without Neo4j.”
Can the solutions to humanity’s problems be found at the ends of the earth? For Basecamp Research, this is an existential question worth exploring, quite literally. The biotech firm’s 20-person crew of field researchers, machine learning experts, and scientists believe the answer can be found in the world’s biodiversity.
They have built the world’s largest knowledge graph of the Earth’s natural biodiversity by going to the most extreme ends of the world and taking samples from underexplored environments.
All biotechnological products and AI algorithms used in the Life Sciences are fundamentally rooted in our understanding of life on Earth. Yet, with an estimated trillion species inhabiting our planet, more than 99.9% of them remain undiscovered and unstudied. This vast pool of unknown lifeforms leads to an inherent bias in our public genome sequencing databases, making them highly unrepresentative of the true diversity of life. Basecamp addresses this knowledge gap by sourcing a more inclusive and representative range of data for biotechnological products and AI applications – a strategic move that empowers researchers to design innovative proteins previously thought unachievable, more accurately identify the most promising candidates for experimentation, and develop superior drugs, food products, and diagnostics. Ultimately, these advancements will have a profound impact, benefiting not only humanity but also our planet.
This endeavor requires an enormous genomic database. However, unlike public databases, which are often organized as lists or catalogs, Basecamp Research enhances the value of every protein and genome sequence by associating it with relevant environmental and chemical data. This approach provides a comprehensive understanding of complex biological networks and their interactions with their surroundings that only a graph solution can uncover. Today, Basecamp Research’s knowledge graph, BaseGraph™, which is built on Neo4j, contains over 5 billion biological relationships, with 500 million new ones uncovered every 4 weeks, according to the company. It has increased the number of proteins known to science by 50%, expanded predictive discovery, and revealed new insights into how life on Earth works.
Better knowledge of biodiversity also leads to better commercial success. As Forbes put it, the “global bioeconomy is about to take off as manufacturers choose biology as the method of choice to effectively produce high-performance sustainable products. Synthetic biology is at the leading edge of the $4 trillion gold rush.”
Navigating Data Complexity and Connectivity
Choosing the right database to handle the complex, interconnected data was a crucial decision for Basecamp Research.
“My first instinct was ‘stick it all in tables and JOIN it’,” said Saif Ur-Rehman, Data Engineering Team Lead at Basecamp Research. So that’s how they started out. But after exploring relational and several NoSQL database options, graph proved to be the most logical choice for data that was so highly connected and variable. As Basecamp’s CTO Phil Lorenz observes, “Life works as a network, not as a list.”
The data collection process starts with a legal permit to collect an environmental sample. The process then calls for bringing in all metadata surrounding the sample, for example, the temperature, the pH of the soil, and hundreds of other variables. Then, the team at Basecamp extracts and annotates the DNA of the (micro)organisms in the sample. To make this process smoother, Basecamp has built a fully automated annotation pipeline called BaseScan, which generates millions of biological labels and annotations for each sample and integrates them into BaseGraph automatically.
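The chain from permit to sample to annotation can be pictured as linked records. The following is only a minimal sketch of that idea and not Basecamp's actual schema: the class names, fields, identifiers, and values are all hypothetical, and plain Python dictionaries stand in for the graph.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    sample_id: str
    permit_id: str                      # hypothetical ID of the legal collection permit
    location: str
    metadata: dict = field(default_factory=dict)   # e.g. temperature, soil pH


@dataclass
class Annotation:
    sequence_id: str
    label: str
    sample_id: str                      # link back to the originating sample


def trace_origin(annotation, samples):
    """Follow an annotation back to the location and permit of its sample."""
    sample = samples[annotation.sample_id]
    return sample.location, sample.permit_id


# Invented example records, for illustration only.
samples = {
    "S1": Sample("S1", "PERMIT-0001", "example site A",
                 {"temperature_c": 4.0, "soil_ph": 5.5}),
}
sample_annotations = [
    Annotation("seq-1", "example label", "S1"),
    Annotation("seq-2", "another example label", "S1"),
]

origins = [trace_origin(a, samples) for a in sample_annotations]
print(origins)
```

Because every annotation keeps a link to its sample, any downstream result can be traced back to where and under which permit the material was collected.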
“So you’ve very quickly just gone from one biological entity to millions and millions of data points,” Ur-Rehman said. “We did an empirical study, and we also thought about it theoretically. Fundamentally, graph won out, because any piece of annotation that you have in the molecule is patchy. You might have five pieces of annotation on one molecule, and none on another. Relational databases will not handle that well. You end up with a whole bunch of tables with a lot of N/As in them. Which is not particularly useful for querying or performance purposes.”
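The patchiness Ur-Rehman describes can be illustrated with a toy comparison. This sketch assumes three invented proteins with invented annotation kinds. It contrasts a single wide table (one column per annotation kind) with a list of relationships (one entry per known fact). It is not a benchmark, and it ignores other relational designs.

```python
# Toy data: protein names, annotation kinds, and values are invented for illustration.
annotations = {
    "protein_A": {"enzyme_class": "hydrolase", "domain": "D1", "optimal_pH": 7.5,
                  "host_taxon": "T1", "thermostable": True},
    "protein_B": {},                      # no annotation at all
    "protein_C": {"domain": "D2"},        # a single annotation
}


def wide_table(annotations):
    """Relational-style layout: one column per annotation kind, None where missing."""
    columns = sorted({kind for labels in annotations.values() for kind in labels})
    rows = {name: [labels.get(col) for col in columns]
            for name, labels in annotations.items()}
    return columns, rows


def edge_list(annotations):
    """Graph-style layout: one (protein, relationship, value) triple per known fact."""
    return [(name, "HAS_" + kind.upper(), value)
            for name, labels in annotations.items()
            for kind, value in labels.items()]


columns, rows = wide_table(annotations)
edges = edge_list(annotations)

total_cells = len(rows) * len(columns)
empty_cells = sum(cell is None for row in rows.values() for cell in row)

print(f"wide table: {empty_cells} of {total_cells} cells are empty")
print(f"edge list:  {len(edges)} relationships, none empty")
```

In the wide layout every new annotation kind adds a column that is empty for most proteins, whereas the relationship list grows only with the facts that are actually known.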
Deciding on a Graph Database
At its core, Basecamp Research’s business is a graph database – which means that its success depends on leveraging the right database technology. So the question becomes less a matter of why graph, and more why Neo4j?
“The graph nature of what we’re building is a key part of our product offering and of our commercial solutions,” said Lorenz. “A huge advantage of using Neo4j is that it offers some low code solutions that our commercial team can interact with very easily. That makes a lot of processes very scalable.”
Neither Lorenz nor any one of the founding team members came from a graph background, but that was no obstacle. “It has been a seamless journey, which I attribute mostly to Neo4j’s support and the way they’ve helped us out,” Lorenz said, noting that Neo4j had convinced him of the power of graphs, with new graph deep learning scientists having joined the Basecamp team. Lorenz was also drawn to Graph Data Science, Neo4j’s advanced analytics and machine learning (ML) solution, for getting more out of BaseGraph, from uncovering unknown genomic relationships to identifying new proteins.
Lorenz also highlighted the Bloom data visualization tool that Neo4j makes available to its users. “Bloom and a lot of the local solutions that we can find in Neo4j are super attractive for our commercial team and product scientists,” he said.
Basecamp conducted additional due diligence on other graph database tools on the market, but the choice was clear. “For me Neo4j was always the really obvious choice because of its local solutions and the support we knew we were going to get,” Lorenz said.
“Neo4j is where all of our data lives. We’re at 5 billion relationships now, and it’s growing every day because our teams are constantly out there and the data just keeps coming,” Ur-Rehman said.
Using Connected Data to Uncover the Hidden World of Microbial Dark Matter
With Neo4j at its core, Basecamp Research developed a multidimensional knowledge graph that maps three broad categories:
- Environmental, geological, and chemical conditions
- Microecology, metagenomics, and genomic context
- Deep learning-derived functional and structural protein characteristics
The vast and connected network of relationships in the knowledge graph allows the research team to observe hidden rules governing protein evolution, and then use those rules to generate protein design insights that ultimately shed light on “microbial dark matter,” a term referring to the overwhelming majority of microorganisms that remain unexplored and uncharacterized. This work expands our understanding of the world’s biological diversity.
Biotechnology Breakthroughs With Neo4j
By leveraging graph embeddings available through Neo4j Graph Data Science, Basecamp is able to represent proteins not through their sequence alone, but with essential contextual information that can show how these proteins will interact, behave, and ultimately perform. Using context for downstream tasks such as annotating dark matter proteins has also enabled the product team to annotate gene-editing systems used in therapeutic applications that have 0% sequence similarity to anything in public databases. This offers their therapeutics customers novel biology as well as a much greater ability to generate new IP when launching a new product in the market.
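To make the idea of context-aware representations concrete, here is a deliberately simplified sketch. It is not one of the Neo4j Graph Data Science embedding algorithms and not Basecamp's method: it only concatenates a hypothetical sequence feature vector with the average of one-hot vectors for the context nodes a protein is linked to. All protein names, context names, and numbers are invented.

```python
import math

# Hypothetical inputs: the vectors and context names below are invented.
sequence_features = {
    "p1": [1.0, 0.0],
    "p2": [1.0, 0.0],   # same sequence features as p1
    "p3": [0.0, 1.0],   # different sequence features
}
context_edges = {
    "p1": ["hot_spring", "acidic_soil"],
    "p2": ["deep_sea"],                    # different context from p1
    "p3": ["hot_spring", "acidic_soil"],   # same context as p1
}
context_nodes = ["hot_spring", "acidic_soil", "deep_sea"]


def context_vector(protein):
    """Mean of the one-hot vectors of the context nodes linked to the protein."""
    neighbours = context_edges[protein]
    return [sum(1.0 for n in neighbours if n == c) / len(neighbours)
            for c in context_nodes]


def embed(protein):
    """Concatenate sequence features with the aggregated context vector."""
    return sequence_features[protein] + context_vector(protein)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


seq_only_p1_p2 = cosine(sequence_features["p1"], sequence_features["p2"])
with_ctx_p1_p2 = cosine(embed("p1"), embed("p2"))
seq_only_p1_p3 = cosine(sequence_features["p1"], sequence_features["p3"])
with_ctx_p1_p3 = cosine(embed("p1"), embed("p3"))

print(f"p1 vs p2: sequence only {seq_only_p1_p2:.3f}, with context {with_ctx_p1_p2:.3f}")
print(f"p1 vs p3: sequence only {seq_only_p1_p3:.3f}, with context {with_ctx_p1_p3:.3f}")
```

In this toy setup, two proteins that look identical by sequence features become distinguishable once context is included, and two proteins with unrelated sequence features gain some similarity from shared context.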
The knowledge graph Basecamp has built over the past two years enables even more complex therapeutic product development opportunities. Leading academic groups have recently discovered a novel technology for gene-writing applications based on enzymes called Large Serine Recombinases (LSRs). These enzymes would enable us not just to edit pieces of DNA in the human genome, but to write entire genes into it, opening up even more therapeutic possibilities. Compared with what can be mined from public data, BaseGraph™ has 30x more of these LSRs, capturing a wealth of potential for this technology. Their representation in a graph also makes prioritisation and characterisation for therapeutic applications much more achievable than with public data.
Basecamp’s work has also yielded breakthroughs in the chemical industry. As just one example, a $16 billion chemical manufacturing customer that had taken two years to optimize a specific enzyme was able to achieve this result in one month after signing with Basecamp, which leveraged one of Neo4j’s graph algorithms within a network of proteins in its system.
By using Basecamp Research’s knowledge graph to uncover new connections between the genomic and taxonomic contents of samples collected from around the world, the team will continue to make new discoveries and advance biotechnology, ultimately enabling the design of unique protein products for improved drugs, food, and diagnostics.
A Biological Data Resource Purpose-Built for AI
Over the past two years, a number of exciting advances in the Life Sciences, such as AlphaFold2, were enabled through the application of deep learning models to biological sequence data. With orders of magnitude greater sequence diversity than the data these models were trained on, Basecamp can leverage this data advantage to improve the performance of these models. Many of these advances in protein AI are, however, centered only on intrinsic features of the protein, such as its sequence or structure. The metadata in Basecamp’s knowledge graph, by contrast, enables its deep learning team to create representations of proteins that also capture their context, for example through graph embeddings that are part of Neo4j’s Graph Data Science library.
What’s next? Basecamp has already been working on large language models to design proteins, leveraging a ChatGPT-style model for enzyme sequence generation called ZymCtrl [1]. With BaseGraph being purpose-built for generative AI, the team is now integrating large language models with their entire knowledge graph. “We are currently upgrading BaseGraph to an LLM-augmented knowledge graph. This would essentially enable us to have design-support for our customers from life on earth itself in the way it’s captured in our knowledge graph with over 5 billion relationships,” said Lorenz. “Imagine you could ‘talk’ to nature, or have our planet’s biodiversity as your copilot when designing biotech products. Graph with generative AI will make this possible for us.”
[1] Munsamy et al. (2022). ZymCtrl: a conditional language model for the controllable generation of artificial enzymes. NeurIPS 2022.