The project will be supported by
- one of the largest chemistry companies in the world,
- one of the main developers of RDKit and
- Neo4j, Inc.
If you’re interested, please get in touch with Stefan Armbruster at firstname.lastname@example.org.
RDKit & Neo4j
Academic and industrial research projects in areas such as medicinal chemistry and materials science accumulate large volumes of data of completely different natures, typically recipe, characterization, and performance data. Both the sheer amount of data and its inherent complexity make it difficult for researchers to join the data in their projects optimally.
However, properly integrated data is key to success. Manual data processing and integration consumes a substantial amount of working time, is error-prone, and is usually not sustainable.
The concept of knowledge graphs has proven versatile and adaptable enough to meet the data management requirements mentioned above. Beyond that, a substantial number of the world’s fastest-growing enterprises are successful because they realized that there is more value in considering the relations between “things” than in the “things” themselves.
One of the strengths of a knowledge graph is that it copes with paths of arbitrary length, a challenge frequently met in chemical research (e.g., process sequences). Moreover, the knowledge graph itself serves as an efficient communication vehicle, helping to pinpoint and resolve the complex data situations that are typical of chemical research and development.
Knowledge graphs are based on the fact that “things” are usually connected to each other by “relations.” The relation is expressed semantically, e.g., a process and a substance are connected by the verb “has product”: “process” – [“has product”] → “substance”. This represents an information triple. Combined with other triples, it forms a network: the knowledge graph. Virtually anything can be mapped to such networks.
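To make the triple idea concrete, here is a minimal sketch in plain Python, without Neo4j. It stores a few triples and follows them over an arbitrary number of hops, as a process sequence would require. The node names (`substance_A`, `process_1`, ...) and the `is_educt_of` relation are assumptions made up for illustration; only “has product” comes from the text above.

```python
from collections import defaultdict, deque

# Each triple is (subject, relation, object). All names are illustrative.
TRIPLES = [
    ("substance_A", "is_educt_of", "process_1"),
    ("process_1", "has_product", "substance_B"),
    ("substance_B", "is_educt_of", "process_2"),
    ("process_2", "has_product", "substance_C"),
]


def build_graph(triples):
    """Combine triples into an adjacency list: subject -> [(relation, object)]."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def reachable(graph, start):
    """Breadth-first traversal; returns {node: number of hops from start}."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, neighbour in graph.get(node, []):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


graph = build_graph(TRIPLES)
hops_from_a = reachable(graph, "substance_A")
print(hops_from_a)
```

Starting from `substance_A`, the traversal reaches `substance_C` after four hops without the path length being fixed in advance. A graph database offers the same kind of variable-length traversal as a built-in query feature.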
A prominent tool for instantiating knowledge graphs is Neo4j. Having been developed for more than a decade, Neo4j has not only reached a mature status but also defined standards for how to interact with graphs by means of our query language, Cypher. Neo4j is open source. A GPLv3-licensed community edition is available with only minor limitations compared to the commercial enterprise version.
For those parts of chemical research that deal with small organic molecules, functionality such as (sub-)structure search is usually indispensable. Here, RDKit has proven to be a versatile and stable tool offering a vast variety of options. RDKit can already be used in conjunction with the relational database Postgres. Taking into account the value graph databases offer, a similar integration with Neo4j would be highly desirable.
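As a rough illustration of what a substructure search looks like with RDKit's Python API, the sketch below filters a handful of molecules for a carboxylic acid group. It assumes RDKit is installed. The molecule selection, the SMARTS pattern, and the helper name `substructure_hits` are assumptions chosen for this example and are not part of the proposal.

```python
from rdkit import Chem

# Illustrative molecules given as SMILES strings (assumed example data).
MOLECULES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
    "benzoic acid": "OC(=O)c1ccccc1",
}

# SMARTS query for a carboxylic acid group (assumed example pattern).
CARBOXYLIC_ACID = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")


def substructure_hits(smiles_by_name, pattern):
    """Return the names of all molecules that contain the given pattern."""
    hits = []
    for name, smiles in smiles_by_name.items():
        mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable SMILES
        if mol is not None and mol.HasSubstructMatch(pattern):
            hits.append(name)
    return hits


hits = substructure_hits(MOLECULES, CARBOXYLIC_ACID)
print(hits)
```

Only acetic acid and benzoic acid match here. In a graph setting, a check like this could mark the nodes at which a traversal starts.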
The proposal is to marry RDKit with Neo4j to provide chemical cartridge functionality that finds entry points into the graph and efficiently prunes paths using chemical know-how while traversing the graph.