Single-omics Data Integration

Integrating omics data into Knowledge Graphs typically begins with a single data type, such as RNA-seq experiments, which provides a foundation for subsequent multi-omics integration. By representing RNA-seq data in Neo4j, researchers can move beyond static tables and embrace the inherently connected nature of transcriptomic information.

Each transcript measurement becomes a node connected to its corresponding gene, sample metadata, experimental conditions, and biological annotations. This graph structure naturally captures the relationships between differentially expressed genes, their protein products, associated pathways, and clinical phenotypes—all queryable through intuitive graph traversals. For instance, researchers can rapidly identify co-expressed gene modules across multiple experiments, trace expression changes from transcript to protein to pathway, or link expression signatures to patient outcomes. Neo4j’s native graph architecture transforms RNA-seq data from isolated datasets into an integrated molecular map that accelerates hypothesis generation and biological discovery.

Scenario

A typical research scenario compares samples from cases (people with a specific outcome) against samples from controls (people without it) to identify molecular signatures of pathology. Researchers collect tissue samples from both groups, for example liver biopsies from patients with non-alcoholic fatty liver disease (NAFLD) and from patients without it. These samples undergo RNA-seq, and the raw sequencing data is processed through bioinformatics pipelines to identify genes that are differentially expressed between the two groups. The analysed results, such as the top 10 upregulated or downregulated genes with their fold changes and statistical significance, are then stored in Neo4j.

In the graph database, each biological sample becomes a node with properties describing its tissue type, patient demographics, and group assignment (control or case). Sample nodes connect to experiment nodes representing the RNA-seq analysis, which in turn link to gene nodes for the differentially expressed transcripts. This structure makes it intuitive to query across experiments—for instance, identifying genes consistently dysregulated across multiple disease cohorts, tracing which pathways are affected, or connecting expression changes to known protein functions. The graph naturally represents the experimental design, preserving the relationship between samples, conditions, and molecular measurements in a way that supports both immediate analysis and future integration with additional omics layers.

We start from a foundational single-omics scenario that illustrates how transcriptomics data (such as RNA-seq) is modelled in a graph structure. We then progressively expand this core, demonstrating how the same framework extends to multi-omics integration and incorporates rich information from biomedical ontologies. As the model grows, more complex biological and clinical questions can be addressed directly in the graph.

Solution

This solution builds a single-omics model for RNA-seq data integration using Neo4j.

Accommodating Heterogeneous Data

Even within a single omics type like RNA-seq, experiments can vary dramatically: different tissue sources, diverse phenotypic annotations, multiple comparison groups, and varying levels of biological annotation. The graph model accommodates this heterogeneity. Projects connect to samples with flexible properties, samples link to their tissue origins and phenotypes (using standardized vocabularies like EFO), and experiments capture method-specific metadata. Critically, the Comparison node lets researchers model complex experimental designs (disease versus control, treatment versus baseline, or multiple conditions simultaneously) without restructuring the entire database.
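
As a rough illustration of how differently designed comparisons can be inspected through one query shape, the sketch below lists every Comparison node with its type and the number of case and control samples attached to it. It assumes the demo data from the Demo Data section is already loaded; the column aliases are illustrative choices, not part of the model.

```cypher
// List every comparison with its design type and sample counts.
// OPTIONAL MATCH keeps comparisons that have no directly attached samples
// (for example the meta-analysis), reporting 0 for them.
MATCH (c:Comparison)
OPTIONAL MATCH (c)-[:INCLUDES_CASE]->(caseSample:Sample)
WITH c, count(DISTINCT caseSample) AS cases
OPTIONAL MATCH (c)-[:INCLUDES_CONTROL]->(controlSample:Sample)
RETURN
  c.sid AS Comparison,
  c.type AS Type,
  cases AS Cases,
  count(DISTINCT controlSample) AS Controls
ORDER BY c.sid;
```

Because the design type is a property and the sample roles are relationships, a new kind of comparison only needs a new `type` value rather than a schema change.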

Bridging Identifiers and Knowledge

The model addresses a fundamental challenge in omics integration: connecting experimental measurements to standardized biological entities. Gene identifiers from RNA-seq analysis link to multiple ID nodes from different sources (Ensembl, NCBI, HGNC), which then map to canonical Gene nodes. These genes connect onwards to their encoded Proteins, associated Gene Ontology terms, and disease relationships. This identifier-bridging layer is crucial because different experiments and databases use different naming conventions, yet the graph structure makes these mappings explicit and traversable.
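
A minimal sketch of the identifier bridge in use, assuming the demo data from the Demo Data section is loaded: starting from an external identifier (here an NCBI gene ID), the query resolves the canonical Gene node and, where one exists, its encoded protein. The variable names `externalId` and `idSource` are illustrative.

```cypher
// Resolve an external identifier to the canonical gene and its protein.
WITH "5294" AS externalId, "NCBI" AS idSource
MATCH (id:ID {sid: externalId, source: idSource})<-[:MAPPED]-(g:Gene)
// A gene may have no protein loaded yet, so keep the match optional
OPTIONAL MATCH (g)-[:CODES]->(p:Protein)
RETURN
  g.symbol AS Gene,
  g.sid AS GeneId,
  id.source AS ExternalSource,
  id.sid AS ExternalId,
  p.sid AS ProteinId;
```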

Flexibility for Discovery

The schema's flexibility is most apparent when answering biological questions. Need to find all differentially expressed genes across multiple liver disease experiments? Traverse Disease ← Gene ← Experiment ← Sample → Tissue. Want to identify genes consistently dysregulated in a specific phenotype regardless of tissue type? Start from the phenotype and follow its samples to their experiments and genes. Looking for functional annotation? Follow Gene → Protein → GO paths. The property graph model also allows nodes and relationships to carry experiment-specific attributes (fold changes, p-values, expression levels) without forcing every experiment to share identical properties.
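
The first traversal can be sketched roughly as follows, assuming the demo data from the Demo Data section is loaded. It walks from a tissue through its samples and experiments to the measured genes, then keeps the genes that carry a prior-knowledge link to any disease. Note that tissue names are matched case-sensitively.

```cypher
// Genes measured in liver experiments that also have a known disease association.
WITH "Liver" AS tissueName
MATCH (t:Tissue {name: tissueName})<-[:TAKEN_FROM]-(:Sample)-[:HAS_EXPERIMENT]->(e:Experiment)-[v:HAS_VALUE]->(g:Gene)
MATCH (g)-[a:RELATED_TO]->(d:Disease)
// DISTINCT removes duplicates that arise when several samples share one experiment
RETURN DISTINCT
  g.symbol AS Gene,
  e.sid AS Experiment,
  v.foldChange AS FoldChange,
  d.name AS Disease,
  a.source AS EvidenceSource
ORDER BY Gene, Experiment;
```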

This adaptable structure means researchers can integrate new RNA-seq datasets incrementally, add novel annotation layers as they become available, and explore biological connections without extensive database redesign—transforming single omics analysis from a data management burden into an opportunity for insight.
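
One optional way to keep incremental loads idempotent is to declare uniqueness constraints on the `sid` property that the demo data uses as its key. The sketch below assumes Neo4j 5 constraint syntax; the constraint names are hypothetical. ID nodes are deliberately left out, because identifiers from different sources could share the same `sid` value.

```cypher
// Hypothetical constraint names; one constraint per label keyed on sid.
CREATE CONSTRAINT project_sid_unique IF NOT EXISTS
FOR (p:Project) REQUIRE p.sid IS UNIQUE;

CREATE CONSTRAINT sample_sid_unique IF NOT EXISTS
FOR (s:Sample) REQUIRE s.sid IS UNIQUE;

CREATE CONSTRAINT experiment_sid_unique IF NOT EXISTS
FOR (e:Experiment) REQUIRE e.sid IS UNIQUE;

CREATE CONSTRAINT comparison_sid_unique IF NOT EXISTS
FOR (c:Comparison) REQUIRE c.sid IS UNIQUE;

CREATE CONSTRAINT gene_sid_unique IF NOT EXISTS
FOR (g:Gene) REQUIRE g.sid IS UNIQUE;
```

A uniqueness constraint also creates a backing index, so the `MERGE` and `MATCH` lookups on `sid` do not need a label scan as the graph grows.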

Data Model

The graph data model demonstrates how Neo4j’s flexible schema elegantly addresses the inherent heterogeneity of single omics data. Unlike rigid relational databases that require predefined table structures, this graph architecture allows researchers to naturally represent the diverse experimental contexts, varied sample annotations, and multiple identifier systems that characterize real-world omics projects.

Figure 1. Single OMICs Data Model

We will continue to build on this model by adding more omics data types and ontologies in the Multi-OMICs Data Integration page.

Demo Data

The following Cypher statements will create an example graph in the Neo4j database:

// ============================================
// EXTENDED OMICS DATASET WITH COMPARISON NODES
// ============================================

// MERGE Projects
MERGE (p1:Project {sid: "PROJ001", name: "Liver Disease Study"})
MERGE (p2:Project {sid: "PROJ002", name: "Diabetes Research"})
MERGE (p3:Project {sid: "PROJ003", name: "Cross-Disease Metabolic Study"})

// MERGE Tissues
MERGE (t1:Tissue {sid: "UBERON:0002107", name: "Liver"})
MERGE (t2:Tissue {sid: "UBERON:0001264", name: "Pancreas"})
MERGE (t3:Tissue {sid: "UBERON:0000945", name: "Adipose tissue"})

// MERGE Diseases
MERGE (d1:Disease {sid: "MONDO:0005359", name: "Non-alcoholic fatty liver disease"})
MERGE (d2:Disease {sid: "MONDO:0005015", name: "Type 2 Diabetes"})
MERGE (d3:Disease {sid: "MONDO:0011382", name: "Metabolic Syndrome"})

// MERGE Phenotypes (EFO)
MERGE (ph1:EFO {sid: "EFO:0004220", name: "Insulin resistance"})
MERGE (ph2:EFO {sid: "EFO:0001421", name: "Elevated triglycerides"})
MERGE (ph3:EFO {sid: "EFO:0004465", name: "Hepatic steatosis"})
MERGE (ph4:EFO {sid: "EFO:0000685", name: "Obesity"})

// MERGE Samples
MERGE (s1:Sample {sid: "SAMPLE001", name: "Patient_001_Liver", condition: "NAFLD"})
MERGE (s2:Sample {sid: "SAMPLE002", name: "Control_001_Liver", condition: "Healthy"})
MERGE (s3:Sample {sid: "SAMPLE003", name: "Patient_002_Pancreas", condition: "T2D"})
MERGE (s4:Sample {sid: "SAMPLE004", name: "Control_002_Pancreas", condition: "Healthy"})
MERGE (s5:Sample {sid: "SAMPLE005", name: "Patient_003_Liver", condition: "NAFLD"})
MERGE (s6:Sample {sid: "SAMPLE006", name: "Patient_004_Adipose", condition: "MetSyn"})
MERGE (s7:Sample {sid: "SAMPLE007", name: "Control_003_Adipose", condition: "Healthy"})

// MERGE Experiments
MERGE (e1:Experiment {sid: "EXP001", type: "RNA-seq", platform: "Illumina NovaSeq"})
MERGE (e2:Experiment {sid: "EXP002", type: "RNA-seq", platform: "Illumina NovaSeq"})
MERGE (e3:Experiment {sid: "EXP003", type: "RNA-seq", platform: "Illumina NovaSeq"})

// ============================================
// EXTENDED COMPARISON NODES
// ============================================

// Basic disease vs control comparisons
MERGE (comp1:Comparison {
  sid: "COMP001",
  name: "NAFLD vs Control (Liver)",
  type: "disease_vs_control",
  tissue: "Liver",
  n_case: 2,
  n_control: 1,
  analysis_date: "2024-01-15"
})

MERGE (comp2:Comparison {
  sid: "COMP002",
  name: "T2D vs Control (Pancreas)",
  type: "disease_vs_control",
  tissue: "Pancreas",
  n_case: 1,
  n_control: 1,
  analysis_date: "2024-01-20"
})

MERGE (comp3:Comparison {
  sid: "COMP003",
  name: "Metabolic Syndrome vs Control (Adipose)",
  type: "disease_vs_control",
  tissue: "Adipose",
  n_case: 1,
  n_control: 1,
  analysis_date: "2024-02-01"
})

// Cross-tissue comparison (different diseases, different tissues)
MERGE (comp4:Comparison {
  sid: "COMP004",
  name: "NAFLD Liver vs T2D Pancreas",
  type: "cross_tissue_disease",
  tissue: "Liver vs Pancreas",
  description: "Compare molecular signatures between NAFLD and T2D",
  analysis_date: "2024-02-10"
})

// Phenotype-based comparison
MERGE (comp6:Comparison {
  sid: "COMP006",
  name: "Insulin Resistant vs Non-Resistant",
  type: "phenotype_stratified",
  stratification: "Insulin resistance status",
  description: "Compare samples with vs without insulin resistance",
  analysis_date: "2024-02-20"
})

// Multi-disease meta-analysis
MERGE (comp8:Comparison {
  sid: "COMP008",
  name: "Pan-Metabolic Disease Signature",
  type: "meta_analysis",
  diseases: "NAFLD, T2D, MetSyn",
  description: "Common molecular signatures across metabolic diseases",
  analysis_date: "2024-03-05"
})

// ============================================
// MERGE GENES with expression data
// ============================================

MERGE (g1:Gene {sid: "ENSG00000105851", symbol: "PIK3CG", name: "Phosphatidylinositol-3-kinase catalytic gamma", source: "Ensembl"})
MERGE (g2:Gene {sid: "ENSG00000169245", symbol: "CXCL10", name: "C-X-C motif chemokine ligand 10", source: "Ensembl"})
MERGE (g3:Gene {sid: "ENSG00000198793", symbol: "MTOR", name: "Mechanistic target of rapamycin kinase", source: "Ensembl"})
MERGE (g4:Gene {sid: "ENSG00000142208", symbol: "AKT1", name: "AKT serine/threonine kinase 1", source: "Ensembl"})
MERGE (g5:Gene {sid: "ENSG00000132170", symbol: "PPARG", name: "Peroxisome proliferator activated receptor gamma", source: "Ensembl"})
MERGE (g6:Gene {sid: "ENSG00000135218", symbol: "CD36", name: "CD36 molecule", source: "Ensembl"})
MERGE (g7:Gene {sid: "ENSG00000163631", symbol: "ALB", name: "Albumin", source: "Ensembl"})
MERGE (g8:Gene {sid: "ENSG00000169429", symbol: "CXCL8", name: "C-X-C motif chemokine ligand 8", source: "Ensembl"})

// MERGE IDs (alternative identifiers)
MERGE (id1:ID {sid: "5294", source: "NCBI"})
MERGE (id2:ID {sid: "3627", source: "NCBI"})
MERGE (id3:ID {sid: "2475", source: "NCBI"})
MERGE (id4:ID {sid: "207", source: "NCBI"})
MERGE (id5:ID {sid: "5468", source: "NCBI"})

// MERGE Proteins
MERGE (pr1:Protein {sid: "P48736", source: "UniProt", name: "PIK3CG"})
MERGE (pr2:Protein {sid: "P02778", source: "UniProt", name: "CXCL10"})
MERGE (pr3:Protein {sid: "P42345", source: "UniProt", name: "MTOR"})
MERGE (pr4:Protein {sid: "P31749", source: "UniProt", name: "AKT1"})
MERGE (pr5:Protein {sid: "P37231", source: "UniProt", name: "PPARG"})
MERGE (pr6:Protein {sid: "P16671", source: "UniProt", name: "CD36"})
MERGE (pr7:Protein {sid: "P02768", source: "UniProt", name: "ALB"})

// MERGE GO terms
MERGE (go1:GO {sid: "GO:0005158", name: "insulin receptor binding"})
MERGE (go2:GO {sid: "GO:0006954", name: "inflammatory response"})
MERGE (go4:GO {sid: "GO:0008286", name: "insulin receptor signaling pathway"})
MERGE (go5:GO {sid: "GO:0006629", name: "lipid metabolic process"})
MERGE (go6:GO {sid: "GO:0006955", name: "immune response"});

// ============================================
// RELATIONSHIPS: PROJECT -> SAMPLE
// ============================================

WITH ["SAMPLE001", "SAMPLE002", "SAMPLE005"] AS samples
MATCH (p:Project {sid: "PROJ001"}), (s:Sample)
WHERE s.sid IN samples
MERGE (p)-[:HAS_SAMPLE]->(s);

WITH ["SAMPLE003", "SAMPLE004"] AS samples
MATCH (p:Project {sid: "PROJ002"}), (s:Sample)
WHERE s.sid IN samples
MERGE (p)-[:HAS_SAMPLE]->(s);

WITH ["SAMPLE006", "SAMPLE007"] AS samples
MATCH (p:Project {sid: "PROJ003"}), (s:Sample)
WHERE s.sid IN samples
MERGE (p)-[:HAS_SAMPLE]->(s);


// ============================================
// RELATIONSHIPS: SAMPLE -> TISSUE
// ============================================

WITH ["SAMPLE001", "SAMPLE002", "SAMPLE005"] AS samples
MATCH (s:Sample), (t:Tissue {sid: "UBERON:0002107"})
WHERE s.sid IN samples
MERGE (s)-[:TAKEN_FROM]->(t);

WITH ["SAMPLE003", "SAMPLE004"] AS samples
MATCH (s:Sample), (t:Tissue {sid: "UBERON:0001264"})
WHERE s.sid IN samples
MERGE (s)-[:TAKEN_FROM]->(t);

WITH ["SAMPLE006", "SAMPLE007"] AS samples
MATCH (s:Sample), (t:Tissue {sid: "UBERON:0000945"})
WHERE s.sid IN samples
MERGE (s)-[:TAKEN_FROM]->(t);


// ============================================
// RELATIONSHIPS: SAMPLE -> PHENOTYPE
// ============================================

MATCH (s:Sample {sid: "SAMPLE001"}), (ph:EFO {sid: "EFO:0001421"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);
MATCH (s:Sample {sid: "SAMPLE001"}), (ph:EFO {sid: "EFO:0004465"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);

MATCH (s:Sample {sid: "SAMPLE003"}), (ph:EFO {sid: "EFO:0004220"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);

MATCH (s:Sample {sid: "SAMPLE005"}), (ph:EFO {sid: "EFO:0004465"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);

MATCH (s:Sample {sid: "SAMPLE006"}), (ph:EFO {sid: "EFO:0000685"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);
MATCH (s:Sample {sid: "SAMPLE006"}), (ph:EFO {sid: "EFO:0004220"})
MERGE (s)-[:HAS_PHENOTYPE]->(ph);

// ============================================
// RELATIONSHIPS: SAMPLE -> EXPERIMENT
// ============================================

MATCH (s:Sample {sid: "SAMPLE001"}), (e:Experiment {sid: "EXP001"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);
MATCH (s:Sample {sid: "SAMPLE002"}), (e:Experiment {sid: "EXP001"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);
MATCH (s:Sample {sid: "SAMPLE005"}), (e:Experiment {sid: "EXP001"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);

MATCH (s:Sample {sid: "SAMPLE003"}), (e:Experiment {sid: "EXP002"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);
MATCH (s:Sample {sid: "SAMPLE004"}), (e:Experiment {sid: "EXP002"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);

MATCH (s:Sample {sid: "SAMPLE006"}), (e:Experiment {sid: "EXP003"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);
MATCH (s:Sample {sid: "SAMPLE007"}), (e:Experiment {sid: "EXP003"})
MERGE (s)-[:HAS_EXPERIMENT]->(e);

// ============================================
// RELATIONSHIPS: COMPARISON -> EXPERIMENT
// ============================================

// Basic comparisons to experiments
MATCH (c:Comparison {sid: "COMP001"}), (e:Experiment {sid: "EXP001"})
MERGE (c)-[:COMPARES]->(e);

MATCH (c:Comparison {sid: "COMP002"}), (e:Experiment {sid: "EXP002"})
MERGE (c)-[:COMPARES]->(e);

MATCH (c:Comparison {sid: "COMP003"}), (e:Experiment {sid: "EXP003"})
MERGE (c)-[:COMPARES]->(e);

// Cross-tissue comparison
MATCH (c:Comparison {sid: "COMP004"}), (e:Experiment {sid: "EXP001"})
MERGE (c)-[:COMPARES]->(e);
MATCH (c:Comparison {sid: "COMP004"}), (e:Experiment {sid: "EXP002"})
MERGE (c)-[:COMPARES]->(e);

// Meta-analysis comparison (all experiments)
MATCH (c:Comparison {sid: "COMP008"}), (e:Experiment {sid: "EXP001"})
MERGE (c)-[:COMPARES]->(e);
MATCH (c:Comparison {sid: "COMP008"}), (e:Experiment {sid: "EXP002"})
MERGE (c)-[:COMPARES]->(e);
MATCH (c:Comparison {sid: "COMP008"}), (e:Experiment {sid: "EXP003"})
MERGE (c)-[:COMPARES]->(e);

// ============================================
// RELATIONSHIPS: COMPARISON -> SAMPLE (Direct)
// ============================================

// COMP001: NAFLD vs Control samples
MATCH (c:Comparison {sid: "COMP001"}), (s:Sample {sid: "SAMPLE001"})
MERGE (c)-[:INCLUDES_CASE]->(s);
MATCH (c:Comparison {sid: "COMP001"}), (s:Sample {sid: "SAMPLE005"})
MERGE (c)-[:INCLUDES_CASE]->(s);
MATCH (c:Comparison {sid: "COMP001"}), (s:Sample {sid: "SAMPLE002"})
MERGE (c)-[:INCLUDES_CONTROL]->(s);

// COMP002: T2D vs Control samples
MATCH (c:Comparison {sid: "COMP002"}), (s:Sample {sid: "SAMPLE003"})
MERGE (c)-[:INCLUDES_CASE]->(s);
MATCH (c:Comparison {sid: "COMP002"}), (s:Sample {sid: "SAMPLE004"})
MERGE (c)-[:INCLUDES_CONTROL]->(s);

// COMP003: MetSyn vs Control samples
MATCH (c:Comparison {sid: "COMP003"}), (s:Sample {sid: "SAMPLE006"})
MERGE (c)-[:INCLUDES_CASE]->(s);
MATCH (c:Comparison {sid: "COMP003"}), (s:Sample {sid: "SAMPLE007"})
MERGE (c)-[:INCLUDES_CONTROL]->(s);

// COMP006: Phenotype-stratified (Insulin Resistant)
MATCH (c:Comparison {sid: "COMP006"}), (s:Sample)-[:HAS_PHENOTYPE]->(ph:EFO {sid: "EFO:0004220"})
MERGE (c)-[:INCLUDES_CASE]->(s);
MATCH (c:Comparison {sid: "COMP006"}), (s:Sample)
WHERE NOT (s)-[:HAS_PHENOTYPE]->(:EFO {sid: "EFO:0004220"})
  AND s.condition = "Healthy"
MERGE (c)-[:INCLUDES_CONTROL]->(s);

// ============================================
// RELATIONSHIPS: COMPARISON -> DISEASE
// ============================================

MATCH (c:Comparison {sid: "COMP001"}), (d:Disease {sid: "MONDO:0005359"})
MERGE (c)-[:STUDIES_DISEASE]->(d);

MATCH (c:Comparison {sid: "COMP002"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (c)-[:STUDIES_DISEASE]->(d);

MATCH (c:Comparison {sid: "COMP003"}), (d:Disease {sid: "MONDO:0011382"})
MERGE (c)-[:STUDIES_DISEASE]->(d);

MATCH (c:Comparison {sid: "COMP004"}), (d:Disease {sid: "MONDO:0005359"})
MERGE (c)-[:STUDIES_DISEASE]->(d);
MATCH (c:Comparison {sid: "COMP004"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (c)-[:STUDIES_DISEASE]->(d);

// Meta-analysis
MATCH (c:Comparison {sid: "COMP008"}), (d:Disease {sid: "MONDO:0005359"})
MERGE (c)-[:STUDIES_DISEASE]->(d);
MATCH (c:Comparison {sid: "COMP008"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (c)-[:STUDIES_DISEASE]->(d);
MATCH (c:Comparison {sid: "COMP008"}), (d:Disease {sid: "MONDO:0011382"})
MERGE (c)-[:STUDIES_DISEASE]->(d);

// ============================================
// RELATIONSHIPS: EXPERIMENT -> GENE (with expression)
// ============================================

// EXP001 - NAFLD signature
MATCH (e:Experiment {sid: "EXP001"}), (g:Gene {symbol: "PIK3CG"})
MERGE (e)-[:HAS_VALUE {foldChange: 2.5, pValue: 0.001, regulated: "up", baseMean: 1250.3}]->(g);

MATCH (e:Experiment {sid: "EXP001"}), (g:Gene {symbol: "CXCL10"})
MERGE (e)-[:HAS_VALUE {foldChange: 3.2, pValue: 0.0005, regulated: "up", baseMean: 890.5}]->(g);

MATCH (e:Experiment {sid: "EXP001"}), (g:Gene {symbol: "CD36"})
MERGE (e)-[:HAS_VALUE {foldChange: 2.8, pValue: 0.0008, regulated: "up", baseMean: 3200.1}]->(g);

MATCH (e:Experiment {sid: "EXP001"}), (g:Gene {symbol: "ALB"})
MERGE (e)-[:HAS_VALUE {foldChange: -1.5, pValue: 0.02, regulated: "down", baseMean: 45000.8}]->(g);

// EXP002 - T2D signature
MATCH (e:Experiment {sid: "EXP002"}), (g:Gene {symbol: "MTOR"})
MERGE (e)-[:HAS_VALUE {foldChange: -1.8, pValue: 0.01, regulated: "down", baseMean: 1580.2}]->(g);

MATCH (e:Experiment {sid: "EXP002"}), (g:Gene {symbol: "AKT1"})
MERGE (e)-[:HAS_VALUE {foldChange: 2.1, pValue: 0.002, regulated: "up", baseMean: 2100.4}]->(g);

MATCH (e:Experiment {sid: "EXP002"}), (g:Gene {symbol: "PPARG"})
MERGE (e)-[:HAS_VALUE {foldChange: -2.3, pValue: 0.0003, regulated: "down", baseMean: 980.6}]->(g);

MATCH (e:Experiment {sid: "EXP002"}), (g:Gene {symbol: "CXCL8"})
MERGE (e)-[:HAS_VALUE {foldChange: 2.9, pValue: 0.0006, regulated: "up", baseMean: 1340.2}]->(g);

// EXP003 - Metabolic Syndrome signature
MATCH (e:Experiment {sid: "EXP003"}), (g:Gene {symbol: "CD36"})
MERGE (e)-[:HAS_VALUE {foldChange: 3.5, pValue: 0.0001, regulated: "up", baseMean: 2890.7}]->(g);

MATCH (e:Experiment {sid: "EXP003"}), (g:Gene {symbol: "PIK3CG"})
MERGE (e)-[:HAS_VALUE {foldChange: 1.9, pValue: 0.004, regulated: "up", baseMean: 1100.3}]->(g);

// ============================================
// RELATIONSHIPS: COMPARISON -> GENE (DGE Results)
// ============================================

// COMP001 results
MATCH (c:Comparison {sid: "COMP001"}), (g:Gene {symbol: "PIK3CG"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: 2.5,
  pValue: 0.001,
  adjPValue: 0.015,
  regulated: "up",
  significance: "significant"
}]->(g);

MATCH (c:Comparison {sid: "COMP001"}), (g:Gene {symbol: "CXCL10"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: 3.2,
  pValue: 0.0005,
  adjPValue: 0.008,
  regulated: "up",
  significance: "significant"
}]->(g);

MATCH (c:Comparison {sid: "COMP001"}), (g:Gene {symbol: "CD36"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: 2.8,
  pValue: 0.0008,
  adjPValue: 0.012,
  regulated: "up",
  significance: "significant"
}]->(g);

// COMP002 results
MATCH (c:Comparison {sid: "COMP002"}), (g:Gene {symbol: "MTOR"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: -1.8,
  pValue: 0.01,
  adjPValue: 0.045,
  regulated: "down",
  significance: "significant"
}]->(g);

MATCH (c:Comparison {sid: "COMP002"}), (g:Gene {symbol: "AKT1"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: 2.1,
  pValue: 0.002,
  adjPValue: 0.018,
  regulated: "up",
  significance: "significant"
}]->(g);

// COMP008 - Meta-analysis (shared signatures)
MATCH (c:Comparison {sid: "COMP008"}), (g:Gene {symbol: "PIK3CG"})
MERGE (c)-[:FOUND_DIFFERENTIAL {
  logFC: 2.2,
  pValue: 0.0001,
  adjPValue: 0.005,
  regulated: "up",
  significance: "significant",
  note: "Shared across NAFLD and MetSyn"
}]->(g);

// ============================================
// RELATIONSHIPS: GENE -> PROTEIN
// ============================================

MATCH (g:Gene {symbol: "PIK3CG"}), (p:Protein {sid: "P48736"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "CXCL10"}), (p:Protein {sid: "P02778"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "MTOR"}), (p:Protein {sid: "P42345"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "AKT1"}), (p:Protein {sid: "P31749"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "PPARG"}), (p:Protein {sid: "P37231"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "CD36"}), (p:Protein {sid: "P16671"})
MERGE (g)-[:CODES]->(p);
MATCH (g:Gene {symbol: "ALB"}), (p:Protein {sid: "P02768"})
MERGE (g)-[:CODES]->(p);

// ============================================
// RELATIONSHIPS: GENE -> ID
// ============================================

MATCH (g:Gene {symbol: "PIK3CG"}), (id:ID {sid: "5294"})
MERGE (g)-[:MAPPED]->(id);
MATCH (g:Gene {symbol: "CXCL10"}), (id:ID {sid: "3627"})
MERGE (g)-[:MAPPED]->(id);
MATCH (g:Gene {symbol: "MTOR"}), (id:ID {sid: "2475"})
MERGE (g)-[:MAPPED]->(id);
MATCH (g:Gene {symbol: "AKT1"}), (id:ID {sid: "207"})
MERGE (g)-[:MAPPED]->(id);
MATCH (g:Gene {symbol: "PPARG"}), (id:ID {sid: "5468"})
MERGE (g)-[:MAPPED]->(id);

// ============================================
// RELATIONSHIPS: GENE -> DISEASE
// ============================================

MATCH (g:Gene {symbol: "PIK3CG"}), (d:Disease {sid: "MONDO:0005359"})
MERGE (g)-[:RELATED_TO {source: "DisGeNET", score: 0.75}]->(d);

MATCH (g:Gene {symbol: "PPARG"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (g)-[:RELATED_TO {source: "OpenTargets", score: 0.85}]->(d);

MATCH (g:Gene {symbol: "MTOR"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (g)-[:RELATED_TO {source: "DisGeNET", score: 0.68}]->(d);

MATCH (g:Gene {symbol: "CD36"}), (d:Disease {sid: "MONDO:0011382"})
MERGE (g)-[:RELATED_TO {source: "OpenTargets", score: 0.72}]->(d);

MATCH (g:Gene {symbol: "AKT1"}), (d:Disease {sid: "MONDO:0005015"})
MERGE (g)-[:RELATED_TO {source: "DisGeNET", score: 0.81}]->(d);

// ============================================
// RELATIONSHIPS: PROTEIN -> GO
// ============================================

MATCH (p:Protein {sid: "P48736"}), (go:GO {sid: "GO:0008286"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P02778"}), (go:GO {sid: "GO:0006954"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P42345"}), (go:GO {sid: "GO:0008286"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P31749"}), (go:GO {sid: "GO:0008286"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P37231"}), (go:GO {sid: "GO:0005158"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P16671"}), (go:GO {sid: "GO:0006629"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

MATCH (p:Protein {sid: "P02768"}), (go:GO {sid: "GO:0006955"})
MERGE (p)-[:ASSOCIATED_WITH {source: "GO"}]->(go);

// ============================================
// RELATIONSHIPS: PROTEIN-PROTEIN INTERACTIONS
// ============================================

MATCH (p1:Protein {sid: "P48736"}), (p2:Protein {sid: "P31749"})
MERGE (p1)-[:INTERACTS_WITH {source: "STRING", score: 0.9}]->(p2);

MATCH (p1:Protein {sid: "P31749"}), (p2:Protein {sid: "P42345"})
MERGE (p1)-[:INTERACTS_WITH {source: "STRING", score: 0.95}]->(p2);

MATCH (p1:Protein {sid: "P31749"}), (p2:Protein {sid: "P37231"})
MERGE (p1)-[:INTERACTS_WITH {source: "STRING", score: 0.78}]->(p2);

MATCH (p1:Protein {sid: "P16671"}), (p2:Protein {sid: "P48736"})
MERGE (p1)-[:INTERACTS_WITH {source: "STRING", score: 0.65}]->(p2);
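
After running the statements, a quick sanity check is to count nodes per label. This is only a sketch and assumes the statements were run against an otherwise empty database, in which case the demo data should yield, for example, 8 Gene nodes and 7 Sample nodes.

```cypher
// Count nodes per label to confirm the demo data loaded as expected.
MATCH (n)
UNWIND labels(n) AS label
RETURN label AS Label, count(*) AS Nodes
ORDER BY Label;
```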

Cypher Queries

These example Cypher queries demonstrate how to retrieve and analyze single-omics experimental data stored in the graph database.

Return experiment data

After you have stored your experiment data, it’s important to be able to query and retrieve relevant information.

Find all genes differentially expressed in a given comparison

In this query we take a particular Comparison node (identified by its sid property) and find all associated Gene nodes that are differentially expressed (because they have the FOUND_DIFFERENTIAL relationship) for that comparison. We filter the results based on an adjusted p-value threshold (e.g., < 0.05) to identify statistically significant genes. The query returns the gene symbol, log fold change, p-value, adjusted p-value, and regulation direction (up or down).

// Find all genes differentially expressed in the NAFLD vs Control comparison
WITH
  "COMP001" AS sid,
  0.05 AS adjPValueThreshold
MATCH (comp:Comparison {sid: sid})-[r:FOUND_DIFFERENTIAL]->(gene:Gene)
WHERE r.adjPValue < adjPValueThreshold
RETURN
  gene.symbol AS Gene,
  r.logFC AS LogFoldChange,
  r.pValue AS PValue,
  r.adjPValue AS AdjustedPValue,
  r.regulated AS Direction
ORDER BY r.adjPValue;

Find all genes overexpressed in a given comparison

Taking a particular Comparison node (identified by its sid property) we use filters (adjPValue < 0.05 and regulated = "up") to find all associated Gene nodes that are overexpressed and where that overexpression is statistically significant.

// Find all genes overexpressed in the NAFLD vs Control comparison
WITH
  "COMP001" AS sid,
  0.05 AS adjPValueThreshold,
  "up" AS regulationDirection
MATCH (comp:Comparison {sid: sid})-[r:FOUND_DIFFERENTIAL]->(gene:Gene)
WHERE r.adjPValue < adjPValueThreshold AND r.regulated = regulationDirection
RETURN
  gene.symbol AS Gene,
  r.logFC AS LogFoldChange,
  r.pValue AS PValue,
  r.adjPValue AS AdjustedPValue,
  r.regulated AS Direction
ORDER BY r.adjPValue;
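
The scenario mentions storing the top up- or downregulated genes. A possible variant of the query above ranks significant genes by the magnitude of their log fold change in either direction and keeps the strongest ones. The limit of 10 is an arbitrary illustrative choice; it is written as a literal because `LIMIT` cannot refer to variables.

```cypher
// Rank significant genes in a comparison by absolute log fold change.
WITH
  "COMP001" AS sid,
  0.05 AS adjPValueThreshold
MATCH (comp:Comparison {sid: sid})-[r:FOUND_DIFFERENTIAL]->(gene:Gene)
WHERE r.adjPValue < adjPValueThreshold
RETURN
  gene.symbol AS Gene,
  r.logFC AS LogFoldChange,
  r.adjPValue AS AdjustedPValue,
  r.regulated AS Direction
ORDER BY abs(r.logFC) DESC
LIMIT 10;
```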

Aggregate experimental data for disease

A disease-centric view can be useful for understanding the molecular signatures associated with specific conditions across multiple experiments.

Find all genes that are overexpressed in multiple comparisons

For this query, we’re starting with the Gene nodes and looking for those that have been found to be overexpressed (i.e., regulated = "up") in multiple Comparison nodes. We count the number of distinct comparisons for each gene where it is overexpressed and filter to only include genes that are overexpressed in at least two comparisons. The results include the gene symbol, the count of times it was overexpressed, and the names of the comparisons where it was overexpressed.

// Find all genes overexpressed in at least two comparisons
WITH
  0.05 AS adjPValueThreshold,
  "up" AS regulationDirection
MATCH (gene:Gene)<-[r:FOUND_DIFFERENTIAL]-(comp:Comparison)
WHERE r.regulated = regulationDirection AND r.adjPValue < adjPValueThreshold
// DISTINCT in both aggregates keeps the count and the list of names consistent
WITH gene, count(DISTINCT comp) AS compCount, collect(DISTINCT comp.name) AS comparisons
WHERE compCount >= 2
RETURN
  gene.symbol AS Gene,
  compCount AS TimesOverexpressed,
  comparisons AS OverexpressedInComparisons
ORDER BY compCount DESC;
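
The same idea can be applied one level down, to the per-experiment measurements stored on HAS_VALUE relationships. The following sketch, which assumes the demo data's `regulated` property on HAS_VALUE, looks for genes measured in at least two experiments that change in the same direction in all of them.

```cypher
// Genes with a consistent direction of change across two or more experiments.
MATCH (e:Experiment)-[v:HAS_VALUE]->(g:Gene)
WITH
  g,
  count(DISTINCT e) AS experiments,
  collect(DISTINCT v.regulated) AS directions
// A single distinct direction means the gene never flips between experiments
WHERE experiments >= 2 AND size(directions) = 1
RETURN
  g.symbol AS Gene,
  experiments AS Experiments,
  directions[0] AS Direction
ORDER BY Experiments DESC, Gene;
```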

Compare gene signatures between two diseases

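Here we look at how gene expression signatures compare between two diseases. We match Gene nodes that were found to be differentially expressed both in a comparison studying the first disease and in a comparison studying the second, and return the fold change and direction observed for each.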

// Compare gene signatures between two diseases
WITH "significant" AS significanceLevel
MATCH
  (gene:Gene)<-[r1:FOUND_DIFFERENTIAL]-(comp1:Comparison)-[:STUDIES_DISEASE]->(d1:Disease {name: "Non-alcoholic fatty liver disease"}),
  (gene)<-[r2:FOUND_DIFFERENTIAL]-(comp2:Comparison)-[:STUDIES_DISEASE]->(d2:Disease {name: "Type 2 Diabetes"})
WHERE
  r1.significance = significanceLevel
  AND r2.significance = significanceLevel
RETURN
  gene.symbol AS SharedGene,
  r1.logFC AS NAFLD_LogFC,
  r2.logFC AS T2D_LogFC,
  r1.regulated AS NAFLD_Direction,
  r2.regulated AS T2D_Direction;

This can be generalized to every pair of diseases that share at least one significant gene, without naming the diseases in advance.

// Compare gene signatures between every pair of diseases
WITH
  "significant" AS significanceLevel
MATCH
  (gene:Gene)<-[r1:FOUND_DIFFERENTIAL]-(comp1:Comparison)-[:STUDIES_DISEASE]->(d1:Disease),
  (gene)<-[r2:FOUND_DIFFERENTIAL]-(comp2:Comparison)-[:STUDIES_DISEASE]->(d2:Disease)
WHERE
  r1.significance = significanceLevel
  AND r2.significance = significanceLevel
  AND d1.name < d2.name // report each disease pair once
RETURN
  gene.symbol AS SharedGene,
  d1.name AS Disease1,
  r1.logFC AS Disease1_LogFC,
  r1.regulated AS Disease1_Direction,
  d2.name AS Disease2,
  r2.logFC AS Disease2_LogFC,
  r2.regulated AS Disease2_Direction;
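
A related disease-centric check is whether a differentially expressed gene already has a prior-knowledge link to the disease its comparison studies. The sketch below assumes the demo data's RELATED_TO relationships (with `source` and `score` properties) and returns null in the last two columns where no such link exists, which can help separate confirmatory findings from potentially novel ones.

```cypher
// Flag significant genes that already have a known link to the studied disease.
WITH 0.05 AS adjPValueThreshold
MATCH
  (comp:Comparison)-[r:FOUND_DIFFERENTIAL]->(gene:Gene),
  (comp)-[:STUDIES_DISEASE]->(d:Disease)
WHERE r.adjPValue < adjPValueThreshold
// Optional: genes without a stored association are kept with null evidence
OPTIONAL MATCH (gene)-[known:RELATED_TO]->(d)
RETURN
  comp.sid AS Comparison,
  d.name AS Disease,
  gene.symbol AS Gene,
  r.logFC AS LogFoldChange,
  known.source AS KnownAssociationSource,
  known.score AS KnownAssociationScore
ORDER BY Comparison, Disease, Gene;
```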

Collect information for interesting gene(s)

These queries are more gene-centric, focusing on retrieving various types of information related to a specific gene (or genes) of interest.

Return all synonyms for a given gene

Genes are not named consistently across different databases and resources. To help integrate data from multiple sources, it’s useful to retrieve all known synonyms for a given gene. Storing the synonyms also allows us to map experimental results that may use different names back to the gene name we’re using.

// Find all synonyms for the gene PIK3CG
WITH "PIK3CG" AS geneSymbol
MATCH (g:Gene {symbol: geneSymbol})-[:MAPPED]->(syn:ID)
RETURN syn.sid AS ID, syn.source AS Source
// ID nodes have no name property, so sort on the returned columns instead
ORDER BY Source, ID;

Show interesting result for a given gene

Interesting results are ones that have a differential expression associated with a disease of interest. Here we start with a specific gene and find all comparisons where that gene was found to be differentially expressed, along with the associated disease and relevant statistics.

// Show interesting result for a given gene
WITH "PIK3CG" AS geneSymbol
// Bind the relationship to r so its statistics can be returned
MATCH (g:Gene {symbol: geneSymbol})<-[r:FOUND_DIFFERENTIAL]-(c:Comparison)-[:STUDIES_DISEASE]->(d:Disease)
RETURN g.symbol AS Gene,
       d.name AS Disease,
       r.logFC AS LogFoldChange,
       r.pValue AS PValue,
       r.adjPValue AS AdjustedPValue,
       r.regulated AS Direction;

Show interesting results for a specific tissue type

Starting with a particular tissue type (e.g., liver), we find all samples taken from that tissue and the associated experiments. From there we traverse to the genes that were measured in those experiments, returning the statistics stored on each HAS_VALUE relationship.

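// Show interesting results for a specific tissue type
// Tissue names are matched case-sensitively; the demo data stores "Liver"
WITH "Liver" AS tissueName
MATCH (t:Tissue {name: tissueName})<-[:TAKEN_FROM]-(s:Sample)-[:HAS_EXPERIMENT]->(e:Experiment)-[r:HAS_VALUE]->(g:Gene)
// DISTINCT collapses duplicate rows produced when several samples share an experiment
RETURN DISTINCT
  g.symbol AS Gene,
  t.name AS Tissue,
  e.sid AS Experiment,
  r.foldChange AS FoldChange,
  r.pValue AS PValue,
  r.baseMean AS BaseMean,
  r.regulated AS Direction;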
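
The same traversal can start from a phenotype instead of a tissue. The following sketch assumes the demo data's EFO phenotype nodes and collects, for each gene, the tissues and experiments in which it was measured for samples annotated with that phenotype. Because HAS_VALUE is stored per experiment rather than per sample, the result reflects experiments that include at least one such sample, not a phenotype-specific measurement.

```cypher
// Genes measured in experiments that include samples with a given phenotype.
WITH "Insulin resistance" AS phenotypeName
MATCH (ph:EFO {name: phenotypeName})<-[:HAS_PHENOTYPE]-(s:Sample)-[:HAS_EXPERIMENT]->(e:Experiment)-[v:HAS_VALUE]->(g:Gene)
OPTIONAL MATCH (s)-[:TAKEN_FROM]->(t:Tissue)
RETURN
  g.symbol AS Gene,
  collect(DISTINCT t.name) AS Tissues,
  collect(DISTINCT e.sid) AS Experiments,
  collect(DISTINCT v.regulated) AS Directions
ORDER BY Gene;
```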

Integrate across genes and proteins

Our queries can integrate across genes and proteins, leveraging relationships such as gene-protein coding and protein-protein interactions.

Find the network neighborhood of a candidate gene target

We start with a candidate gene of interest (e.g., "PIK3CG") and traverse from the Gene node to the associated Protein node. From there we explore the protein-protein interaction network by following INTERACTS_WITH relationships to neighboring proteins. Finally, we return the gene symbols of the target gene and its neighboring genes, along with the interaction score between the proteins.

// Find the network neighborhood of a candidate gene target
WITH "PIK3CG" AS geneSymbol
MATCH (g:Gene {symbol: geneSymbol})-[:CODES]->(p:Protein)-[i:INTERACTS_WITH]-(neighbor:Protein)<-[:CODES]-(ng:Gene)
RETURN
  g.symbol AS TargetGene,
  ng.symbol AS NeighborGene,
  neighbor.sid AS NeighborProtein,
  i.score AS InteractionScore
ORDER BY i.score DESC;
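
To look beyond direct interaction partners, the neighborhood can be widened with a variable-length pattern. This sketch, which assumes the demo data's INTERACTS_WITH relationships, returns genes whose proteins lie within two interaction steps of the target and reports the shortest distance found. The two-hop bound is an illustrative choice; larger bounds can become expensive on dense interaction networks.

```cypher
// Genes whose proteins are within two interaction steps of the target protein.
WITH "PIK3CG" AS geneSymbol
MATCH (g:Gene {symbol: geneSymbol})-[:CODES]->(p:Protein)
MATCH path = (p)-[:INTERACTS_WITH*1..2]-(neighbor:Protein)
WHERE neighbor <> p
MATCH (neighbor)<-[:CODES]-(ng:Gene)
// A neighbor may be reachable by several paths; keep the shortest
RETURN
  ng.symbol AS NeighborGene,
  min(length(path)) AS Hops
ORDER BY Hops, NeighborGene;
```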

Trace network neighborhood of regulated genes from comparison to protein interactions

In this query we start from a Comparison node (identified by its sid property) and find all Gene nodes that are differentially expressed in that comparison. From each of these genes, we traverse to the corresponding Protein nodes and then explore their interactions with other proteins in the network. Finally, we return pairs of genes whose proteins interact, along with the interaction score.

// Trace network neighborhood of regulated genes from comparison to protein interactions
WITH "COMP001" AS comparisonSid
MATCH
  (comp:Comparison {sid: comparisonSid})-[:FOUND_DIFFERENTIAL]->(g1:Gene),
  (g1)-[:CODES]->(p1:Protein)-[i:INTERACTS_WITH]->(p2:Protein),
  (p2)<-[:CODES]-(g2:Gene)
RETURN
  g1.symbol AS Gene1,
  g2.symbol AS Gene2,
  i.score AS InteractionScore;

Find enriched pathways for regulated genes in a disease

Pathway enrichment analysis can help identify biological processes that are overrepresented among the differentially expressed genes associated with a disease. The query below is a simple first step towards that rather than a statistical enrichment test: it starts with a specific disease of interest (e.g., "Non-alcoholic fatty liver disease"), finds all genes differentially expressed in comparisons studying that disease, and returns the associated GO terms along with the count of genes annotated to each term.

// Count GO terms among genes differentially expressed in comparisons studying NAFLD
WITH "Non-alcoholic fatty liver disease" AS diseaseName
MATCH
  (d:Disease {name: diseaseName})<-[:STUDIES_DISEASE]-(comp:Comparison)-[:FOUND_DIFFERENTIAL]->(gene:Gene),
  (gene)-[:CODES]->(protein:Protein)-[:ASSOCIATED_WITH]->(go:GO)
WITH go, count(DISTINCT gene) AS geneCount
//WHERE geneCount >= 2 // uncomment to filter for pathways with at least 2 genes
RETURN
  go.name AS Pathway,
  go.sid AS GO_ID,
  geneCount AS GenesInPathway
ORDER BY geneCount DESC;