Introducing the Neo4j Text2Cypher (2024) Dataset

Machine Learning Engineer, Neo4j

November 7, 2024


Authors: Makbule Gulcin Ozsoy, Leila Messallem, Jon Besga

We’re excited to share the Neo4j Text2Cypher (2024) Dataset with you. It’s designed to help train and benchmark Text2Cypher models with ease.

Figure 1. A user wants to write a Cypher query for “What are the movies of Tom Hanks?” A Text2Cypher model translates the natural language question into Cypher: MATCH (actor:Person {name: "Tom Hanks"})-[:ACTED_IN]->(movie:Movie) RETURN movie.title AS movies

A common use of machine learning is translating natural language into programming or domain-specific languages (DSLs). A great example of this is the Text2Cypher task, where plain language questions are converted into the Cypher query language (see Figure 1). This process can be powered by large language models (LLMs) or supervised fine-tuned models, both of which rely on datasets that pair natural language questions with their Cypher translations.

While several Text2Cypher datasets exist, many are prepared separately, making them hard to use together. That’s where the Neo4j Text2Cypher (2024) Dataset comes in! It brings together instances from publicly available datasets, cleaning and organizing them for smoother use.

Dataset Preparation

We followed a step-by-step approach to prepare the final dataset:

Identified and gathered publicly available datasets
Combined and cleaned the data
Created training and test splits

Step 1 — Exploration and Analysis of Input Datasets

We identified 25 publicly available datasets, including Neo4j resources (e.g., Neo4jLabs), HuggingFace datasets, and academic papers. We then chose 16 based on their license, accessibility, and data fields. The selected datasets:

Neo4jLabs datasets (text2cypher): Named with the suffix ‘neo4jLabs’
Neo4j Internal projects’ datasets: Named with the suffix ‘neo4j’
HuggingFace datasets (text_to_cypher, cypher, text-to-cypher): Named with the suffix ‘hf’
Academic papers — CySpider (SemanticParser4Graph): Text2Sql or Text2Sql2Cypher datasets, named with the suffix ‘cySpider’

Step 2 — Collection and Combination of Data

We combined the datasets into a single format, in which each row is reformatted to include the fields “question”, “schema”, “cypher”, “data_source”, “database_reference_alias”, and “instance_id”, as described in the table below.

Table. Fields in the dataset (the database_reference_alias field is described in Crowdsourcing Text2Cypher dataset).
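To make the unified row format concrete, here is a minimal sketch of one instance as a Python dataclass. The six field names come from the list above; the Python types, the class name Text2CypherInstance, the schema text, and the data_source and instance_id values are assumptions made for illustration, not values taken from the dataset.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Text2CypherInstance:
    """One row of the combined dataset (field names from the post, types assumed)."""
    question: str                            # natural language question
    schema: str                              # graph schema the query is written against
    cypher: str                              # reference Cypher translation
    data_source: str                         # name of the originating dataset
    database_reference_alias: Optional[str]  # None when no database is available
    instance_id: str                         # identifier of the row


example = Text2CypherInstance(
    question="What are the movies of Tom Hanks?",
    schema="(:Person {name})-[:ACTED_IN]->(:Movie {title})",  # illustrative schema text
    cypher=(
        'MATCH (actor:Person {name: "Tom Hanks"})-[:ACTED_IN]->(movie:Movie) '
        "RETURN movie.title AS movies"
    ),
    data_source="example_source",            # hypothetical value
    database_reference_alias=None,           # this example has no database attached
    instance_id="example_instance_0",        # hypothetical value
)

# A plain dict view, convenient for writing rows to JSON or CSV.
row = asdict(example)
```

The question and Cypher query are the pair from Figure 1; everything else in the example is a placeholder.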

The combined dataset is cleaned in two steps:

Manual updates: We manually check the queries, fix errors by removing or updating unwanted characters (e.g., backticks), and delete irrelevant entries (e.g., “Lorem ipsum …”). We also remove duplicates based on the fields [“question”, “cypher”].
Syntax validation: We check each Cypher query for syntax errors and remove any rows with invalid queries.
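The two cleaning steps can be sketched roughly as follows. This is not the pipeline used for the dataset: the manual review is reduced to two mechanical rules (strip leading and trailing backticks, drop rows containing “lorem ipsum”), and the syntax check is a crude bracket-balance stand-in for a real Cypher parser. The function names clean and has_balanced_brackets are hypothetical.

```python
def has_balanced_brackets(query):
    """Crude stand-in for real Cypher syntax validation (assumption).

    It only checks that (), [] and {} are balanced, and it does not
    account for brackets inside string literals.
    """
    closing_to_opening = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in query:
        if ch in "([{":
            stack.append(ch)
        elif ch in closing_to_opening:
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack


def clean(rows, is_valid=has_balanced_brackets):
    """Normalize rows, drop placeholders, deduplicate, and validate queries."""
    seen = set()
    cleaned = []
    for row in rows:
        question = row["question"].strip()
        # Remove backticks wrapped around the whole query.
        cypher = row["cypher"].strip().strip("`").strip()
        # Drop placeholder entries.
        if "lorem ipsum" in (question + " " + cypher).lower():
            continue
        # Deduplicate on the (question, cypher) pair, keeping the first copy.
        key = (question, cypher)
        if key in seen:
            continue
        seen.add(key)
        # Drop rows whose query fails the validity check.
        if not is_valid(cypher):
            continue
        cleaned.append({**row, "question": question, "cypher": cypher})
    return cleaned
```

In practice, is_valid would be replaced by a function that sends the query to a real Cypher parser or database and reports whether it is accepted.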

Step 3 — Splitting Data

The final step is preparing for the train/test split. We identified three dataset groups:

Train-specific datasets — Files with ‘train’ in the name, used for training
Test-specific datasets — Files with ‘test’ or ‘dev’ in the name, used for testing or validation
Remaining datasets — Files with no specified use

We assigned Train-specific datasets to the training split and Test-specific datasets to the test split. The remaining datasets were split 90:10 between training and testing. Each split was then shuffled so that a model does not overfit to the order of the instances or to runs of similar questions.
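The splitting rule above can be sketched like this. The file_name key, the function names, and the fixed random seed are assumptions for illustration; the post does not say how file names were stored or whether a seed was used.

```python
import random


def assign_split(file_name):
    """Return 'train', 'test', or None based on the source file name."""
    name = file_name.lower()
    if "train" in name:
        return "train"
    if "test" in name or "dev" in name:
        return "test"
    return None  # no specified use


def split_instances(rows, train_fraction=0.9, seed=42):
    """Split rows into (train, test) following the three dataset groups."""
    rng = random.Random(seed)  # seeded for reproducibility (assumption)
    train, test, remaining = [], [], []
    for row in rows:
        target = assign_split(row["file_name"])  # 'file_name' is a hypothetical key
        if target == "train":
            train.append(row)
        elif target == "test":
            test.append(row)
        else:
            remaining.append(row)

    # Split the rows with no specified use 90:10.
    rng.shuffle(remaining)
    cut = round(len(remaining) * train_fraction)
    train.extend(remaining[:cut])
    test.extend(remaining[cut:])

    # Shuffle each split so instances from one source are not contiguous.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

Checking for “train” before “test” or “dev” is a design choice in this sketch; a file name containing both would land in the training split.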

Dataset Summary

The data preparation resulted in 44,387 instances, with 39,554 instances in the training split and 4,833 instances in the test split. The train and test splits make up 89 percent and 11 percent of the overall data, respectively. Their distribution across data sources is similar, as shown below.

Figure 2. Train and test splits consist of around 89% and 11% of the overall data, respectively, with a similar distribution over the data sources.

Not every instance in the training and test sets has database access indicated by the ‘database_reference_alias’ field, but for those that do, the potential is huge! Here’s a brief overview of instances with database access to support your exploration:

Training Set
– Rows with database access: 22,093
– 55.85 percent of the training set
Test Set
– Rows with database access: 2,471
– 51.12 percent of the test set
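As a quick sanity check on these figures, the sketch below recomputes the shares from the reported counts and shows one way to select rows with database access. It assumes that rows without access carry an empty or missing database_reference_alias value; the helper names are hypothetical.

```python
def has_database_access(row):
    """True when the row names a database (assumes empty/None means no access)."""
    return bool(row.get("database_reference_alias"))


def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Counts reported in this post.
train_total, test_total = 39_554, 4_833
train_with_db, test_with_db = 22_093, 2_471
total = train_total + test_total

shares = {
    "train_of_total": share(train_total, total),
    "test_of_total": share(test_total, total),
    "train_with_db": share(train_with_db, train_total),
    "test_with_db": share(test_with_db, test_total),
}

for name, value in shares.items():
    print(f"{name}: {value:.2f}%")
```

Applied to a loaded split, a list comprehension such as `[row for row in split if has_database_access(row)]` would keep only the instances that can be run against a database.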

The databases available in the ‘database_reference_alias’ field are detailed in Crowdsourcing Text2Cypher dataset. Dive into the dedicated applications (Text2Cypher Crowdsourcing App and Neo4j Browser Demo) and see the data firsthand.

Summary

Machine learning is frequently used to translate natural language into programming or domain-specific languages, such as turning plain text into the Cypher query language. LLMs and supervised fine-tuned models rely on datasets that pair natural language questions with their Cypher translations.

Existing Text2Cypher datasets can be fragmented, but the new Neo4j Text2Cypher (2024) Dataset solves this by combining, cleaning, and organizing public data. It includes 44,387 instances: 39,554 for training and 4,833 for testing.

You can access it on HuggingFace. Stay tuned for insights on applying this dataset to various models in future blog posts!

Introducing the Neo4j Text2Cypher (2024) Dataset was originally published in Neo4j Developer Blog on Medium.