Improving Text2Cypher: Lessons from Schema Filtering and Hard Example Selection

Session Track: Data Intelligence

Session Time:

Session description

Large language models (LLMs) are changing how users interact with databases, enabling natural language interfaces that generate database queries on demand. Text2Cypher systems, for example, let users ask questions like "What are the movies of Tom Hanks?" and get the right Cypher query: `MATCH (actor:Person {name: "Tom Hanks"})-[:ACTED_IN]->(movie:Movie) RETURN movie.title AS movies`. Although model size and architecture matter, building accurate and efficient Text2Cypher systems also relies on careful data preparation, effective schema design, and curated training examples. In this session, we’ll share lessons and experimental insights from the Neo4j Text2Cypher project. We’ll cover two practical strategies:

- Schema pruning: Simplifying the model’s input by removing irrelevant parts of the database schema to reduce confusion and improve generation performance.
- Hard example selection: Choosing diverse and challenging training samples to improve model robustness while reducing costs.
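To make the schema-pruning idea concrete, here is a minimal sketch in Python. It assumes a toy schema dictionary and uses a simple keyword-plus-one-hop heuristic; the `prune_schema` function, the toy schema, and the matching rule are illustrative assumptions, not the actual Neo4j Text2Cypher implementation.

```python
# Sketch: prune a toy graph schema to the parts that look relevant to a question,
# so the LLM prompt is smaller and less distracting.

def prune_schema(schema: dict, question: str) -> dict:
    """Keep schema elements that appear relevant to the question, plus their direct neighbors."""
    q_tokens = {t.strip("?.,!'\"").lower() for t in question.split()}

    def matches(terms: set[str]) -> bool:
        # Crude relevance test: a schema term appears inside a question token or vice versa.
        return any(term in tok or tok in term for term in terms for tok in q_tokens)

    # 1. Seed with node labels whose label or properties overlap with the question.
    seeds = {
        label
        for label, props in schema["nodes"].items()
        if matches({label.lower(), *(p.lower() for p in props)})
    }

    # 2. Expand one hop: keep relationships touching a seed, and pull in their endpoints.
    kept_rels = [
        (src, rel, dst)
        for src, rel, dst in schema["relationships"]
        if src in seeds or dst in seeds
    ]
    kept_labels = seeds | {src for src, _, _ in kept_rels} | {dst for _, _, dst in kept_rels}

    return {
        "nodes": {label: props for label, props in schema["nodes"].items() if label in kept_labels},
        "relationships": kept_rels,
    }


toy_schema = {
    "nodes": {
        "Person": ["name", "born"],
        "Movie": ["title", "released"],
        "Genre": ["name"],
        "User": ["userId", "country"],
    },
    "relationships": [
        ("Person", "ACTED_IN", "Movie"),
        ("Movie", "IN_GENRE", "Genre"),
        ("User", "FOLLOWS", "User"),
    ],
}

print(prune_schema(toy_schema, "What are the movies of Tom Hanks?"))
# Keeps Person, Movie, and Genre but drops the unrelated User/FOLLOWS part of the schema.
```

In practice the relevance test could be anything from keyword matching to embedding similarity; the point is that the generator only sees the schema elements it plausibly needs.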

Speaker

Makbule Gulcin Ozsoy

Software Developer, Neo4j

Makbule Gulcin Ozsoy is a software developer and machine learning engineer, mainly working on recommender systems, ranking, and information retrieval.