RAG - Retrieval Augmented Generation Procedures

Query with Retrieval-augmented generation (RAG) technique

This procedure apoc.ml.rag takes a list of paths or a vector index name, relevant attributes and a natural language question to create a prompt implementing a Retrieval-augmented generation (RAG) technique.

See here for more info about the RAG process.

It uses the chat/completions API which is documented here.

Table 1. Input Parameters
name description mandatory

paths

the list of paths to retrieve and augment the prompt, it can also be a matching query or a vector index name

yes

attributes

the relevant attributes useful to retrieve and augment the prompt

yes

question

the user question

yes

conf

An optional configuration map, please check the next section

no

Table 2. Configuration map
name description mandatory

getLabelTypes

add the label / rel-type names to the info to augment the prompt

no, default true

embeddings

to search similar embeddings stored into a node vector index (in case of embeddings: "NODE") or relationship vector index (in case of embeddings: "REL")

no, default "FALSE"

topK

number of neighbors to find for each node (in case of embeddings: "NODE") or relationships (in case of embeddings: "REL")

no, default 40

apiKey

OpenAI API key

in case apoc.openai.key is not defined

prompt

the base prompt to be augmented with the context

no, default is:

"You are a customer service agent that helps a customer with answering questions about a service. Use the following context to answer the user question at the end. Make sure not to make any changes to the context if possible when prepare answers to provide accurate responses. If you don’t know the answer, just say `Sorry, I don’t know`, don’t try to make up an answer."

Using the apoc.ml.rag procedure we can reduce AI hallucinations (i.e. false or misleading responses), providing relevant and up-to-date information to our procedure via the 1st parameter.

For example, by executing the following procedure (with the gpt-3.5-turbo model, last updated in January 2022) we have a hallucination

Query call
CALL apoc.ml.openai.chat([
    {role:"user", content: "Which athletes won the gold medal in mixed doubles's curling  at the 2022 Winter Olympics?"}
], $apiKey)
Table 3. Example response
value

The gold medal in curling at the 2022 Winter Olympics was won by the Swedish men’s team and the Russian women’s team.

So, we can use the RAG technique to provide real results. For example with the given dataset (with data taken from this wikipedia page):

wikipedia dataset
CREATE (mixed2022:Discipline {title:"Mixed doubles's curling", year: 2022})
WITH mixed2022
CREATE (:Athlete {name: 'Stefania Constantini', country: 'Italy', irrelevant: 'asdasd'})-[:HAS_MEDAL {medal: 'Gold', irrelevant2: 'asdasd'}]->(mixed2022)
CREATE (:Athlete {name: 'Amos Mosaner', country: 'Italy', irrelevant: 'qweqwe'})-[:HAS_MEDAL {medal: 'Gold', irrelevant2: 'rwerew'}]->(mixed2022)
CREATE (:Athlete {name: 'Kristin Skaslien', country: 'Norway', irrelevant: 'dfgdfg'})-[:HAS_MEDAL {medal: 'Silver', irrelevant2: 'gdfg'}]->(mixed2022)
CREATE (:Athlete {name: 'Magnus Nedregotten', country: 'Norway', irrelevant: 'xcvxcv'})-[:HAS_MEDAL {medal: 'Silver', irrelevant2: 'asdasd'}]->(mixed2022)
CREATE (:Athlete {name: 'Almida de Val', country: 'Sweden', irrelevant: 'rtyrty'})-[:HAS_MEDAL {medal: 'Bronze', irrelevant2: 'bfbfb'}]->(mixed2022)
CREATE (:Athlete {name: 'Oskar Eriksson', country: 'Sweden', irrelevant: 'qwresdc'})-[:HAS_MEDAL {medal: 'Bronze', irrelevant2: 'juju'}]->(mixed2022)

we can execute:

Query call
MATCH path=(:Athlete)-[:HAS_MEDAL]->(Discipline)
WITH collect(path) AS paths
CALL apoc.ml.rag(paths,
  ["name", "country", "medal", "title", "year"],
  "Which athletes won the gold medal in mixed doubles's curling  at the 2022 Winter Olympics?",
  {apiKey: $apiKey}
) YIELD value
RETURN value
Table 4. Example response
value

The gold medal in curling at the 2022 Winter Olympics was won by Stefania Constantini and Amos Mosaner from Italy.

or:

Query call
MATCH path=(:Athlete)-[:HAS_MEDAL]->(Discipline)
WITH collect(path) AS paths
CALL apoc.ml.rag(paths,
  ["name", "country", "medal", "title", "year"],
  "Which athletes won the silver medal in mixed doubles's curling  at the 2022 Winter Olympics?",
  {apiKey: $apiKey}
) YIELD value
RETURN value
Table 5. Example response
value

The gold medal in curling at the 2022 Winter Olympics was won by Kristin Skaslien and Magnus Nedregotten from Norway.

We can also pass a string query returning paths/relationships/nodes, for example:

CALL apoc.ml.rag("MATCH path=(:Athlete)-[:HAS_MEDAL]->(Discipline) WITH collect(path) AS paths",
  ["name", "country", "medal", "title", "year"],
  "Which athletes won the gold medal in mixed doubles's curling  at the 2022 Winter Olympics?",
  {apiKey: $apiKey}
) YIELD value
RETURN value
Table 6. Example response
value

The gold medal in curling at the 2022 Winter Olympics was won by Stefania Constantini and Amos Mosaner from Italy.

or we can pass a vector index name as the 1st parameter, in case we stored useful info into embedding nodes. For example, given this node vector index:

CREATE VECTOR INDEX `rag-embeddings`
FOR (n:RagEmbedding) ON (n.embedding)
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}}

and some (:RagEmbedding) nodes with the text properties, we can execute:

CALL apoc.ml.rag("rag-embeddings",
  ["text"],
  "Which athletes won the gold medal in mixed doubles's curling  at the 2022 Winter Olympics?",
  {apiKey: $apiKey, embeddings: "NODE", topK: 20}
) YIELD value
RETURN value

or, with a relationship vector index:

CREATE VECTOR INDEX `rag-rel-embeddings`
FOR ()-[r:RAG_EMBEDDING]-() ON (r.embedding)
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}}

and some [:RagEmbedding] relationships with the text properties, we can execute:

CALL apoc.ml.rag("rag-rel-embeddings",
  ["text"],
  "Which athletes won the gold medal in mixed doubles's curling  at the 2022 Winter Olympics?",
  {apiKey: $apiKey, embeddings: "REL", topK: 20}
) YIELD value
RETURN value