Why Vector Search Didn’t Work for Your RAG Solution?


Exploring the Shortcomings of Text Embedding Retrieval for LLM Generation


Loch Awe in Scotland, photo by author.

Abstract

External knowledge is the key to resolving LLM problems such as hallucination and outdated knowledge, and through retrieval-augmented generation (RAG) it can make LLMs generate more accurate and reliable responses. However, LLMs do not always respond as expected with RAG. By analyzing real-world examples, this article demonstrates several categories of concern and instances where these limitations manifest, leading to inaccuracies in LLM-generated content.

Background

The Movie Graph

To make this easy to understand for beginners in machine learning and Large Language Models, I will explore the concepts using the Movie graph stored in Neo4j.

The blog post below gives more complete guidance on the Movie graph and on how to generate and store text embeddings for nodes and patterns, which are then used for answer generation by searching for the texts most similar to a given question; a minimal sketch of the embedding step follows the link.

Future of Knowledge Graph: Will Structured and Semantic Search Come into One?
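For orientation, here is a minimal sketch of that embedding step. The label Text, the property names text and embedding, and the index name text_embeddings are placeholders of my own, not necessarily the schema used in the linked post, and the API key is assumed to be configured server-side (hence the NULL argument, as in the queries later in this article).

// Create a vector index over the stored embeddings (1536 dimensions and
// cosine similarity match text-embedding-ada-002).
CALL db.index.vector.createNodeIndex('text_embeddings', 'Text', 'embedding', 1536, 'cosine');

// Generate and store embeddings for nodes that do not have one yet.
// For a large graph, this would need to be batched to respect API limits.
MATCH (n:Text) WHERE n.embedding IS NULL
WITH collect(n) AS nodes
CALL apoc.ml.openai.embedding([x IN nodes | x.text], NULL, {})
YIELD index, embedding
WITH nodes[index] AS node, embedding
SET node.embedding = embedding;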

A Lightweight RAG Solution

Again, to avoid the learning curve of an LLM/RAG framework such as LangChain or LlamaIndex, I will only use Cypher and the APOC procedures provided by the Neo4j graph DBMS.

You may find more details of this lightweight approach to building RAG applications in my previous post.

Building Lightweight RAG Applications Using Neo4j

LLM Models

The following tests were done using gpt-4-0613, the most recent and most advanced OpenAI LLM at the time of writing. For the text embedding model, text-embedding-ada-002 was used for both the knowledge content and the questions.

Examples of Query Results

All movies Tom Hanks has acted in (referenced as TOM-HANKS-MOVIES):

╒════════════════════════╤══════════╕
│m.title │m.released│
╞════════════════════════╪══════════╡
│"Joe Versus the Volcano"│1990 │
├────────────────────────┼──────────┤
│"A League of Their Own" │1992 │
├────────────────────────┼──────────┤
│"Sleepless in Seattle" │1993 │
├────────────────────────┼──────────┤
│"Apollo 13" │1995 │
├────────────────────────┼──────────┤
│"That Thing You Do" │1996 │
├────────────────────────┼──────────┤
│"You've Got Mail" │1998 │
├────────────────────────┼──────────┤
│"The Green Mile" │1999 │
├────────────────────────┼──────────┤
│"Cast Away" │2000 │
├────────────────────────┼──────────┤
│"The Polar Express" │2004 │
├────────────────────────┼──────────┤
│"The Da Vinci Code" │2006 │
├────────────────────────┼──────────┤
│"Charlie Wilson's War" │2007 │
├────────────────────────┼──────────┤
│"Cloud Atlas" │2012 │
└────────────────────────┴──────────┘

All movies Tom Cruise has acted in (referenced as TOM-CRUISE-MOVIES):


╒════════════════╤══════════╕
│m.title │m.released│
╞════════════════╪══════════╡
│"Top Gun" │1986 │
├────────────────┼──────────┤
│"A Few Good Men"│1992 │
├────────────────┼──────────┤
│"Jerry Maguire" │2000 │
└────────────────┴──────────┘
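For reference, a query along these lines over the Movie graph produces the tables above (swapping the actor parameter between the two names):

:param actor=>'Tom Hanks';

MATCH (p:Person {name: $actor})-[:ACTED_IN]->(m:Movie)
RETURN m.title, m.released
ORDER BY m.released;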

How Similarity Is Decided

To explore how vector search works, let’s start with a simple question:

Who is Tom Hanks?

Here is the Cypher to generate an answer using text embedding search via a vector index in Neo4j (the $vector_index parameter holds the name of the vector index created earlier). To retrieve enough content, I use top_k = 200 for the returned top-matching embeddings.

:param question=>'Who is Tom Hanks?';
:param top_k=>200;

// 1. Get text embedding for the question
CALL apoc.ml.openai.embedding([$question],NULL , {})
YIELD index, text, embedding
// 2. Search for similar embeddings via vector index
WITH text, embedding
CALL db.index.vector.queryNodes($vector_index, $top_k, embedding) YIELD node, score
WITH node, score
RETURN $question, node.text AS context, score;

Text embedding similarity search results:

╒═══════════════════╤══════════════════════════════════════════════════════════════════════╤══════════════════╕
│$question │context │score │
╞═══════════════════╪══════════════════════════════════════════════════════════════════════╪══════════════════╡
│"Who is Tom Hanks?"│"Person name Tom Hanks born 1956 " │0.955516517162323 │
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [You've Got Mail]" │0.9315997362136841│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [That Thing You Do]" │0.9315773248672485│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [Cast Away]" │0.9312677383422852│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [Apollo 13]" │0.9302988052368164│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] DIRECTED Movie [That Thing You Do]" │0.9302213788032532│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [A League of Their Own]" │0.9287475347518921│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Hanks] ACTED_IN Movie [The Polar Express]" │0.9279540181159973│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
... ... (192 more rows to follow)

The text embedding search works quite well here, as it has pulled out all facts about Tom Hanks, starting from the Person node representing him.

What’s interesting here is the row Person [Tom Hanks] DIRECTED Movie [That Thing You Do], which has a slightly higher similarity score than the rows after it, even though those rows, like the ones before it, are about movies Tom Hanks acted in. Why is this? I think it is likely because the data used in the pre-training stage had more mentions of this movie directed by Tom Hanks. Given this didn’t significantly change the answer, let’s move on.

Once all the Tom Hanks–related records (similarity score > 0.91) have been listed, the next four matches are still understandable to me, as they all have Tom as a first name.

The row “Person name Bill Paxton born 1955” was a bit strange at first glance, as it seems to have no connection with Tom Hanks at all. Going further down the list, I realized that Bill Paxton is in fact another actor in Apollo 13, i.e., a co-actor of Tom Hanks.

... ... (first 14 records omitted)

│"Who is Tom Hanks?"│"Person name Tom Cruise born 1962 " │0.9114127159118652│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person name Tom Skerritt born 1933 " │0.910643458366394 │
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person name Tom Tykwer born 1965 " │0.9104374647140503│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Tom Cruise] ACTED_IN Movie [A Few Good Men]" │0.902335524559021 │
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person name Bill Paxton born 1955 " │0.9013911485671997│
├───────────────────┼──────────────────────────────────────────────────────────────────────┼──────────────────┤
│"Who is Tom Hanks?"│"Person [Meg Ryan] ACTED_IN Movie [You've Got Mail]" │0.9006932973861694│

Apparently, this reflects what the embedding model learned from the massive data during its pre-training, which made it treat Apollo 13 as quite an important movie, and its co-actors as relevant too. This is also the power of LLM-based semantic search, which already carries sufficient knowledge from the common-sense / public knowledge domains.
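As a sanity check, such a score can also be recomputed outside the index by comparing the question embedding with a node’s stored embedding directly. The sketch below assumes the embeddings live in an embedding property and that the Graph Data Science plugin is installed for the gds.similarity.cosine function; neither assumption comes from the original setup.

CALL apoc.ml.openai.embedding([$question], NULL, {}) YIELD embedding
MATCH (n) WHERE n.text CONTAINS 'Bill Paxton'
// Raw cosine similarity between the question and the stored node embedding.
// Note: the score reported by the vector index may be a normalised variant of this value.
RETURN n.text AS context, gds.similarity.cosine(embedding, n.embedding) AS cosine;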

Results of Generation

The Cypher query that combines retrieval and generation is given below. It also saves the question, the matched embedding node IDs, and the generated answer in the graph database for future evaluation.

// 1. Get text embedding for the question
CALL apoc.ml.openai.embedding([$question],NULL , {})
YIELD index, text, embedding
// 2. Search for similar embeddings via vector index
WITH text, embedding
CALL db.index.vector.queryNodes($vector_index, $top_k, embedding) YIELD node, score
// 3. Retrieve text
WITH node, score
WITH collect(node.text) AS sentences, collect(id(node)) AS nodeIds, collect(score) AS scores
CREATE (q:Question) SET q.text = $question, q.ids = nodeIds, q.scores = scores
// 4. Prepare context by concatenating texts
WITH q, sentences
WITH reduce(text='', x IN sentences | text + x + '\n') + '\n----\n' AS context, q
// 5. Let LLM generate answers for the prompt
WITH "You are a customer service agent that helps a customer with answering questions about movies. Use the following context to answer the question at the end. Make sure not to make any changes to the context if possible when prepare answers so as to provide accuate responses. If the answer can't be found in context, just say that I don't know, don't try to make up an answer.\n\n----Context\n"
AS system_message,
"\n----\nQuestion:" AS user_message, context, q
CALL apoc.ml.openai.chat([
{role:"system", content:system_message + context},
{role:"user", content:$question}
], NULL, {model:'gpt-4'}) YIELD value
WITH q, context, value
SET q.answer = value.choices[0].message.content, q.created = value.created, q.finish_reason = value.choices[0].finish_reason
RETURN $question, context, value;

Using the retrieved context, gpt-4 generated the answer below:

Tom Hanks is a person born in 1956 who has acted in multiple movies, including “You’ve Got Mail”, “That Thing You Do”, “Cast Away”, “Apollo 13”, “A League of Their Own”, “The Polar Express”, “Joe Versus the Volcano”, “Cloud Atlas”, “Sleepless in Seattle”, “Charlie Wilson’s War”, and “The Green Mile”. He also directed the movie “That Thing You Do”.

I must admit that the answer is quite accurate, well summarized, and complete, as it covers knowledge sourced from his personal info, the movies he acted in, and the movie he directed.

Unrelated Noise

From the sample above, it is fair to say that the retrieval process will inevitably pull out a lot of less relevant content. As a result, to generate accurate answers from it, the LLM requires strong noise robustness.

Noise robustness means a model can extract useful information from noisy documents. Here, we define noisy documents/contents as those that are relevant to the question but do not contain any information about the answer. For the sample above, the noisy records are those after row 14, all the way down to row 200 (the limit set by top_k). That is more than 92%!

However, as there is no prior knowledge of how many records in the graph are related to Tom Hanks, tuning top_k is not a viable approach. Neither is tuning the chunk size of the documents.
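One partial mitigation, not a fix for the underlying problem, is to cut the candidate list by similarity score rather than by a fixed top_k. The threshold below is a hypothetical value for illustration only; a real one would have to be tuned per embedding model and corpus.

CALL apoc.ml.openai.embedding([$question], NULL, {}) YIELD embedding
CALL db.index.vector.queryNodes($vector_index, $top_k, embedding) YIELD node, score
WITH node, score
// Keep only candidates above a (hypothetical) similarity cut-off.
WHERE score > 0.92
RETURN node.text AS context, score
ORDER BY score DESC;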

Unfortunately, gpt-4 didn’t always get it right when there was noisy content. Let’s test another question:

How many movies did Tom Hanks act in before 1995, and what are they?

Retrieved context after concatenation:

"Person  name Tom Hanks born 1956 
Person [Tom Hanks] ACTED_IN Movie [That Thing You Do]
Person [Tom Hanks] ACTED_IN Movie [A League of Their Own]
Person [Tom Hanks] ACTED_IN Movie [You've Got Mail]
Person [Tom Hanks] ACTED_IN Movie [Sleepless in Seattle]
... ... ... ...
(row #93)
Movie title Sleepless in Seattle tagline What if someone you never met, someone you never saw, someone you never knew was the only someone for you? released 1993
... ... ... ...
(row #162)
Movie title A League of Their Own tagline Once in a lifetime you get a chance to do something different. released 1992
... ... ... ...
(200 rows in total)

Generated answer:

The context does not provide specific release dates for all the movies Tom Hanks acted in. Therefore, I cannot accurately determine which movies he acted in before 1995.

Correct answer: check TOM-HANKS-MOVIES above. There are three such movies:

╒════════════════════════╤══════════╕
│m.title │m.released│
╞════════════════════════╪══════════╡
│"Joe Versus the Volcano"│1990 │
├────────────────────────┼──────────┤
│"A League of Their Own" │1992 │
├────────────────────────┼──────────┤
│"Sleepless in Seattle" │1993 │
└────────────────────────┴──────────┘

It looks like gpt-4 forgot what was in the middle of the context. There is a paper discussing this observation, the so-called “Lost in the Middle” problem, which may shed some light on it.
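One workaround that has been suggested for this behaviour is to reorder the retrieved sentences so that the highest-scoring ones sit at both ends of the prompt and the least relevant ones in the middle. The sketch below covers steps 1–4 of the earlier query with such a reordering inserted before the concatenation; it is an illustration of the idea, not something guaranteed to fix the miss.

CALL apoc.ml.openai.embedding([$question], NULL, {}) YIELD embedding
CALL db.index.vector.queryNodes($vector_index, $top_k, embedding) YIELD node, score
WITH node, score ORDER BY score DESC
WITH collect(node.text) AS sentences
// Send alternate sentences to the front and to the back of the list, so the
// most relevant ones end up at both ends of the context and the least
// relevant ones in the middle.
WITH [i IN range(0, size(sentences) - 1) WHERE i % 2 = 0 | sentences[i]] AS front,
     reverse([i IN range(0, size(sentences) - 1) WHERE i % 2 = 1 | sentences[i]]) AS back
RETURN reduce(text = '', s IN front + back | text + s + '\n') + '\n----\n' AS context;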

Reasoning of Simple Maths

LLMs generally do well in logical reasoning, but poorly even on simple math-related questions.

Question:

How many movies did Tom Hanks act in before 2000, and what are they?

Generated answer:

Tom Hanks acted in 9 movies before 2000. These are “That Thing You Do”, “A League of Their Own”, “You’ve Got Mail”, “Sleepless in Seattle”, “Joe Versus the Volcano”, “Apollo 13”, “The Green Mile”, “The Polar Express”, and “The Da Vinci Code”.

The Polar Express shouldn’t be in the answer as it was released in 2004.
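For questions like this, the counting and the date comparison do not have to be left to the LLM at all. Since the facts already live in the graph, a structured query along these lines answers it exactly:

MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
WHERE m.released < 2000
RETURN count(m) AS numberOfMovies,
       collect(m.title + ' (' + toString(m.released) + ')') AS movies;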

Information Integration

Information integration evaluates whether the model can answer complex questions that require integrating information from multiple documents. For RAG, the challenge appears mainly in the retrieval process, i.e., deciding how much retrieved content is enough. There is also a challenge during the generation step.

Question:

At the age between 40 and 50, what movies did Tom Hanks and Tom Cruise act in?

Generated answer:

Tom Hanks was born in 1956, so between the ages of 40 and 50, that would be 1996–2006. In this period, he acted in the following films: “That Thing You Do”, “Saving Private Ryan”, “Charlie Wilson’s War”, “The Da Vinci Code”, “You’ve Got Mail”, “Cast Away”, “Cloud Atlas”, “The Green Mile” and “The Polar Express”.

Tom Cruise was born in 1962, so between 40 and 50, that would be 2002–2012. However, in the provided context, there are no films listed that Tom Cruise acted in during this period.

The answer not only gave an almost accurate result, but also laid out the logical steps to reach it. Quite amazing! However, Cloud Atlas shouldn’t be in the answer, as it was released in 2012. There needs to be a way to let the LLM validate the generated answer against its own reasoning.
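As with the date question above, this kind of cross-fact arithmetic can also be pushed down into the database rather than validated after generation. A sketch over the same graph, using released minus born as an approximate age since the graph stores only birth years:

// Approximate age at release: the graph stores only birth and release years.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name IN ['Tom Hanks', 'Tom Cruise']
  AND m.released - p.born >= 40 AND m.released - p.born <= 50
RETURN p.name AS actor,
       collect(m.title + ' (' + toString(m.released) + ')') AS moviesBetween40And50;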

Negative Rejection

Negative rejection means that LLMs should decline to answer a question when the required knowledge is not present in the retrieved context. LLMs are expected to indicate “insufficient information” or give other rejection signals. For a RAG implementation, we usually add instructions in the prompt like this:

You are a customer service agent who helps customers answer questions about movies. Use the following context to answer the question at the end. Make sure not to make any changes to the context, if possible, when preparing answers so as to provide accurate responses. If the answer can’t be found in context, just say that I don’t know; don’t try to make up an answer.

Question:

In which year did Tom Hanks direct the movie Lord of The Rings?

Generated answer:

I don’t know.

This is aligned with the instruction given above.

Conflicting Knowledge Detection

Conflicting knowledge detection in a Large Language Model (LLM) refers to the model’s ability to identify and handle instances where it encounters contradictory information within the context given for answer generation. Effective conflicting knowledge detection allows the LLM to recognize these discrepancies and either choose the most reliable information, provide multiple perspectives, or acknowledge the conflict when generating responses. This contributes to the accuracy and reliability of the information provided by the LLM.

Let me inject a conflicting piece of text at the beginning of the retrieved context:

"Person [Tom Hanks] did not DIRECT Movie [That Thing You Do] 
Person [Tom Hanks] DIRECTED Movie [That Thing You Do]
Person [Tom Hanks] ACTED_IN Movie [You've Got Mail]
Person [Tom Hanks] ACTED_IN Movie [A League of Their Own]
Person [Tom Hanks] ACTED_IN Movie [That Thing You Do]
... ... (196 more rows to come)

Question:

Which movies were directed by Tom Hanks?

Generated answer:

The movie directed by Tom Hanks is “That Thing You Do”.

Regardless of where I injected the conflicting text, at the beginning or at the end, the results were the same. It looks like gpt-4 simply ignored the conflict and firmly believed that Tom Hanks directed the movie That Thing You Do.
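For anyone who wants to reproduce these injection tests, the conflicting sentence can simply be prepended (or appended) to the collected context before the chat call. A sketch of the retrieval and context-preparation steps with the injected sentence passed as a parameter:

:param injected=>'Person [Tom Hanks] did not DIRECT Movie [That Thing You Do]';

CALL apoc.ml.openai.embedding([$question], NULL, {}) YIELD embedding
CALL db.index.vector.queryNodes($vector_index, $top_k, embedding) YIELD node, score
WITH collect(node.text) AS sentences
// Prepend the injected sentence; use sentences + [$injected] to append instead.
WITH [$injected] + sentences AS sentences
RETURN reduce(text = '', s IN sentences | text + s + '\n') + '\n----\n' AS context;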

Counterfactual Robustness

Counterfactual robustness evaluates whether the model can, through instruction, identify the risk of known factual errors in the retrieved context. To test this, we can ask questions that the LLM can answer directly from its own knowledge, while the retrieved context contains factual errors.

Here, let me inject a false fact at the beginning of the retrieved context:

"Person [Tom Hanks] DIRECTED Movie [Lord of the Rings] 
Person [Tom Hanks] DIRECTED Movie [That Thing You Do]
Person [Tom Hanks] ACTED_IN Movie [You've Got Mail]
Person [Tom Hanks] ACTED_IN Movie [A League of Their Own]
Person [Tom Hanks] ACTED_IN Movie [That Thing You Do]
Person name Tom Hanks born 1956
... ... (195 more rows to come)

Question:

Which movies were directed by Tom Hanks?

Generated answer:

Tom Hanks directed the movies “Lord of the Rings” and “That Thing You Do”.

In this case, even an advanced LLM like gpt-4 failed to give any warning about the obviously false fact.
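One thing worth experimenting with is making the warning explicit in the instruction, e.g., asking the model to flag context statements that contradict widely known facts. The prompt variant below is purely hypothetical and not guaranteed to change the behaviour; it assumes the context string has already been prepared as in the earlier query and is passed in as a parameter.

WITH "You are a customer service agent that helps a customer with answering questions about movies. Use the following context to answer the question at the end. If a statement in the context contradicts widely known public facts, explicitly warn the customer about it instead of repeating it as truth. If the answer can't be found in context, just say that I don't know, don't try to make up an answer.\n\n----Context\n"
AS system_message, $context AS context
CALL apoc.ml.openai.chat([
{role:"system", content:system_message + context},
{role:"user", content:$question}
], NULL, {model:'gpt-4'}) YIELD value
RETURN value.choices[0].message.content AS answer;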

Summary

While revolutionary in capturing semantic meaning, text embeddings often struggle with context sensitivity, contextual nuance, and evolving language use (source paper). When applied to RAG-like solutions, the content retrieved by embedding-based similarity search may consequently impact the accuracy and correctness of generation in Large Language Models (LLMs).

Even though RAG can improve the response accuracy of LLMs, it still suffers significantly from the challenges discussed above:

  • Context sensitivity
  • Unrelated noise
  • Reasoning of maths
  • Information integration
  • Negative rejection
  • Conflicting knowledge detection
  • Counterfactual robustness

Some of these challenges can be tackled and improved by fine-tuning a domain-specific embedding model; others need more advanced retrieval strategies that combine vector search with other search techniques, as sketched below.
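As one illustration of combining search techniques, a simple hybrid is to union the vector-search hits with full-text hits over the same text property. The sketch below assumes a full-text index named text_fulltext has been created over that property (not part of the original setup), and note that the two score scales are not directly comparable without normalisation.

// Vector-search hits
CALL apoc.ml.openai.embedding([$question], NULL, {}) YIELD embedding
CALL db.index.vector.queryNodes($vector_index, 20, embedding) YIELD node, score
RETURN node.text AS context, score, 'vector' AS source
UNION
// Full-text (keyword) hits over the same text property
CALL db.index.fulltext.queryNodes('text_fulltext', $question) YIELD node, score
RETURN node.text AS context, score, 'fulltext' AS source;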

A blog post on this can be found here:

Enhance Semantic Search of Text Embeddings through Collaborative Filtering over A Knowledge Graph

There are also advanced Neo4j RAG strategies to use in LangChain:

langchain/templates/neo4j-advanced-rag at master · langchain-ai/langchain

When implementing a RAG solution, specific test cases and evaluation metrics need to be defined for the scenarios related to these limitations.

