Context-Aware Knowledge Graph Chatbot With GPT-4 and Neo4j

Learn how to implement a chatbot that bases its answers on the information retrieved from a graph database

Not so long ago, OpenAI added the Chat API, which is optimized for generating conversations. The main difference between the Chat API and the older Completion API is that the Chat API allows specifying the conversation history. While the Completion API could be used for single-turn tasks like text generation and summarization, the Chat API is designed more for conversational tasks like chatbots or customer support. The ability to provide a dialogue history, or the context of the conversation, makes the model better at answering follow-up questions.
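To make the difference concrete, here is a minimal sketch of the message list a chat-style endpoint receives. The helper name build_messages and the sample texts are illustrative assumptions and not part of the original code; only the role/content structure mirrors what the Chat API expects.

```python
def build_messages(system_prompt, history, user_prompt):
    """Assemble a chat message list: system message, prior turns, new prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    # history is a list of (user_text, assistant_text) pairs, oldest first
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "You answer questions about movies.",
    [("Who directed Top Gun?", "Tony Scott")],
    "What else did he direct?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

Because the earlier exchange travels with the request, the pronoun "he" in the last prompt can be resolved by the model.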

Just a month ago, I wrote a post about how I implemented a knowledge-graph-based chatbot. There, I used the Completion API, specifically the text-davinci-003 model, to generate Cypher statements based on user prompts. However, since the Completion API is unaware of the conversation history, we must be very specific in the prompts about the entities we are interested in. For example, we cannot use pronouns like it, they, or she in follow-up questions, as the model has no context for what they refer to.

However, we can solve this problem by using the Chat API. Therefore, I have decided to try out the new GPT-4 as well as the GPT-3.5-turbo models and evaluate how they could be used for a chatbot that fetches its information from a knowledge graph.

Knowledge Graph-Based Chatbot Design

Chatbots can be used for various applications, as shown by the myriad applications popping up lately. One significant problem with LLMs like ChatGPT is that they can confidently answer any question while possibly providing false information, also known as hallucinations.

Hallucinations are acceptable in some domains and not in others. It is also hard to retrain a model not to hallucinate specific information. However, if we used a knowledge base like a graph database to base the chatbot’s answer on, we would have complete and total control over the answers the chatbot might provide.

Context-aware knowledge graph chatbot design. Image by the author.

It all begins when a user inputs their prompt. The prompt and the dialogue history are sent to the GPT-4 endpoint to generate a Cypher statement. The returned Cypher statement is then used to query the Neo4j database. While we could directly return the database results to the user, it is a better user experience if we send the database results along with the dialogue context to a model that turns them into a natural-language answer. Here, we use the GPT-3.5-turbo model to create the answer, as it is good enough and much cheaper than GPT-4. Finally, we send the generated response back to the user.
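The flow described above can be sketched as a single function. This is only an outline under stated assumptions: the function name answer and the three stand-in callables are hypothetical, and they replace the real OpenAI and Neo4j calls so that the example runs on its own.

```python
def answer(prompt, history, generate_cypher, run_query, generate_response):
    """One chatbot turn: prompt -> Cypher -> database rows -> natural language."""
    # 1. Ask the Cypher-generating model, giving it the dialogue so far
    cypher = generate_cypher(history + [{"role": "user", "content": prompt}])
    # 2. Run the statement against the graph database
    rows = run_query(cypher)
    # 3. Turn the raw rows into a conversational answer
    grounded_prompt = f"Question was {prompt} and the data is: {rows}"
    return generate_response(history + [{"role": "user", "content": grounded_prompt}])


# Stand-ins (assumptions) so the sketch runs without OpenAI or Neo4j
def fake_generate_cypher(messages):
    return 'MATCH (m:Movie {title: "Copycat"}) RETURN {plot: m.plot} AS result'


def fake_run_query(cypher):
    return [{"plot": "<plot text from the database>"}]


def fake_generate_response(messages):
    return "Here is what I found: " + messages[-1]["content"]


reply = answer("What is the plot of Copycat?", [], fake_generate_cypher,
               fake_run_query, fake_generate_response)
print(reply)
```

Passing the three steps in as callables keeps the orchestration separate from the model and database clients, which also makes it easy to swap GPT-4 and GPT-3.5-turbo between the two model roles.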

It sounds surprisingly easy and is also quite simple to implement. The code is available on GitHub.

Neo4j Environment & Dataset

First, we will configure the Neo4j environment. We will use the dataset available as the recommendations project in the Neo4j sandbox, which is based on the MovieLens dataset. The easiest solution is simply to create a Neo4j Sandbox instance by following this link. However, if you would prefer a local instance of Neo4j, you can also restore a database dump that is available on GitHub.

After the Neo4j database is instantiated, we should have a graph with the following schema populated.

Graph schema. Image by the author.

The database contains information about movies, actors, directors, and genres. Additionally, it contains ratings about movies by users. We can evaluate the size of the database with the following APOC procedure.

CALL apoc.meta.stats YIELD
nodeCount, relCount, labels


Database size. Image by the author.

The database has almost 30 thousand nodes and slightly more than 165 thousand relationships. The labels column provides the count of nodes per label. For example, there are nearly 10 thousand movies and 20 thousand people. It seems that a Person node can have a secondary label Actor or Director. If you want to get to know the dataset or Cypher a bit more first, you can follow the browser guide that is part of the Sandbox environment.

English2Cypher With GPT-4

Generating Cypher statements based on user input is vital to our chatbot. It needs to be able to reliably generate valid Cypher statements, as there is no human in the loop who can fix the queries if they are not working. Additionally, the model should be able to distill context from dialogue history.

OpenAI’s ChatCompletion endpoint currently supports only the GPT-3.5-turbo and GPT-4 models. I first tried the GPT-3.5-turbo model but quickly realized that it is not a good fit for generating Cypher statements based on dialogue history, as it often ignores instructions not to apologize or explain results. There are some rumors that GPT-3.5-turbo is Canadian, as it likes to apologize for apologizing too much. Additionally, GPT-4 is better at understanding context and learning from the training examples.

You can use GPT-3.5-turbo as well if you don’t have access to GPT-4 yet. The code includes cleaning the results of unwanted apologies and explanations.

First, we have to define the system message. In it, we define the role of the model as being a Cypher statement generator, along with some constraints like that it shouldn’t ever explain or apologize. Additionally, we don’t want it to construct any Cypher statements that cannot be inferred based on the training examples.

system = f"""
You are an assistant with an ability to generate Cypher queries based off example Cypher queries.
Example Cypher queries are: \n {examples} \n
Do not response with any explanation or any other information except the Cypher query.
You do not ever apologize and strictly generate cypher statements based of the provided Cypher examples.
You need to update the database using an appropriate Cypher statement when a user mentions their likes or dislikes, or what they watched already.
Do not provide any Cypher statements that can't be inferred from Cypher examples.
Inform the user when you can't infer the cypher statement due to the lack of context of the conversation and state what is the missing context.
"""

For example, suppose we don’t put the constraint that it should only use information in the training examples. In that case, it can generate Cypher statements for any user inputs, no matter how inaccurate.

I have evaluated ChatGPT’s ability to generate Cypher statements based on providing graph schema or sample Cypher statements on a couple of different datasets. However, supplying sample Cypher statements always seemed the better option, so this is now my default approach to using GPT models to generate Cypher statements.

We will be using the following examples to feed into GPT-4.

examples = """
# I have already watched Top Gun
MATCH (u:User {id: $userId}), (m:Movie {title:"Top Gun"})
MERGE (u)-[:WATCHED]->(m)
RETURN distinct {answer: 'noted'} AS result
# I like Top Gun
MATCH (u:User {id: $userId}), (m:Movie {title:"Top Gun"})
MERGE (u)-[:LIKE_MOVIE]->(m)
RETURN distinct {answer: 'noted'} AS result
# What is a good comedy?
MATCH (u:User {id:$userId}), (m:Movie)-[:IN_GENRE]->(:Genre {name:"Comedy"})
RETURN {movie: m.title} AS result
ORDER BY m.imdbRating DESC LIMIT 1
# Who played in Top Gun?
MATCH (m:Movie {title: "Top Gun"})<-[:ACTED_IN]-(a)
RETURN {actor: a.name} AS result
# What is the plot of the Copycat movie?
MATCH (m:Movie {title: "Copycat"})
RETURN {plot: m.plot} AS result
# Did Luis Guzmán appear in any other movies?
MATCH (p:Person {name:"Luis Guzmán"})-[:ACTED_IN]->(movie)
RETURN {movie: movie.title} AS result
# Recommend a movie
MATCH (u:User {id: $userId})-[:LIKE_MOVIE]->(m:Movie)
MATCH (m)<-[r1:RATED]-()-[r2:RATED]->(otherMovie)
WHERE r1.rating > 3 AND r2.rating > 3 AND NOT EXISTS {(u)-[:WATCHED|LIKE_MOVIE|DISLIKE_MOVIE]->(otherMovie)}
WITH otherMovie, count(*) AS count
RETURN {recommended_movie:otherMovie.title} AS result
ORDER BY count DESC LIMIT 1
"""

We have provided only seven Cypher examples. However, GPT-4 has a good grasp of Cypher and can use these examples to generate Cypher statements that don’t appear in the training examples. For the most part, we teach the model which relationships and properties to use when traversing or filtering nodes.

Now, we can define the function that generates Cypher statements based on the conversation.

import openai  # legacy SDK (openai<1.0), which provides openai.ChatCompletion
from retry import retry

@retry(tries=2, delay=5)
def generate_cypher(messages):
    messages = [
        {"role": "system", "content": system}
    ] + messages
    # Make a request to OpenAI (GPT-4 generates the Cypher)
    completions = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
    )
    response = completions.choices[0].message.content
    # Sometimes the model bypasses the system prompt and returns
    # data based on the previous dialogue history
    if "MATCH" not in response and "{" in response:
        raise Exception(
            "GPT bypassed system message and is returning response "
            "based on previous conversation history: " + response)
    # If the model apologized, remove the first line
    if "apologi" in response:
        response = " ".join(response.split("\n")[1:])
    # Sometimes the model wraps the Cypher in ``` fences when it wants to explain stuff
    if "```" in response:
        response = response.split("```")[1].strip("`")
    return response

If the model apologizes, it is always in the first line or sentence of the response. Additionally, when it offers unwanted explanations, it is kind enough to wrap the Cypher statement in triple backticks (```). These two data cleaning steps are more relevant for the GPT-3.5-turbo model.
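The two cleaning steps can also be isolated into a pure function that is easy to test without calling the API. The following is a sketch, not the code from the repository: the name clean_cypher_response and the handling of a language tag after the opening fence are my own assumptions.

```python
def clean_cypher_response(response):
    """Strip a leading apology and Markdown code fences from a model reply."""
    # An apology, when present, sits on the first line: drop that line
    if "apologi" in response.lower():
        response = "\n".join(response.split("\n")[1:])
    # Keep only the contents of the first fenced block, if there is one
    fence = "`" * 3
    if fence in response:
        response = response.split(fence)[1]
        # Drop a language tag such as "cypher" right after the opening fence
        first_line, _, rest = response.partition("\n")
        if rest and first_line.strip().isalpha():
            response = rest
    return response.strip()


reply = (
    "I apologize for the confusion.\n"
    "Here is the query:\n"
    + "`" * 3 + "cypher\n"
    'MATCH (m:Movie {title: "Copycat"})\n'
    "RETURN {plot: m.plot} AS result\n"
    + "`" * 3
)
print(clean_cypher_response(reply))
```

A reply that is already a bare Cypher statement passes through unchanged apart from surrounding whitespace.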

We also raise an exception if the model completely bypasses the system message and returns a message from previous conversations. You might better understand this scenario through an example.

print(generate_cypher([
    {'role': 'user', 'content': 'What are some good cartoon?'},
    {'role': 'assistant', 'content': 'Shrek 3'},
    {'role': 'user', 'content': 'Which actors appeared in it?'}]))
# MATCH (m:Movie {title: "Shrek 3"})<-[:ACTED_IN]-(a:Person)
# RETURN {actor: a.name} AS result

In this example, we provide the model with some conversation history. For the model to learn the context of the conversation, we don’t pass it the Cypher statements it generated previously but the answers we got from Neo4j using those Cypher statements. If you use GPT-3.5-turbo, it will apologize for not providing a Cypher statement in the previous conversation, even when you tell it five times not to do that. GPT-4 will not apologize. However, it does learn from the provided conversation history and sometimes bypasses the system message altogether, returning Shrek 3 as the result when it should be generating a Cypher statement.

The idea that you shouldn’t input proprietary code or information to GPT is nothing new. It was just a newsflash of how obvious the learning process was, as my conversation history is custom generated and doesn’t include previously generated Cypher statements that the model delivered.

Additionally, we can constrain what the model does when it is provided with an input that cannot be inferred from training examples.

print(generate_cypher([
    {'role': 'user', 'content': 'What are some good cartoon?'},
    {'role': 'assistant', 'content': 'Shrek 3'},
    {'role': 'user', 'content': 'Who was the first person on the moon?'}]))
# I can only provide Cypher queries based on the provided examples.
# Please ask a question related to the examples or provide a new Cypher query example.

However, even with the constraints in place, it might not abide by them. Perhaps there are better prompt options to keep GPT-4 in check.


The second piece of the puzzle is the natural text generation based on the results we get from the graph database. Similar to before, we want to include conversation history to make the generated text sound more authentic. Here, we can use the GPT-3.5-turbo as it is much cheaper and good enough to generate natural language text.

Through some trial and error, I developed the following system message.

system = f"""
You are an assistant that helps to generate text to form nice and human understandable answers based.
The latest prompt contains the information, and you need to generate a human readable response based on the given information.
Make it sound like the information are coming from an AI assistant, but don't add any information.
Do not add any additional information that is not explicitly provided in the latest prompt.
I repeat, do not add any information that is not explicitly given.
"""

Adding the instruction to make the information sound like it is coming from an AI assistant makes it feel more like you are having a conversation with the bot instead of reading a summary of the database results. However, when you tell the model to behave like an AI assistant, it wants to add additional information it knows to the responses. We want to avoid that, as GPT-3.5-turbo can hallucinate results, and it would be impossible to tell which information came from the database and which from the model. Therefore, I added three sentences telling it not to add any additional information, as GPT-3.5-turbo isn’t as good at following instructions stated only once. The system prompt also tells it not to apologize. However, being Canadian, it can’t help itself.

The function to generate natural language from database results is the following.

def generate_response(messages):
    messages = [
        {"role": "system", "content": system}
    ] + messages
    # Make a request to OpenAI (GPT-3.5-turbo generates the answer text)
    completions = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    response = completions.choices[0].message.content
    # If the model apologized, remove the first line or sentence
    if "apologi" in response:
        if "\n" in response:
            response = " ".join(response.split("\n")[1:])
        else:
            # Re-join with "." so the remaining sentences keep their periods
            response = ".".join(response.split(".")[1:]).strip()
    return response

Here, we needed only to filter out apologies as the model doesn’t feel the need to explain how it generated text.

We can test the function on the following example.

data = [{'actor': 'Sigourney Weaver', 'role': "Witch"}, 
{'actor': 'Holly Hunter', "role": "Assassin"},
{'actor': 'Dermot Mulroney'},
{'actor': 'William McNamara'}]
print(generate_response([{'role': 'user', 'content': str(data)}]))
#The list contains four actors and their respective roles.
#Sigourney Weaver played the role of a witch, while Holly Hunter portrayed an assassin.
#The roles of Dermot Mulroney and William McNamara were not specified.

I noticed that using JSON or dictionary objects is the best way to provide some context along with the database results. That way, we can present the model with the value context using the dictionary key. So, for example, the model now knows that it has been provided with actors and their respective roles.
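As a small variation on the str(data) call used above, the results can be serialized with json.dumps before they are placed in the prompt. This is a sketch under my own assumptions: the helper name results_to_prompt is hypothetical, and the prompt wording is borrowed from the chatbot code later in this post.

```python
import json


def results_to_prompt(question, results):
    """Embed database rows in the prompt as JSON so the keys carry the context."""
    # ensure_ascii=False keeps names like "Luis Guzmán" readable for the model
    payload = json.dumps(results, ensure_ascii=False)
    return (
        f"Question was {question} and the response should include "
        f"only information that is given here: {payload}"
    )


rows = [{"actor": "Sigourney Weaver", "role": "Witch"}, {"actor": "Dermot Mulroney"}]
print(results_to_prompt("Who played in Copycat?", rows))
```

Either form works as long as the keys stay attached to the values; JSON simply gives a stable, quoting-consistent rendering.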

Chatbot Implementation Using Streamlit

We will use the streamlit-chat package to develop the user interface for our chatbot. It is effortless to use and great for simple demos. I will walk you through the parts that I feel are important and not go through all the code. However, you can always check the chatbot implementation on GitHub.

Example dialogue. Image by the author.

An important part of our chatbot is storing the conversation history along with information about generated Cypher statements and database results. We will store the relevant information in Streamlit’s session_state object.

# Generated natural language
if 'generated' not in st.session_state:
    st.session_state['generated'] = []
# Neo4j database results
if 'database_results' not in st.session_state:
    st.session_state['database_results'] = []
# User input
if 'user_input' not in st.session_state:
    st.session_state['user_input'] = []
# Generated Cypher statements
if 'cypher' not in st.session_state:
    st.session_state['cypher'] = []

Additionally, we must define a function that constructs the conversation history in the form that OpenAI’s ChatCompletion endpoint expects.

def generate_context(prompt, context_data='generated'):
    context = []
    # If any history exists
    if st.session_state['generated']:
        # Add the last three exchanges
        size = len(st.session_state['generated'])
        for i in range(max(size-3, 0), size):
            context.append(
                {'role': 'user', 'content': st.session_state['user_input'][i]})
            context.append(
                {'role': 'assistant', 'content': st.session_state[context_data][i]})
    # Add the latest user prompt
    context.append({'role': 'user', 'content': str(prompt)})
    return context

In this example, we construct the dialogue context using only up to three exchanges between the user and the assistant. Therefore, we will send at most six messages (three by the user and three by the assistant). While the user message is always taken from the user_input session state, we can specify which session state should be used as input to the assistant message. By default, we use the generated session state.
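The windowing logic is easier to check when it is separated from Streamlit. Below is a sketch of the same idea as a pure function; the name build_context, the window parameter, and the placeholder strings are assumptions made for illustration.

```python
def build_context(user_inputs, assistant_messages, prompt, window=3):
    """Return the last `window` exchanges plus the new prompt as chat messages."""
    context = []
    size = len(assistant_messages)
    for i in range(max(size - window, 0), size):
        context.append({"role": "user", "content": user_inputs[i]})
        context.append({"role": "assistant", "content": assistant_messages[i]})
    context.append({"role": "user", "content": str(prompt)})
    return context


past_questions = ["q1", "q2", "q3", "q4", "q5"]
past_answers = ["a1", "a2", "a3", "a4", "a5"]
context = build_context(past_questions, past_answers, "q6")
print(len(context))             # 7 messages: 3 exchanges plus the new prompt
print(context[0]["content"])    # q3, the oldest exchange still in the window
```

A fixed window keeps the request size bounded, at the cost of the model forgetting anything mentioned more than three exchanges ago.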

Finally, we have to define how to handle user inputs:

if user_input:
    cypher = generate_cypher(generate_context(user_input, 'database_results'))
    # If not a valid Cypher statement
    if "MATCH" not in cypher:
        print('No Cypher was returned')
        # Assumption: the truncated call recorded this in the 'cypher' session list
        st.session_state['cypher'].append(
            "No Cypher statement was generated")
    else:
        # Query the database, userId is hardcoded
        results = run_query(cypher, {'userId': USER_ID})
        # Hardcode result limit to 10
        results = results[:10]
        # Graph2text
        answer = generate_response(generate_context(
            f"Question was {user_input} and the response should include only information that is given here: {str(results)}"))

Every generated Cypher statement should have a MATCH clause in it. If it doesn’t have it, the GPT-4 model is likely trying to tell us something. For example, it won’t return a Cypher statement when it can’t resolve the prompt as no context is given.

Missing context. Image by the author.

Otherwise, if the model can generate a valid Cypher statement based on user input, we use that Cypher statement to query the Neo4j database and use the results to create natural language-sounding answers. Since most Cypher statements in the training examples don’t have any LIMIT set, we add a manual limit to display ten results at the most from each query.

Additionally, queries that are updating the database use the userId parameter, which is provided by the application and not the model.
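The run_query helper is not shown in this post, so here is one possible shape for it, offered as an assumption rather than the repository's implementation. To keep the sketch runnable without a database, the session is passed in explicitly and a stand-in class plays its role; with the official neo4j Python driver, the session would come from GraphDatabase.driver(uri, auth=(user, password)).session().

```python
def run_query(session, cypher, params=None):
    """Run a Cypher statement and return the `result` column of every record."""
    # Values in params (for example userId) are bound by the driver as query
    # parameters, so the model never chooses whose data gets updated.
    records = session.run(cypher, params or {})
    return [record["result"] for record in records]


class FakeSession:
    """Stand-in for a neo4j session so the sketch runs without a database."""

    def run(self, cypher, params):
        self.last_call = (cypher, params)
        return [{"result": {"answer": "noted"}}]


session = FakeSession()
rows = run_query(
    session,
    'MATCH (u:User {id: $userId}), (m:Movie {title: "Top Gun"}) '
    "MERGE (u)-[:WATCHED]->(m) RETURN distinct {answer: 'noted'} AS result",
    {"userId": "user-1"},  # hypothetical ID
)
print(rows)
```

Since every training example returns a map aliased as result, reading that single column yields a list of dictionaries that can be handed straight to the text-generation step.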

Example Dialogue Flow

We can now test the chatbot on an example flow.

Finding movies by their titles. Image by the author.

The training set contains an example that allows us to find movies based on their titles. In this example, we have searched for Pokémon movies. We can now ask the model a follow-up question.

Follow-up question about actors. Image by the author.

Since we provided the context of the conversation, the model could deduce which movie is being referenced. Unfortunately, the GPT-3.5-turbo model decided to add additional information about the English version (also known as X in English), even though it was explicitly told three times not to do that. I have tested the same flow with GPT-4 generating natural language text.

GPT-4 is much better at following the rules. Image by author.

Using GPT-4 as the natural language generation model, we were able to keep the model from adding unwanted information.

I’ll be using GPT-4 as the text-generating model in the following examples. GPT-3.5-turbo was messing with me a bit when I wanted to replicate the conversation flow, so I gave up on it. For the most part, GPT-3.5-turbo is good enough, but it has its moments.

We can ask another follow-up question.

Finding other movies. Image by the author.

Here, I got pretty excited. GPT-4’s ability to infer new Cypher queries based on the training examples is astounding. It has a good understanding of Cypher itself, and when it grasps a given graph schema, it performs very well.

We have also added training examples that generate Cypher statements that update the database.

Storing information about watched movies. Image by the author.

Additionally, we can also let it know which movies we like.

Storing information about liked movies. Image by the author.

Now that it knows which movies we like, we can use it to recommend movies based on our likes.

Basic recommendations. Image by the author.

I added a simple recommendation query in the training examples that uses a basic variation of collaborative filtering to recommend movies. Simply put, we utilize the rating of movies available in the graph. If a user liked the same movie as we did, we examine which other movies they also liked that we haven’t watched and recommend the most frequent one.
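To illustrate the counting idea outside of Cypher, here is a small pure-Python sketch. The ratings are made-up toy data, and the function name recommend and its parameters are assumptions; the logic mirrors the description above by counting how often other highly rated movies co-occur with a movie we like.

```python
from collections import Counter


def recommend(liked, seen, ratings, threshold=3):
    """Return the unseen movie most often rated highly by fans of `liked` movies."""
    counts = Counter()
    excluded = seen | liked
    for user_ratings in ratings.values():
        # Movies this user rated above the threshold
        fond_of = {m for m, score in user_ratings.items() if score > threshold}
        # One count per (liked movie, other movie) pair, like count(*) in the query
        for _ in liked & fond_of:
            for other in fond_of - excluded:
                counts[other] += 1
    return counts.most_common(1)[0][0] if counts else None


# Toy ratings (invented for this example): user -> {movie: rating}
ratings = {
    "alice": {"Top Gun": 5, "Heat": 4, "Alien": 5},
    "bob": {"Top Gun": 4, "Alien": 4, "Copycat": 2},
    "carol": {"Heat": 5, "Copycat": 5},
}
print(recommend(liked={"Top Gun"}, seen={"Top Gun"}, ratings=ratings))  # Alien
```

Alien wins because both users who rated Top Gun highly also rated it highly, while Heat is backed by only one of them.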

I was testing out various flows and found it interesting how the chatbot behaved. For example, I tried to make the model second-guess the recommendation.

Trying to make the model second guess. Image by the author.

Here is another example where the model completely overrides the system message. The English2Cypher part has explicit instructions that it needs to generate only Cypher statements and nothing else. However, in this example, it simply reiterated the recommendation as it was available in the conversation history and didn’t generate any Cypher statement.

Note that GPT-4 and GPT-3.5-turbo are not deterministic, which means you might get different results (or not).

However, having the ability to provide conversation history and persist information in the database gives a feeling of great user experience that can be used in many other applications.

Conversation history and storing information in the database allow for human-like dialogues. Image by the author.

Multi-Language Capabilities

Not often mentioned, but both GPT-3.5-turbo and GPT-4 understand many languages. I gave that a little test as well.

GPT-4 multilanguage capabilities. Image by the author

If that is not amazing, I don’t know what is. We provided the model with a few English examples, and now it can be used in many languages. Also, the responses are in an appropriate language.

Lastly, I also wanted to test if it will automatically translate plots as the plots are originally in English.

Plot translation. Image by the author

The plot information provided to GPT-4 was:

"As students at the United States Navy's elite fighter weapons school compete 
to be best in the class, one daring young pilot learns a few things from a
civilian instructor that are not taught in the classroom."

And yes, the model knows it should probably translate the answer to the same language the question was asked in. Imagine all the possibilities of bringing information or content worldwide without worrying too much about translations.


The addition of OpenAI’s Chat Completion endpoint allows us to generate more human-like dialogues. It has excellent value for answering follow-up questions that rely on information provided in the conversation history, making the whole discussion smoother. It can make the user feel like they are talking to someone who understands their meaning.

I would recommend using GPT-4 when possible, especially for generating Cypher statements, as it is better at following instructions and understanding the given task. However, at the moment, I would avoid using any GPT model on private or proprietary data as I have seen firsthand how the model might use previous conversation histories as training examples for the model. With the knowledge-graph-based chatbot, it was fairly obvious. The model should only return Cypher statements, but it somehow decided a couple of times to override system instructions and return the previously provided database results, which were used to understand the conversation context.

As always, all the code is available on GitHub.

Context-Aware Knowledge Graph Chatbot With GPT-4 and Neo4j was originally published in Neo4j Developer Blog on Medium.