Building A Graph & LLM-Powered RAG Application from PDF Documents


A Step-by-step Walkthrough With GenAI-Stack and OpenAI


Sculptures of lifeguards at Geelong Beach, Victoria, Australia. Photo by author.

Abstract

In this article, I will walk through all the steps required to build a RAG application from PDF documents, building on the ideas and experiments from my previous blog posts.

Project repository: github.com/Joshua-Yu/graph-rag (Graph based retrieval + GenAI = Better RAG in production)

Overview

There have been more than enough articles and products discussing how to build a better Retrieval Augmented Generation (RAG) solution. In previous posts, I explained the key aspects of RAG, mainly knowledge storage, indexing, and retrieval, using a property graph database for both unstructured and structured data.

Here, I’d like to bring all the pieces together by demonstrating a sample project covering the end-to-end pipeline, from parsing and ingesting PDF documents, to knowledge graph creation, to retrieving context from the graph for given natural language questions.

Key Solution Components

1. PDF Document Parsing & Content Extraction

LLM Sherpa (github) is a Python library and API for parsing PDF documents while preserving hierarchical layout information, e.g., documents, sections, sentences, tables, and so on.

Compared to typical chunking strategies, which only split text into fixed-length pieces with some overlap, preserving the document structure allows more flexible chunking and hence a more complete and relevant context for generation.
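As a quick illustration, here is a minimal sketch of parsing a PDF with LLM Sherpa’s LayoutPDFReader (the API URL below is the public llmsherpa endpoint; the PDF path is a placeholder):

from llmsherpa.readers import LayoutPDFReader

# Public LLM Sherpa parsing service; it can also be self-hosted.
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)

# read_pdf() accepts a local file path or a URL.
doc = pdf_reader.read_pdf("papers/example.pdf")

# The parsed document keeps its layout hierarchy: sections, chunks, tables.
for section in doc.sections():
    print(section.title)
for chunk in doc.chunks():
    print(chunk.to_context_text())  # chunk text plus its section context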

2. Neo4j AuraDB for Knowledge Store

Neo4j AuraDB is a fully managed cloud service from Neo4j, Inc. that offers the popular graph database as a cloud-hosted solution. It’s designed to provide users with the powerful capabilities of the Neo4j graph database without the complexity of managing the infrastructure.

AuraDB has a free tier to experiment and try out its features. For detailed steps to create your own instance, you can follow the online documentation or this article:

Adding Q&A Features to Your Knowledge Graph in 3 Simple Steps

3. Python + Neo4j Driver for Data Ingestion

Contents in PDF documents are loaded into Neo4j via the Python Driver using Cypher query language.
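As a minimal sketch of such a write (connection details are placeholders; the Chunk label and sentences property follow the document schema used later in this project):

from neo4j import GraphDatabase

# Placeholder AuraDB connection details; use your own instance credentials.
driver = GraphDatabase.driver(
    "neo4j+s://<instance>.databases.neo4j.io",
    auth=("neo4j", "<password>"),
)

def create_chunk(tx, chunk_id, text):
    # MERGE keeps the load idempotent if the same chunk is ingested twice.
    tx.run(
        "MERGE (c:Chunk {id: $id}) SET c.sentences = $text",
        id=chunk_id, text=text,
    )

with driver.session() as session:
    session.execute_write(create_chunk, "chunk-001", "A sentence from a PDF.")
driver.close()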

4. Neo4j Vector Index for Semantic Search

Neo4j provides native indexes for standard data types, full-text search, and vectors generated by text embedding procedures.
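For instance, a vector index over the Embedding nodes used in this project can be created and queried roughly as follows (the index name chunkVectorIndex, the value property, and the dimension of 1536 for text-embedding-ada-002 are assumptions for illustration):

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j+s://<instance>.databases.neo4j.io",
    auth=("neo4j", "<password>"),
)

with driver.session() as session:
    # Create the vector index; 1536 is the dimension of text-embedding-ada-002.
    session.run(
        "CALL db.index.vector.createNodeIndex("
        "'chunkVectorIndex', 'Embedding', 'value', 1536, 'cosine')"
    )
    # Query the index with a question embedding to get the top-3 matches.
    result = session.run(
        "CALL db.index.vector.queryNodes('chunkVectorIndex', 3, $embedding) "
        "YIELD node, score RETURN node, score",
        embedding=[0.0] * 1536,  # placeholder; pass a real embedding here
    )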

If text embeddings and vectors are new to you, here is a post describing the concepts with samples of usage:

Text Embedding — What, Why and How?

5. GenAI-Stack for Fast Prototyping

The GenAI Stack is a pre-built development environment created by Neo4j in collaboration with Docker, LangChain, and Ollama. This stack is designed for creating GenAI applications, particularly focusing on improving the accuracy, relevance, and provenance of generated responses in LLMs (Large Language Models) through RAG.

Fast Track to Mastery: Neo4j GenAI Stack for Efficient LLM Applications

In our project, we only need the LangChain part for the quick development of a chat application.
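As a rough sketch, the LangChain pieces involved boil down to a prompt plus a chat model (imports follow the LangChain 0.0.x layout bundled with the stack at the time of writing; newer releases have moved some of these):

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])
chain = prompt | llm  # LangChain Expression Language pipeline

answer = chain.invoke({"context": "...", "question": "What is InstructGLM?"})
print(answer.content)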

6. OpenAI Models for Embedding & Text Generation

OpenAI’s embedding model, text-embedding-ada-002, and the GPT-4 LLM are used, so you will need an OpenAI API key.
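For reference, generating an embedding looks like this (using the openai Python package’s v1 interface; the client reads OPENAI_API_KEY from the environment):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="A chunk of text extracted from a PDF.",
)
vector = response.data[0].embedding  # a list of 1,536 floats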

Project Walkthrough

1. Prepare

1.1 Clone the sample project repository: github.com/Joshua-Yu/graph-rag

1.2 Create a Neo4j AuraDB Free instance.

1.3 Install Dependencies

  • llmsherpa (github)
  • Neo4j GenAI-Stack (github), which already bundles LangChain and Streamlit (front-end)
  • Neo4j driver for Python

1.4 OpenAI API key

2. PDF Document Parsing & Loading Into Neo4j

Under the sample project folder, customize and run the notebook LayoutPDFReader_KGLoader by:

  • providing the location of the PDF documents, either a file-system path or a URL
  • updating the Neo4j AuraDB connection details
  • running initialiseNeo4j() to create constraints and indexes (only needed once)
  • running ingestDocumentNeo4j() to load the full contents of a document into the graph database; for now, text chunks and tables are supported (see the sketch below)
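Putting it together, the notebook’s flow looks roughly like the following (initialiseNeo4j() and ingestDocumentNeo4j() are defined in the notebook; the exact ingestDocumentNeo4j() signature here is an assumption, so check the notebook for the real one):

from llmsherpa.readers import LayoutPDFReader

pdf_file = "papers/example.pdf"  # placeholder: a local path or a URL
pdf_reader = LayoutPDFReader(
    "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
)
doc = pdf_reader.read_pdf(pdf_file)

# Both functions below come from the LayoutPDFReader_KGLoader notebook.
initialiseNeo4j()                   # create constraints and indexes (run once)
ingestDocumentNeo4j(doc, pdf_file)  # load chunks and tables into the graph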

3. Generate and Store Embeddings for Text and Table Name

Open and run the notebook KGEmbedding_Populate to generate embeddings and store them on Embedding nodes, after providing your OpenAI API key and the Neo4j AuraDB connection details.


# First parameter is the node label;
# the second is the node property containing the text to be embedded.
LoadEmbedding("Chunk", "sentences")
LoadEmbedding("Table", "name")
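Under the hood, each embedding ends up on a separate Embedding node connected to its source node. Here is a hypothetical sketch of that storage step (the real logic lives in the notebook; the HAS_EMBEDDING relationship and the value property are assumptions based on the schema figure below):

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j+s://<instance>.databases.neo4j.io",
    auth=("neo4j", "<password>"),
)

def store_embedding(node_id, vector):
    # Attach the vector to an Embedding node linked to the source node.
    with driver.session() as session:
        session.run(
            "MATCH (n) WHERE elementId(n) = $id "
            "MERGE (n)-[:HAS_EMBEDDING]->(e:Embedding) "
            "SET e.value = $vector",
            id=node_id, vector=vector,
        )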

Now, we have created a document graph with the following schema:

Document Graph Schema. Dashed arrows are to be created in the future.

4. Prepare Chat Application

Go to the location of the cloned genai-stack project, and copy the files and sub-folder under the genai-stack folder of the sample project into it. Here is a brief description:

  • chains.py: the updated Python file, which contains the new structure-aware retriever procedure

# from line 154
def configure_qa_structure_rag_chain(llm, embeddings, embeddings_store_url, username, password):
    # RAG response based on vector search and retrieval of structured chunks
    ...
  • cs_bot_papers.py: the chat front-end based on Streamlit and the new retriever.

For a more detailed explanation of this structure-aware retriever, please check my other blog post:

Adding Structure-Aware Retrieval to GenAI Stack
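In essence, the retriever combines a vector search with a graph traversal that pulls in the sibling chunks of the same section. Here is a hypothetical Cypher retrieval query illustrating the idea (relationship names such as HAS_EMBEDDING and HAS_PARENT are assumptions; see chains.py for the actual query):

# Illustrative only: the vector search finds the best-matching chunks, then
# the graph expands each hit to the other chunks of the same section.
retrieval_query = """
CALL db.index.vector.queryNodes('chunkVectorIndex', $k, $embedding)
YIELD node AS emb, score
MATCH (emb)<-[:HAS_EMBEDDING]-(chunk:Chunk)-[:HAS_PARENT]->(section)
MATCH (section)<-[:HAS_PARENT]-(sibling:Chunk)
RETURN section, collect(sibling.sentences) AS context, max(score) AS score
ORDER BY score DESC
"""

The top sections, together with their chunks, then become the context passed to the LLM.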

5. Run the Chat Application

Now it’s ready to go! Run the Streamlit application by typing (you may use a slightly different command line depending on your Python environment):


python -m streamlit run cs_bot_papers.py

To demonstrate, I loaded a relatively new paper, Natural Language is All a Graph Needs (arXiv), and asked a question about its InstructGLM approach.

When RAG is disabled, GPT-4 couldn’t give a correct answer because its knowledge cutoff is April 2023:

When RAG is enabled, the application:

  • first generates a text embedding of the question,
  • uses that embedding for a vector search in Neo4j AuraDB,
  • retrieves, for the top-matched chunks, related content from the same section (the structure-aware retrieval part), and
  • passes the retrieved context together with the user question to GPT-4 to generate the answer.

This is the so-called structure-aware retrieval. From the screenshot below, we can see that not only the answer but also the locations of the references are given, passed along as metadata with the retrieved information.

Further Discussions

I hope you can follow the steps above and successfully run your own RAG chat application powered by a graph database.

Using Neo4j to store knowledge takes advantage not only of its flexible data model, versatile indexing features, and fast query performance, but also of the power of the Cypher query language, which opens up further potential like complex retrieval.


I will continue the journey of adding more critical features to this sample project in my future posts.

If you liked what you have learned so far, feel free to follow me or connect with me on LinkedIn. I’d love to hear your feedback too!


Building A Graph+LLM Powered RAG Application from PDF Documents was originally published in the Neo4j Developer Blog on Medium.