A Question-Based Extraction Approach
In this blog post, we introduce an approach that uses Graph Retrieval-Augmented Generation (GraphRAG) to streamline the process of ingesting commercial contract data and building a Q&A agent.
This approach diverges from traditional Retrieval-Augmented Generation (RAG) by emphasizing efficiency in data extraction, rather than breaking down and vectorizing entire documents indiscriminately, which is the predominant RAG approach.
In conventional RAG, every document is split into chunks and vectorized for retrieval, which can result in a large volume of unnecessary data being split, chunked, and stored in vector indexes. Here, however, the focus is on extracting only the most relevant information from every contract for a specific use case: Commercial Contract Review. The data is then structured into a knowledge graph, which organizes key entities and relationships, allowing for more precise graph data retrieval through Cypher queries and vector search.
By minimizing the amount of vectorized content and focusing on highly relevant knowledge extracted, this method enhances the accuracy and performance of the Q&A agent, making it suitable for handling complex and domain-specific questions.
The four-stage approach consists of targeted information extraction (LLM + prompt), knowledge graph creation (LLM + Neo4j), a simple set of graph data retrieval functions (Cypher, Text2Cypher, vector search), and finally a Q&A agent that uses those retrieval functions (Microsoft Semantic Kernel).
The diagram below illustrates the approach.
The four-stage GraphRAG approach: From question-based extraction > knowledge graph model > GraphRAG retrieval > Q&A Agent. Image by Sebastian Nilsson @ Neo4j, reproduced here with permission from author.
But first, for those unfamiliar with commercial law, let’s start with a brief intro to the contract review problem.
Contract Review and Large Language Models
Commercial contract review is a labor-intensive process involving paralegals and junior lawyers meticulously identifying critical information in a contract.
“Contract review is the process of thoroughly reading a contract to understand the rights and obligations of an individual or company signing it and assess the associated impact.” -Hendrycks, Burns et al, NeurIPS 2021, in CUAD an Expert-Annotated NLP Dataset for Legal Contract Review
The first stage of contract review involves reviewing hundreds of pages of contracts to find the relevant clauses or obligations. Contract reviewers must identify whether relevant clauses exist, what they say if they do exist, and keep track of where they are described.
For example: “They must determine whether the contract is a 3-year contract or a 1-year contract. They must determine the end date of a contract. They must determine whether a clause is, say, an Anti-assignment or an Exclusivity clause…” -Hendrycks, Burns et al, NeurIPS 2021, in CUAD an Expert-Annotated NLP Dataset for Legal Contract Review
It’s a task that demands thoroughness and often suffers from inefficiencies, yet it is well suited to a Large Language Model.
Once the first stage is completed, senior law practitioners can start to examine contracts for weaknesses and risks. This is an area where a Q&A agent powered by an LLM and grounded by information stored in a knowledge graph is a perfect co-pilot for a legal expert.
A Four-Step Approach to Build a Commercial Contract Review Agent with LLMs, Function Calling, and GraphRAG
The remainder of this blog will describe each of the steps in this process. Along the way, I will use code snippets to illustrate the main ideas.
The four steps:
- Extracting relevant information from contracts (LLM + Contract)
- Storing information extracted into a knowledge graph (Neo4j)
- Developing simple knowledge graph data retrieval functions (Python)
- Building a Q&A agent handling complex questions (Semantic Kernel, LLM, Neo4j)
The Dataset
The CUAD (Contract Understanding Atticus Dataset) is a CC BY 4.0 licensed and publicly available dataset of over 13,000 expert-labeled clauses across 510 legal contracts, designed to help build AI models for contract review. It covers a wide range of important legal clauses, such as confidentiality, termination, and indemnity, which are critical for contract analysis.
We will use three contracts from this dataset to showcase how our approach extracts and analyzes key legal information, builds a knowledge graph, and leverages it for precise, complex question answering.
The three contracts combined contain a total of 95 pages.
Step 1. Extracting Relevant Information From Contracts
It is relatively straightforward to prompt an LLM to extract precise information from contracts and generate a JSON output, representing the relevant information from the contract.
In commercial review, a prompt can be drafted to locate each of the critical elements mentioned above — parties, dates, clauses — and summarize them neatly in a machine-readable JSON file.
Extraction Prompt (simplified)
Answer the following questions using information exclusively on this contract [Contract.pdf]
1) What type of contract is this? 2) Who are the parties and their roles? Where are they incorporated? Name state and country (use ISO 3166 Country name) 3) What is the agreement date? 4) What is the effective date?
For each of the following types of contract clauses, extract two pieces of information: a) A Yes/No that indicates if you think the clause is found in this contract b) A list of excerpts that indicate that this clause type exists
Contract Clause types: Competitive Restriction Exception, Non-Compete Clause, Exclusivity, No-Solicit Of Customers, No-Solicit Of Employees, Non-Disparagement, Termination For Convenience, Rofr/Rofo/Rofn, Change Of Control, Anti-Assignment, Uncapped Liability, Cap On Liability
Provide your final answer in a JSON document.
Please note that the above section shows a simplified version of the extraction prompt (see the full version on GitHub). You will find that the last part of the prompt specifies the desired format of the JSON document. This is useful in ensuring a consistent JSON schema output.
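Because the prompt asks for a JSON document, it can help to check the reply before loading it into the graph. The snippet below is a minimal sketch, not part of the original code: parse_extraction_response and REQUIRED_KEYS are hypothetical names, and the key names are taken from the sample output shown later in this post. It also assumes the model may wrap its answer in a Markdown code fence, which is a common but not guaranteed behavior.

```python
import json

# Markdown code-fence marker (three backticks), built programmatically.
FENCE = "`" * 3
# Hypothetical minimal set of keys we expect inside the "agreement" object.
REQUIRED_KEYS = {"agreement_name", "agreement_type", "parties", "clauses"}


def parse_extraction_response(response_text: str) -> dict:
    """Parse the model reply into a dict and check a few expected keys."""
    text = response_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    data = json.loads(text)  # raises ValueError on malformed JSON
    agreement = data.get("agreement") if isinstance(data, dict) else None
    if not isinstance(agreement, dict):
        raise ValueError("expected a top-level 'agreement' object")
    missing = REQUIRED_KEYS - agreement.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

A check like this fails fast on a malformed reply instead of letting an incomplete document reach the import step.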
This task is relatively simple in Python. The main() function below is designed to process a set of PDF contract files by extracting relevant legal information (extraction_prompt), using OpenAI GPT-4o and saving the results in JSON format:
import os

def main():
    pdf_files = [filename for filename in os.listdir('./data/input/') if filename.endswith('.pdf')]

    for pdf_filename in pdf_files:
        print('Processing ' + pdf_filename + '...')
        # Extract content from PDF using the assistant
        complete_response = process_pdf('./data/input/' + pdf_filename)
        # Log the complete response to debug
        save_json_string_to_file(complete_response, './data/debug/complete_response_' + pdf_filename + '.json')
The “process_pdf” function uses OpenAI GPT-4o to perform knowledge extraction from the contract with an extraction prompt:
def process_pdf(pdf_filename):
    # Create OpenAI message thread
    thread = client.beta.threads.create()
    # Upload PDF file to the thread
    file = client.files.create(file=open(pdf_filename, "rb"), purpose="assistants")
    # Create message with contract as attachment and extraction_prompt
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        attachments=[
            Attachment(
                file_id=file.id,
                tools=[AttachmentToolFileSearch(type="file_search")])
        ],
        content=extraction_prompt,
    )
    # Run the message thread
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=pdf_assistant.id, timeout=1000)
    # Retrieve messages
    messages_cursor = client.beta.threads.messages.list(thread_id=thread.id)
    messages = [message for message in messages_cursor]
    # Return last message in Thread
    return messages[0].content[0].text.value
For each contract, the message returned by “process_pdf” looks like:
{
  "agreement": {
    "agreement_name": "Marketing Affiliate Agreement",
    "agreement_type": "Marketing Affiliate Agreement",
    "effective_date": "May 8, 2014",
    "expiration_date": "December 31, 2014",
    "renewal_term": "1 year",
    "Notice_period_to_Terminate_Renewal": "30 days",
    "parties": [
      {
        "role": "Company",
        "name": "Birch First Global Investments Inc.",
        "incorporation_country": "United States Virgin Islands",
        "incorporation_state": "N/A"
      },
      {
        "role": "Marketing Affiliate",
        "name": "Mount Knowledge Holdings Inc.",
        "incorporation_country": "United States",
        "incorporation_state": "Nevada"
      }
    ],
    "governing_law": {
      "country": "United States",
      "state": "Nevada",
      "most_favored_country": "United States"
    },
    "clauses": [
      {
        "clause_type": "Competitive Restriction Exception",
        "exists": false,
        "excerpts": []
      },
      {
        "clause_type": "Exclusivity",
        "exists": true,
        "excerpts": [
          "Company hereby grants to MA the right to advertise, market and sell to corporate users, government agencies and educational facilities for their own internal purposes only, not for remarketing or redistribution."
        ]
      },
      {
        "clause_type": "Non-Disparagement",
        "exists": true,
        "excerpts": [
          "MA agrees to conduct business in a manner that reflects favorably at all times on the Technology sold and the good name, goodwill and reputation of Company."
        ]
      },
      {
        "clause_type": "Termination For Convenience",
        "exists": true,
        "excerpts": [
          "This Agreement may be terminated by either party at the expiration of its term or any renewal term upon thirty (30) days written notice to the other party."
        ]
      },
      {
        "clause_type": "Anti-Assignment",
        "exists": true,
        "excerpts": [
          "MA may not assign, sell, lease or otherwise transfer in whole or in part any of the rights granted pursuant to this Agreement without prior written approval of Company."
        ]
      },
      {
        "clause_type": "Price Restrictions",
        "exists": true,
        "excerpts": [
          "Company reserves the right to change its prices and/or fees, from time to time, in its sole and absolute discretion."
        ]
      },
      {
        "clause_type": "Minimum Commitment",
        "exists": true,
        "excerpts": [
          "MA commits to purchase a minimum of 100 Units in aggregate within the Territory within the first six months of term of this Agreement."
        ]
      },
      {
        "clause_type": "IP Ownership Assignment",
        "exists": true,
        "excerpts": [
          "Title to the Technology and all copyrights in Technology shall remain with Company and/or its Affiliates."
        ]
      },
      {
        "clause_type": "License grant",
        "exists": true,
        "excerpts": [
          "Company hereby grants to MA the right to advertise, market and sell the Technology listed in Schedule A of this Agreement."
        ]
      },
      {
        "clause_type": "Non-Transferable License",
        "exists": true,
        "excerpts": [
          "MA acknowledges that MA and its Clients receive no title to the Technology contained on the Technology."
        ]
      },
      {
        "clause_type": "Cap On Liability",
        "exists": true,
        "excerpts": [
          "In no event shall Company be liable to MA, its Clients, or any third party for any tort or contract damages or indirect, special, general, incidental or consequential damages."
        ]
      },
      {
        "clause_type": "Warranty Duration",
        "exists": true,
        "excerpts": [
          "Company's sole and exclusive liability for the warranty provided shall be to correct the Technology to operate in substantial accordance with its then current specifications."
        ]
      }
    ]
  }
}
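To get a feel for how little of each contract ends up in the graph, the extracted document can be summarized in a few lines of Python. This is an illustrative sketch only: summarize_extraction is a hypothetical helper that is not in the original code, and it assumes the JSON layout shown in the sample output (an "agreement" object with a "clauses" list whose items carry "clause_type", "exists", and "excerpts").

```python
def summarize_extraction(data: dict) -> dict:
    """Summarize one extracted contract: clause types found and excerpt count."""
    agreement = data["agreement"]
    # Keep only the clauses the LLM marked as present in the contract.
    found = [c for c in agreement.get("clauses", []) if c.get("exists")]
    return {
        "agreement_name": agreement.get("agreement_name"),
        "clause_types": [c["clause_type"] for c in found],
        # Every excerpt later becomes one Excerpt node (and one embedding).
        "excerpt_count": sum(len(c.get("excerpts", [])) for c in found),
    }
```

The excerpt count is the number of text embeddings this contract will need, which is typically far smaller than the number of chunks produced by splitting the whole document.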
Step 2. Creating a Knowledge Graph
With each contract now as a JSON file, the next step is to create a knowledge graph in Neo4j.
At this point, it’s useful to spend some time designing the data model. You need to consider some key questions:
- What do nodes and relationships in this graph represent?
- What are the main properties of each node and relationship?
- Should any properties be indexed?
- Which properties need vector embeddings to enable semantic similarity search?
In our case, a suitable design (schema) includes the main entities: agreements (contracts), their clauses, the organizations who are parties to the agreement, and the relationships among them.
A visual representation of the schema is shown below.
Image by the author.
Node properties:
Agreement {agreement_type: STRING, contract_id: INTEGER, effective_date: STRING, expiration_date: STRING, renewal_term: STRING, name: STRING}
ContractClause {name: STRING, type: STRING}
ClauseType {name: STRING}
Country {name: STRING}
Excerpt {text: STRING}
Organization {name: STRING}

Relationship properties:
IS_PARTY_TO {role: STRING}
GOVERNED_BY_LAW {state: STRING}
HAS_CLAUSE {type: STRING}
INCORPORATED_IN {state: STRING}
Only the Excerpts — the short text pieces identified by the LLM in Step 1 — require text embeddings. This approach dramatically reduces the number of vectors and the size of the vector index needed to represent each contract, making the process more efficient and scalable.
A simplified version of a Python script loading each JSON into a knowledge graph with the above schema looks like:
import json
import os

from neo4j import GraphDatabase

NEO4J_URI = os.getenv('NEO4J_URI', 'bolt://localhost:7687')
NEO4J_USER = os.getenv('NEO4J_USERNAME', 'neo4j')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
JSON_CONTRACT_FOLDER = './data/output/'

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

contract_id = 1

json_contracts = [filename for filename in os.listdir(JSON_CONTRACT_FOLDER) if filename.endswith('.json')]
for json_contract in json_contracts:
    with open(JSON_CONTRACT_FOLDER + json_contract, 'r') as file:
        json_string = file.read()
    json_data = json.loads(json_string)
    # Assign a sequential id to each contract before loading it
    agreement = json_data['agreement']
    agreement['contract_id'] = contract_id
    driver.execute_query(CREATE_GRAPH_STATEMENT, data=json_data)
    contract_id += 1

create_full_text_indices(driver)
driver.execute_query(CREATE_VECTOR_INDEX_STATEMENT)
print("Generating Embeddings for Contract Excerpts...")
driver.execute_query(EMBEDDINGS_STATEMENT, token=OPENAI_API_KEY)
Here, the “CREATE_GRAPH_STATEMENT” is the only complex piece. It is a Cypher statement that maps the contract JSON into the nodes and relationships in the knowledge graph.
The full Cypher statement:
CREATE_GRAPH_STATEMENT = """
WITH $data AS data
WITH data.agreement as a

MERGE (agreement:Agreement {contract_id: a.contract_id})
ON CREATE SET
  agreement.contract_id = a.contract_id,
  agreement.name = a.agreement_name,
  agreement.effective_date = a.effective_date,
  agreement.expiration_date = a.expiration_date,
  agreement.agreement_type = a.agreement_type,
  agreement.renewal_term = a.renewal_term,
  agreement.most_favored_country = a.governing_law.most_favored_country
  //agreement.Notice_period_to_Terminate_Renewal = a.Notice_period_to_Terminate_Renewal

MERGE (gl_country:Country {name: a.governing_law.country})
MERGE (agreement)-[gbl:GOVERNED_BY_LAW]->(gl_country)
SET gbl.state = a.governing_law.state

FOREACH (party IN a.parties |
  // todo proper global id for the party
  MERGE (p:Organization {name: party.name})
  MERGE (p)-[ipt:IS_PARTY_TO]->(agreement)
  SET ipt.role = party.role
  MERGE (country_of_incorporation:Country {name: party.incorporation_country})
  MERGE (p)-[incorporated:INCORPORATED_IN]->(country_of_incorporation)
  SET incorporated.state = party.incorporation_state
)

WITH a, agreement, [clause IN a.clauses WHERE clause.exists = true] AS valid_clauses
FOREACH (clause IN valid_clauses |
  CREATE (cl:ContractClause {type: clause.clause_type})
  MERGE (agreement)-[clt:HAS_CLAUSE]->(cl)
  SET clt.type = clause.clause_type
  // ON CREATE SET c.excerpts = clause.excerpts
  FOREACH (excerpt IN clause.excerpts |
    MERGE (cl)-[:HAS_EXCERPT]->(e:Excerpt {text: excerpt})
  )
  //link clauses to a Clause Type label
  MERGE (clType:ClauseType{name: clause.clause_type})
  MERGE (cl)-[:HAS_TYPE]->(clType)
)"""
Here’s a breakdown of what the statement does.
Data Binding:
WITH $data AS data
WITH data.agreement as a

- $data is the input data being passed into the query in JSON format. It contains information about an agreement (contract).
- The second line assigns data.agreement to the alias a, so the contract details can be referenced in the subsequent query.
Upsert the Agreement Node:
MERGE (agreement:Agreement {contract_id: a.contract_id})
ON CREATE SET
  agreement.name = a.agreement_name,
  agreement.effective_date = a.effective_date,
  agreement.expiration_date = a.expiration_date,
  agreement.agreement_type = a.agreement_type,
  agreement.renewal_term = a.renewal_term,
  agreement.most_favored_country = a.governing_law.most_favored_country

- MERGE attempts to find an existing Agreement node with the specified contract_id. If no such node exists, it creates one.
- The ON CREATE SET clause sets various properties on the newly created Agreement node, such as contract_id, agreement_name, effective_date, and other agreement-related fields from the JSON input.
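For readers new to Cypher, the MERGE plus ON CREATE SET behavior can be mimicked with a plain Python dictionary. This is only an in-memory analogy, not how Neo4j implements it: merge_node and the store layout are hypothetical. The point it illustrates is that the extra properties are written only when the node is first created, so re-running the import with the same contract_id leaves the existing node's properties untouched.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical in-memory "graph": one dict entry per (label, key, value) identity.
NodeStore = Dict[Tuple[str, str, Any], Dict[str, Any]]


def merge_node(
    store: NodeStore,
    label: str,
    key: str,
    value: Any,
    on_create: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], bool]:
    """Return (node, created). Mimics MERGE ... ON CREATE SET for a single node."""
    identity = (label, key, value)
    if identity in store:
        # Match found: like MERGE, reuse the node and skip the ON CREATE part.
        return store[identity], False
    # No match: create the node and apply the ON CREATE properties.
    node = {key: value, **(on_create or {})}
    store[identity] = node
    return node, True
```

Calling merge_node twice with the same label and contract_id returns the same node both times, and only the first call applies the on_create properties.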
Create Governing Law Relationship:
MERGE (gl_country:Country {name: a.governing_law.country})
MERGE (agreement)-[gbl:GOVERNED_BY_LAW]->(gl_country)
SET gbl.state = a.governing_law.state

- This creates or merges a Country node for the governing law country associated with the agreement.
- Then it creates or merges a relationship GOVERNED_BY_LAW between the Agreement and Country.
- It also sets the state property of the GOVERNED_BY_LAW relationship.
Create Party and Incorporation Relationships:
FOREACH (party IN a.parties |
  MERGE (p:Organization {name: party.name})
  MERGE (p)-[ipt:IS_PARTY_TO]->(agreement)
  SET ipt.role = party.role
  MERGE (country_of_incorporation:Country {name: party.incorporation_country})
  MERGE (p)-[incorporated:INCORPORATED_IN]->(country_of_incorporation)
  SET incorporated.state = party.incorporation_state
)

For each party in the contract (a.parties), it:
- Upserts (merges) an Organization node for the party.
- Creates an IS_PARTY_TO relationship between the Organization and the Agreement, setting the role of the party (e.g., buyer, seller).
- Merges a Country node for the country in which the organization is incorporated.
- Creates an INCORPORATED_IN relationship between the organization and the incorporation country and sets the state where the organization is incorporated.
Create Contract Clauses and Excerpts:
WITH a, agreement, [clause IN a.clauses WHERE clause.exists = true] AS valid_clauses
FOREACH (clause IN valid_clauses |
  CREATE (cl:ContractClause {type: clause.clause_type})
  MERGE (agreement)-[clt:HAS_CLAUSE]->(cl)
  SET clt.type = clause.clause_type
  FOREACH (excerpt IN clause.excerpts |
    MERGE (cl)-[:HAS_EXCERPT]->(e:Excerpt {text: excerpt})
  )
  MERGE (clType:ClauseType{name: clause.clause_type})
  MERGE (cl)-[:HAS_TYPE]->(clType)
)

- This part first filters the list of clauses (a.clauses) to include only those where clause.exists = true (i.e., clauses with excerpts identified by the LLM in Step 1).
For each clause:
- A ContractClause node is created with a type corresponding to the clause type.
- A HAS_CLAUSE relationship is established between the Agreement and the ContractClause.
- For each excerpt associated with the clause, it creates an Excerpt node and links it to the ContractClause using a HAS_EXCERPT relationship.
- A ClauseType node is created (or merged) for the type of the clause, and the ContractClause is linked to the ClauseType using a HAS_TYPE relationship.
Once the import script runs, a single contract can be visualized in Neo4j as a knowledge graph.
A knowledge graph representation of a single contract: parties (organizations) in green, contract clauses in blue, excerpts in light brown, countries in orange. Image by the Author.
The three contracts in the knowledge graph required only a small graph (fewer than 100 nodes and 200 relationships). Most importantly, only 40–50 vector embeddings for the excerpts are needed. This knowledge graph with a small number of vectors can now be used to power a reasonably capable Q&A agent.
Step 3. Developing Data Retrieval Functions for GraphRAG
With the contracts now structured in a knowledge graph, the next step involves creating a small set of graph data retrieval functions. These functions serve as the core building blocks, allowing us to develop a Q&A agent in step 4.
Let’s define a few basic data retrieval functions:
- Retrieve basic details about a contract (given a contract ID).
- Find contracts involving a specific organization (given a partial organization name).
- Find contracts that do not contain a particular clause type.
- Find contracts that contain a specific type of clause.
- Find contracts based on the semantic similarity with the text (excerpt) in a clause (e.g., contracts mentioning the use of “prohibited items”).
- Run a natural language query against all contracts in the database – for example, an aggregation query that counts “how many contracts meet certain conditions.”
In step 4, we will build a Q&A agent using the Microsoft Semantic Kernel library. This library simplifies the agent building process. It allows developers to define the functions and tools that an agent will have at its disposal to answer a question.
In order to simplify the integration between Neo4j and the Semantic Kernel library, let’s define a ContractPlugin that defines the “signature” of each of our data retrieval functions. Note the @kernel_function decorator for each of the functions and also the type information and description provided for each function.
Semantic Kernel uses the concept of a “Plugin” class to encapsulate a group of functions available to an agent. It will use the decorated functions, type information, and documentation to inform the LLM function calling capabilities about functions available:
from typing import List, Optional, Annotated
from AgreementSchema import Agreement, ClauseType
from semantic_kernel.functions import kernel_function
from ContractService import ContractSearchService

class ContractPlugin:

    def __init__(self, contract_search_service: ContractSearchService):
        self.contract_search_service = contract_search_service

    @kernel_function
    async def get_contract(self, contract_id: int) -> Annotated[Agreement, "A contract"]:
        """Gets details about a contract with the given id."""
        return await self.contract_search_service.get_contract(contract_id)

    @kernel_function
    async def get_contracts(self, organization_name: str) -> Annotated[List[Agreement], "A list of contracts"]:
        """Gets basic details about all contracts where one of the parties has a name similar to the given organization name."""
        return await self.contract_search_service.get_contracts(organization_name)

    @kernel_function
    async def get_contracts_without_clause(self, clause_type: ClauseType) -> Annotated[List[Agreement], "A list of contracts"]:
        """Gets basic details from contracts without a clause of the given type."""
        return await self.contract_search_service.get_contracts_without_clause(clause_type=clause_type)

    @kernel_function
    async def get_contracts_with_clause_type(self, clause_type: ClauseType) -> Annotated[List[Agreement], "A list of contracts"]:
        """Gets basic details from contracts with a clause of the given type."""
        return await self.contract_search_service.get_contracts_with_clause_type(clause_type=clause_type)

    @kernel_function
    async def get_contracts_similar_text(self, clause_text: str) -> Annotated[List[Agreement], "A list of contracts with similar text in one of their clauses"]:
        """Gets basic details from contracts having semantically similar text in one of their clauses to the 'clause_text' provided."""
        return await self.contract_search_service.get_contracts_similar_text(clause_text=clause_text)

    @kernel_function
    async def answer_aggregation_question(self, user_question: str) -> Annotated[str, "An answer to user_question"]:
        """Answer obtained by turning user_question into a CYPHER query"""
        return await self.contract_search_service.answer_aggregation_question(user_question=user_question)
I would recommend exploring the ContractService class that contains the implementations of each of the above functions. Each function exercises a different data retrieval technique.
Let’s walk through the implementation of some of these functions as they showcase different GraphRAG data retrieval techniques and patterns.
Get Contract (From Contract ID) — A Cypher-Based Retrieval Function
get_contract(self, contract_id: int) is an asynchronous method designed to retrieve details about a specific contract (Agreement) from a Neo4j database using a Cypher query. The function returns an Agreement object populated with information about the agreement, clauses, parties, and their relationships.
Here’s the implementation of this function:
async def get_contract(self, contract_id: int) -> Agreement:

    GET_CONTRACT_BY_ID_QUERY = """
        MATCH (a:Agreement {contract_id: $contract_id})-[:HAS_CLAUSE]->(clause:ContractClause)
        WITH a, collect(clause) as clauses
        MATCH (country:Country)-[i:INCORPORATED_IN]-(p:Organization)-[r:IS_PARTY_TO]-(a)
        WITH a, clauses, collect(p) as parties, collect(country) as countries, collect(r) as roles, collect(i) as states
        RETURN a as agreement, clauses, parties, countries, roles, states
    """

    # Defaults used when no matching contract is found
    agreement_node = {}
    party_list, role_list, country_list, state_list, clause_list = [], [], [], [], []

    records, _, _ = self._driver.execute_query(GET_CONTRACT_BY_ID_QUERY, {'contract_id': contract_id})

    if len(records) == 1:
        agreement_node = records[0].get('agreement')
        party_list = records[0].get('parties')
        role_list = records[0].get('roles')
        country_list = records[0].get('countries')
        state_list = records[0].get('states')
        clause_list = records[0].get('clauses')

    return await self._get_agreement(
        agreement_node, format="long",
        party_list=party_list, role_list=role_list,
        country_list=country_list, state_list=state_list,
        clause_list=clause_list
    )
The most important component is the Cypher query in GET_CONTRACT_BY_ID_QUERY. This query is executed using the contract_id supplied as an input parameter. The output is the matching agreement, its clauses, and the parties involved (each party has a role and a country/state of incorporation).
The data is then passed to a _get_agreement utility function, which simply maps the data to an “Agreement.” Agreement is a TypedDict defined as:
class Agreement(TypedDict):
    contract_id: int
    agreement_name: str
    agreement_type: str
    effective_date: str
    expiration_date: str
    renewal_term: str
    notice_period_to_terminate_Renewal: str
    parties: List[Party]
    clauses: List[ContractClause]
Get Contracts Without a Clause Type — Another Cypher Retrieval Function
This function illustrates a powerful feature of a knowledge graph, which is to test for the absence of a relationship.
The get_contracts_without_clause() function retrieves all contracts (Agreements) from the Neo4j database that do not contain a specific type of clause. The function takes a ClauseType as input and returns a list of Agreement objects that match the condition.
This type of data retrieval can’t be easily implemented with vector search. The full implementation follows:
async def get_contracts_without_clause(self, clause_type: ClauseType) -> List[Agreement]:

    GET_CONTRACT_WITHOUT_CLAUSE_TYPE_QUERY = """
        MATCH (a:Agreement)
        OPTIONAL MATCH (a)-[:HAS_CLAUSE]->(cc:ContractClause {type: $clause_type})
        WITH a, cc
        WHERE cc is NULL
        WITH a
        MATCH (country:Country)-[i:INCORPORATED_IN]-(p:Organization)-[r:IS_PARTY_TO]-(a)
        RETURN a as agreement, collect(p) as parties, collect(r) as roles, collect(country) as countries, collect(i) as states
    """

    # run the Cypher query
    records, _, _ = self._driver.execute_query(GET_CONTRACT_WITHOUT_CLAUSE_TYPE_QUERY, {'clause_type': clause_type.value})

    all_agreements = []
    for row in records:
        agreement_node = row['agreement']
        party_list = row['parties']
        role_list = row['roles']
        country_list = row['countries']
        state_list = row['states']
        agreement: Agreement = await self._get_agreement(
            format="short",
            agreement_node=agreement_node,
            party_list=party_list,
            role_list=role_list,
            country_list=country_list,
            state_list=state_list
        )
        all_agreements.append(agreement)

    return all_agreements
Once again, the format is similar to the previous function. A Cypher query, GET_CONTRACT_WITHOUT_CLAUSE_TYPE_QUERY, defines the nodes and relationship patterns to be matched. It uses an optional match to keep only the contracts that do not contain a clause of the given type, and then collects related data about each agreement, such as the involved parties and their details.
The function then constructs and returns a list of Agreement objects, which encapsulate all the relevant information for each matching agreement.
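The "absence" test is easier to see outside of Cypher. The sketch below is a plain-Python analogy under stated assumptions: contracts_without_clause and the clauses_by_contract mapping are hypothetical and stand in for the graph. It mirrors the OPTIONAL MATCH followed by a NULL check: a contract qualifies when the requested clause type is missing, including when the contract has no clauses at all.

```python
from typing import Dict, List, Set


def contracts_without_clause(
    clauses_by_contract: Dict[int, Set[str]], clause_type: str
) -> List[int]:
    """Return the ids of contracts that have no clause of the given type."""
    return sorted(
        contract_id
        for contract_id, clause_types in clauses_by_contract.items()
        # Same idea as OPTIONAL MATCH ... WHERE cc IS NULL: keep the contract
        # only when no clause of this type is attached to it.
        if clause_type not in clause_types
    )


# Toy data: contract id -> clause types found in that contract.
example = {
    1: {"Exclusivity", "Anti-Assignment"},
    2: {"Cap On Liability"},
    3: set(),
}
print(contracts_without_clause(example, "Exclusivity"))
```

A similarity search has nothing to match against when the clause is missing, which is why this kind of question is a natural fit for a structured query instead.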
Get Contract With Semantically Similar Text — A Vector Search + Graph Data Retrieval Function
The get_contracts_similar_text() function is designed to find agreements (contracts) that contain clauses with text similar to a provided clause_text. It uses semantic vector search to identify related excerpts and traverses the graph to return information about the corresponding agreements and clauses and where those excerpts came from.
This function leverages a vector index defined on the “text” property of each excerpt. It uses the recently released Neo4j GraphRAG package to simplify the Cypher code needed to run semantic search + graph traversal code:
async def get_contracts_similar_text(self, clause_text: str) -> List[Agreement]:

    # Cypher to traverse from the semantically similar excerpts back to the agreement
    EXCERPT_TO_AGREEMENT_TRAVERSAL_QUERY = """
        MATCH (a:Agreement)-[:HAS_CLAUSE]->(cc:ContractClause)-[:HAS_EXCERPT]-(node)
        RETURN a.name as agreement_name, a.contract_id as contract_id, cc.type as clause_type, node.text as excerpt
    """

    # Set up vector Cypher retriever
    retriever = VectorCypherRetriever(
        driver=self._driver,
        index_name="excerpt_embedding",
        embedder=self._openai_embedder,
        retrieval_query=EXCERPT_TO_AGREEMENT_TRAVERSAL_QUERY,
        result_formatter=my_vector_search_excerpt_record_formatter
    )

    # Run vector search query on excerpts and get results containing the relevant agreement and clause
    retriever_result = retriever.search(query_text=clause_text, top_k=3)

    # Set up List of Agreements (with partial data) to be returned
    agreements = []
    for item in retriever_result.items:
        # extract information from returned items and append agreement to results
        # full code not shown here but available on the Github repo
        ...

    return agreements
Let’s go over the main components of this data retrieval function:
- The Neo4j GraphRAG VectorCypherRetriever allows a developer to perform semantic similarity search on a vector index. In our case, for each semantically similar Excerpt “node” found, an additional Cypher expression is used to fetch additional nodes in the graph related to the node.
- The parameters of the VectorCypherRetriever are straightforward. The index_name is the vector index on which to run semantic similarity. The embedder generates a vector embedding for a piece of text. The driver is just an instance of a Neo4j Python driver. The retrieval_query specifies the additional nodes and relationships connected with every “Excerpt” node identified by semantic similarity.
- The EXCERPT_TO_AGREEMENT_TRAVERSAL_QUERY specifies the additional nodes to be retrieved. In this case, for every excerpt, we’re retrieving its related contract clause and corresponding agreement.
EXCERPT_TO_AGREEMENT_TRAVERSAL_QUERY = """
    MATCH (a:Agreement)-[:HAS_CLAUSE]->(cc:ContractClause)-[:HAS_EXCERPT]-(node)
    RETURN a.name as agreement_name, a.contract_id as contract_id, cc.type as clause_type, node.text as excerpt
"""
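To make the "vector search, then traverse" pattern concrete, here is a toy in-memory version. It is a sketch under stated assumptions and not what Neo4j or the GraphRAG package does internally: cosine_similarity and similar_excerpts are hypothetical helpers, each excerpt is a plain dict that already carries its contract_id and clause_type (standing in for the graph traversal), and a real vector index would use an approximate nearest-neighbor search over real embeddings instead of scanning a list.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_excerpts(
    query_vector: Sequence[float],
    excerpts: List[Dict[str, Any]],
    top_k: int = 3,
) -> List[Dict[str, Any]]:
    """Rank excerpts by similarity, then report the agreement each one belongs to."""
    scored = [(cosine_similarity(query_vector, e["embedding"]), e) for e in excerpts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            # Fields shaped like the RETURN clause of the traversal query.
            "contract_id": e["contract_id"],
            "clause_type": e["clause_type"],
            "excerpt": e["text"],
            "score": score,
        }
        for score, e in scored[:top_k]
    ]
```

Because only excerpts are embedded, the list being ranked stays small, and every hit can be tied back to a specific clause and agreement.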
Run a Natural Language Query — A Text2Cypher Data Retrieval Function
The answer_aggregation_question() function leverages the Neo4j GraphRAG package's Text2CypherRetriever to answer a question in natural language. The Text2CypherRetriever uses an LLM to turn the user question into a Cypher query and runs it against the Neo4j database.
The function leverages GPT-4o to generate the required Cypher query. Let’s walk through the main components of this data retrieval function:
async def answer_aggregation_question(self, user_question) -> str:
    answer = ""

    NEO4J_SCHEMA = """
        omitted for brevity (see below for the full value)
    """

    # Initialize the retriever
    retriever = Text2CypherRetriever(
        driver=self._driver,
        llm=self._llm,
        neo4j_schema=NEO4J_SCHEMA
    )

    # Generate a Cypher query using the LLM, send it to the Neo4j database, and return the results
    retriever_result = retriever.search(query_text=user_question)

    for item in retriever_result.items:
        content = str(item.content)
        if content:
            answer += content + '\n\n'

    return answer
The Text2CypherRetriever does the heavy lifting here: it uses an LLM (in this case, an OpenAI model) to turn the user's natural-language question into a Cypher query, executes that query against the database, and returns the result.
A key element to ensure that the LLM generates a query that uses the nodes, relationships, and properties defined in the database is to provide the LLM with a text description of the schema.
In our case, the following representation of the data model is sufficient:
NEO4J_SCHEMA = """
Node properties:
Agreement {agreement_type: STRING, contract_id: INTEGER, effective_date: STRING, renewal_term: STRING, name: STRING}
ContractClause {name: STRING, type: STRING}
ClauseType {name: STRING}
Country {name: STRING}
Excerpt {text: STRING}
Organization {name: STRING}

Relationship properties:
IS_PARTY_TO {role: STRING}
GOVERNED_BY_LAW {state: STRING}
HAS_CLAUSE {type: STRING}
INCORPORATED_IN {state: STRING}

The relationships:
(:Agreement)-[:HAS_CLAUSE]->(:ContractClause)
(:ContractClause)-[:HAS_EXCERPT]->(:Excerpt)
(:ContractClause)-[:HAS_TYPE]->(:ClauseType)
(:Agreement)-[:GOVERNED_BY_LAW]->(:Country)
(:Organization)-[:IS_PARTY_TO]->(:Agreement)
(:Organization)-[:INCORPORATED_IN]->(:Country)
"""
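Because the LLM only knows the graph through this text description, it can be useful to check a generated query against it before running it. The sketch below is one possible sanity check, not part of the Neo4j GraphRAG package: the helper `unknown_names` and its regular expressions are assumptions made for illustration, and they only handle simple patterns such as `(a:Agreement)` and `[:HAS_CLAUSE]`, not multi-label nodes or every Cypher construct.

```python
import re

# Relationship section of the schema above, repeated so the example is self-contained.
SCHEMA_RELATIONSHIPS = """
(:Agreement)-[:HAS_CLAUSE]->(:ContractClause)
(:ContractClause)-[:HAS_EXCERPT]->(:Excerpt)
(:ContractClause)-[:HAS_TYPE]->(:ClauseType)
(:Agreement)-[:GOVERNED_BY_LAW]->(:Country)
(:Organization)-[:IS_PARTY_TO]->(:Agreement)
(:Organization)-[:INCORPORATED_IN]->(:Country)
"""

# Deliberately simple patterns: an optional variable name, a colon, then one name.
LABEL_PATTERN = re.compile(r"\(\s*\w*\s*:(\w+)")  # (a:Agreement) or (:Agreement)
REL_PATTERN = re.compile(r"\[\s*\w*\s*:(\w+)")    # [r:HAS_CLAUSE] or [:HAS_CLAUSE]


def unknown_names(cypher, schema=SCHEMA_RELATIONSHIPS):
    """Return labels and relationship types used in `cypher` but absent from `schema`."""
    known_labels = set(LABEL_PATTERN.findall(schema))
    known_rels = set(REL_PATTERN.findall(schema))
    used_labels = set(LABEL_PATTERN.findall(cypher))
    used_rels = set(REL_PATTERN.findall(cypher))
    return sorted((used_labels - known_labels) | (used_rels - known_rels))


good = "MATCH (a:Agreement)-[:HAS_CLAUSE]->(cc:ContractClause) RETURN count(a)"
bad = "MATCH (c:Contract)-[:CONTAINS]->(x:Excerpt) RETURN c"
print(unknown_names(good))
print(unknown_names(bad))
```

A non-empty result suggests the model invented a label or relationship, in which case the query could be rejected or regenerated instead of being sent to the database.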
Step 4. Building a Q&A Agent
Armed with our knowledge graph data retrieval functions, we’re ready to build an agent grounded by GraphRAG.
Let’s set up a chatbot agent capable of answering user queries about contracts using a combination of GPT-4o, our data retrieval functions, and a Neo4j-powered knowledge graph.
We will use Microsoft Semantic Kernel, a framework that allows developers to integrate LLM function calling with existing APIs and data retrieval functions.
The framework uses a concept called Plugins to represent specific functionality the kernel can perform. In our case, all of our data retrieval functions defined in the “ContractPlugin” can be used by the LLM to answer the question.
The framework uses the concept of Memory to keep all interactions between user and agent, as well as functions executed and data retrieved.
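The following toy sketch illustrates the two ideas in plain Python: a plugin's functions are registered under a qualified name, and every function result is appended to a history the agent can draw on later. It does not use Semantic Kernel and does not mirror its API; `ToyKernel`, `ToyContractPlugin`, and `get_contract` are hypothetical names, and the returned contract data is canned.

```python
import json


class ToyKernel:
    """Minimal stand-in for a function-calling kernel (not Semantic Kernel)."""

    def __init__(self):
        self._functions = {}  # "plugin-function" -> callable
        self.history = []     # chronological record of function results

    def add_plugin(self, plugin, plugin_name):
        # Register every public method of the plugin as "plugin_name-method".
        for name in dir(plugin):
            attr = getattr(plugin, name)
            if callable(attr) and not name.startswith("_"):
                self._functions[f"{plugin_name}-{name}"] = attr

    def call(self, qualified_name, json_args):
        # Unpack the JSON arguments an LLM would supply and invoke the function.
        result = self._functions[qualified_name](**json.loads(json_args))
        # Keep the raw result, even if the user never sees all of it.
        self.history.append({"role": "tool", "name": qualified_name, "content": result})
        return result


class ToyContractPlugin:
    """Hypothetical plugin exposing one canned retrieval function."""

    def get_contract(self, contract_id):
        return {"contract_id": contract_id, "name": "Marketing Affiliate Agreement"}


kernel = ToyKernel()
kernel.add_plugin(ToyContractPlugin(), plugin_name="contract_search")
result = kernel.call("contract_search-get_contract", '{"contract_id": 1}')
print(result["name"])
print(kernel.history[0]["name"])
```

In the real framework the LLM chooses which registered function to call and with which arguments; the sketch hard-codes that choice to keep the mechanics visible.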
An extremely simple terminal-based agent can be implemented with a few lines of code. The snippet below shows the main parts of the agent (imports and environment vars removed):
logging.basicConfig(level=logging.INFO)

# Initialize the kernel
kernel = Kernel()

# Add the Contract Search plugin to the kernel
contract_search_neo4j = ContractSearchService(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD)
kernel.add_plugin(ContractPlugin(contract_search_service=contract_search_neo4j), plugin_name="contract_search")

# Add the OpenAI chat completion service to the Kernel
kernel.add_service(OpenAIChatCompletion(ai_model_id="gpt-4o", api_key=OPENAI_KEY, service_id=service_id))

# Enable automatic function calling
settings: OpenAIChatPromptExecutionSettings = kernel.get_prompt_execution_settings_from_service_id(service_id=service_id)
settings.function_choice_behavior = FunctionChoiceBehavior.Auto(filters={"included_plugins": ["contract_search"]})

# Create a history of the conversation
history = ChatHistory()

async def basic_agent():
    userInput = None
    while True:
        # Collect user input
        userInput = input("User > ")

        # Terminate the loop if the user says "exit"
        if userInput == "exit":
            break

        # Add user input to the history
        history.add_user_message(userInput)

        # Get the response from the AI with automatic function calling
        chat_completion: OpenAIChatCompletion = kernel.get_service(type=ChatCompletionClientBase)
        result = (await chat_completion.get_chat_message_contents(
            chat_history=history,
            settings=settings,
            kernel=kernel,
            arguments=KernelArguments(),
        ))[0]

        # Print the results
        print("Assistant > " + str(result))

        # Add the message from the agent to the chat history
        history.add_message(result)

if __name__ == "__main__":
    asyncio.run(basic_agent())
Let’s ask our agent some non-trivial questions.
1. Find me contracts with price restrictions but no insurance clauses:
Assistant > I found the following contract that contains a Price Restrictions clause but does not contain an Insurance clause:

- **Name:** Birch First Global Investments Inc. and Mount Knowledge Holdings Inc. MARKETING AFFILIATE AGREEMENT
- **Agreement Type:** MARKETING AFFILIATE AGREEMENT
- **Parties:**
  - **MOUNT KNOWLEDGE HOLDINGS INC.** (Marketing Affiliate), incorporated in Nevada, USA
  - **BIRCH FIRST GLOBAL INVESTMENTS INC.** (Company), incorporated in the Virgin Islands (VIR)

Would you like to know more details about this contract?
Here, the Semantic Kernel planner has reviewed the available functions in the kernel and determined that answering this question requires invoking two functions that, on their own, won’t provide the complete answer.
First, it executes the “Get Contracts with [Price Restriction] Clauses” function, followed by the “Get Contracts WITHOUT [Insurance] Clauses” function. The planner then combines the results from both functions to identify the intersection, ultimately arriving at the correct answer.
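The combination step amounts to an intersection on contract identity. In the agent, the LLM performs it implicitly from the two function results; the sketch below only spells out the equivalent logic, using hypothetical result lists and a hypothetical helper `intersect_by_id`.

```python
# Hypothetical results of the two retrieval calls. Names and IDs are made up.
with_price_restrictions = [
    {"contract_id": 1, "name": "Marketing Affiliate Agreement"},
    {"contract_id": 3, "name": "Distribution Agreement"},
]
without_insurance = [
    {"contract_id": 1, "name": "Marketing Affiliate Agreement"},
    {"contract_id": 2, "name": "Hosting Agreement"},
]


def intersect_by_id(first, second):
    """Keep the contracts from `first` whose contract_id also appears in `second`."""
    ids_in_second = {c["contract_id"] for c in second}
    return [c for c in first if c["contract_id"] in ids_in_second]


matches = intersect_by_id(with_price_restrictions, without_insurance)
print([c["name"] for c in matches])
```

If this kind of question were frequent, the same filter could instead be pushed into a single Cypher query exposed as its own retrieval function, so the agent would not need to combine two result sets.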
If you look at the debugging information, you can confirm the sequence of calls performed by the LLM, along with the number of tokens consumed by each call. Notice the calls to the two data retrieval functions.
2. Yes, please. Provide more details:
Assistant > Here are the detailed information about the contract:

### Contract Details:
- **Name:** Birch First Global Investments Inc. and Mount Knowledge Holdings Inc. MARKETING AFFILIATE AGREEMENT
- **Agreement Type:** MARKETING AFFILIATE AGREEMENT
- **Expiration Date:** December 31, 2014
- **Renewal Term:** One (1) year

### Parties Involved:
1. **MOUNT KNOWLEDGE HOLDINGS INC.**
   - **Role:** Marketing Affiliate
   - **Incorporation Country:** USA
   - **Incorporation State:** Nevada
2. **BIRCH FIRST GLOBAL INVESTMENTS INC.**
   - **Role:** Company
   - **Incorporation Country:** Virgin Islands (VIR)

### Clauses Included:
- No-Solicit Of Employees
- Third Party Beneficiary
- Price Restrictions
- Affiliate License-Licensee
- Affiliate License-Licensor
- Cap On Liability
- Audit Rights
- Warranty Duration
- Post-Termination Services
- Non-Transferable License
- License grant
- IP Ownership Assignment
- Minimum Commitment
- Anti-Assignment
- Non-Disparagement

### Clauses Not Included:
- Insurance

Is there anything specific you would like to know or any particular clause details you are interested in?
In this case, behind the scenes, the LLM used the “Find Contract by ID” function. You may have noticed that the output shown to the user did not include the ID of the contract. However, Semantic Kernel stored all the information returned by the functions in previous calls in the conversation memory!
This powerful combination of having the agent decide which function(s) to call, the use of memory, and the automatic marshaling of user queries into function calls with the right parameters explains the magic behind an agent built with Semantic Kernel.
3. Does any contract mention the ability to make price changes from time to time?
Assistant > Yes, there is a contract that mentions the ability to make price changes from time to time. Here is the relevant excerpt:

1. **Birch First Global Investments Inc. and Mount Knowledge Holdings Inc. MARKETING AFFILIATE AGREEMENT**
   - **Contract ID:** 1
   - **Clause Type:** Price Restrictions
   - **Excerpt:**
     ```
     Company reserves the right to change its process and/or fees, from time to time, in its sole and absolute discretion. In the event of a price and/or fee change for Technology, Company shall notify MA in writing ninety (90) days prior to the effective date of any such change. All other prices and/or fees may be adjusted without prior notice to MA.
     ```

Would you like more information about this contract or any other details?
To answer this question, Semantic Kernel executes the get_contracts_similar_text data retrieval function.
User > any contract mentions the ability to make price changes from time to time
INFO:semantic_kernel.connectors.ai.open_ai.services.open_ai_chat_completion_base:processing 1 tool calls in parallel.
INFO:semantic_kernel.kernel:Calling contract_search-get_contracts_similar_text function with args: {"clause_text":"price changes from time to time"}
Assistant > Yes, there is a contract that mentions the ability to make price changes from time to time. Here is the relevant excerpt:

1. **Birch First Global Investments Inc. and Mount Knowledge Holdings Inc. MARKETING AFFILIATE AGREEMENT**
   - **Contract ID:** 1
   - **Clause Type:** Price Restrictions
   - **Excerpt:**
     ```
     Company reserves the right to change its process and/or fees, from time to time, in its sole and absolute discretion. In the event of a price and/or fee change for Technology, Company shall notify MA in writing ninety (90) days prior to the effective date of any such change. All other prices and/or fees may be adjusted without prior notice to MA.
     ```
As explained earlier, this function uses vector search and a Cypher query to retrieve not just semantically similar excerpts but also the related contract and clauses that include them.
Finally, let’s ask how many contracts there are in the database.
4. How many contracts are there in the database?
This question requires the LLM to use the Text2Cypher data retriever:
Assistant > There are a total of 3 contracts in the database.
The debug information confirms that Semantic Kernel ran the Text2Cypher data retrieval function with “user_question” = “How many contracts are there in the database?”
User > how many contracts are there on the database?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:semantic_kernel.connectors.ai.open_ai.services.open_ai_chat_completion_base:processing 1 tool calls in parallel.
INFO:semantic_kernel.kernel:Calling contract_search-answer_aggregation_question function with args: {"user_question":"How many contracts are there in the database?"}
INFO:semantic_kernel.functions.kernel_function:Function completed. Duration: 0.588805s
INFO:semantic_kernel.connectors.ai.open_ai.services.open_ai_handler:OpenAI usage: CompletionUsage(completion_tokens=13, prompt_tokens=3328, total_tokens=3341, completion_tokens_details={'reasoning_tokens': 0})
Assistant > There are a total of 3 contracts in the database.
Try It Yourself
The GitHub repo contains a Streamlit app that provides a more elegant agent UI. You’re encouraged to interact with the agent and make changes to the ContractPlugin so your agent’s able to handle more questions.
Conclusion
In this blog post, we explored a GraphRAG approach that turns the labor-intensive task of commercial contract review into a more efficient, AI-driven process.
By focusing on targeted information extraction using LLMs and prompts, building a structured knowledge graph with Neo4j, implementing simple data retrieval functions, and ultimately developing a Q&A agent, we created an intelligent solution that handles complex questions effectively.
This approach minimizes inefficiencies found in traditional vector search-based RAG, focusing instead on extracting only relevant information, reducing the need for unnecessary vector embeddings, and simplifying the overall process. We hope this journey from contract ingestion to an interactive Q&A agent inspires you to leverage GraphRAG in your own projects for improved efficiency and smarter AI-driven decision-making.
Start building your own commercial contract review agent today and experience the power of GraphRAG firsthand!