Build a Knowledge Graph from Documents
- What You’ll Learn
- Prerequisites
- Time Required
- What We’re Building
- Step 1: Project Setup
- Step 2: Prepare Sample Documents
- Step 3: Configure Domain-Specific Extraction
- Step 4: Query the Knowledge Graph
- Step 5: Visualize in Neo4j Browser
- Step 6: Use the Knowledge Graph with an Agent
- What You’ve Built
- Extending the Knowledge Graph
- Next Steps
- See Also
Extract entities and relationships from documents to build a queryable knowledge graph.
In this tutorial, we’ll process a collection of documents to automatically extract entities, discover relationships, and build a knowledge graph that agents can query. We’ll use a financial services example, but the same approach works for any domain.
What You’ll Learn
- How to configure domain-specific entity extraction
- How to process documents in batch
- How to build relationships between entities
- How to query the knowledge graph
- How to visualize the extracted knowledge
Prerequisites
- Completed Build Your First Memory-Enabled Agent tutorial
- Neo4j running (Docker or Aura)
- Basic understanding of entity extraction
Time Required
Approximately 45 minutes.
What We’re Building
A knowledge graph that:
- Extracts entities (companies, people, securities) from financial documents
- Discovers relationships (works at, invested in, located in)
- Enables semantic queries across the knowledge
- Powers intelligent agent responses
Step 1: Project Setup
Create a new project:
mkdir knowledge-graph-demo
cd knowledge-graph-demo
python -m venv venv
source venv/bin/activate
pip install neo4j-agent-memory[all] python-dotenv
Create .env:
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password123
OPENAI_API_KEY=your-openai-api-key
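python-dotenv loads this file into the process environment at startup. Conceptually it is simple KEY=VALUE parsing, which the following stdlib-only sketch illustrates (the real library also handles quoting, variable interpolation, and `export` prefixes):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse .env-style lines into a dict, skipping blanks and comments.

    Simplified sketch of what python-dotenv does; not a replacement for it.
    """
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
# credentials below
NEO4J_PASSWORD=password123
"""
print(parse_env(sample)["NEO4J_URI"])  # bolt://localhost:7687
```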
Step 2: Prepare Sample Documents
Create a documents folder with sample financial documents. For this tutorial, we’ll create synthetic documents:
# create_sample_docs.py
import os
os.makedirs("documents", exist_ok=True)
DOCUMENTS = {
"company_profile_acme.txt": """
Acme Investment Holdings LLC is a mid-sized investment firm headquartered in
New York City. Founded in 2010 by CEO Sarah Johnson, the firm manages
approximately $2 billion in assets for institutional and high-net-worth clients.
The firm specializes in technology and healthcare sector investments. Their
flagship fund, Acme Growth Fund, has consistently outperformed the S&P 500
benchmark over the past five years.
Key personnel include:
- Sarah Johnson, CEO and Founder
- Michael Chen, Chief Investment Officer
- Lisa Park, Head of Research
- Robert Williams, Chief Compliance Officer
The firm has offices in New York, Boston, and San Francisco.
""",
"earnings_report_q4.txt": """
Q4 2024 Earnings Summary - Tech Sector Overview
Apple Inc. (AAPL) reported record quarterly revenue of $119.6 billion,
driven by strong iPhone 15 sales. CEO Tim Cook highlighted growth in
services revenue and the Apple Vision Pro launch.
Microsoft Corporation (MSFT) exceeded expectations with $62 billion in
revenue. CEO Satya Nadella emphasized AI integration across products and
strong Azure cloud growth of 28% year-over-year.
NVIDIA Corporation (NVDA) continues to dominate the AI chip market with
$22 billion in data center revenue. CEO Jensen Huang announced expanded
partnerships with major cloud providers.
Amazon.com Inc. (AMZN) reported $170 billion in revenue with AWS growing
13%. CEO Andy Jassy highlighted AI services adoption and retail efficiency
improvements.
Alphabet Inc. (GOOGL) achieved $86 billion revenue with YouTube and Cloud
showing strong momentum. CEO Sundar Pichai announced Gemini AI integration
across Google products.
""",
"market_analysis.txt": """
2024 Technology Sector Analysis
The technology sector experienced significant transformation in 2024,
driven primarily by artificial intelligence investments. Morgan Stanley
analyst Brian Nowak raised price targets for several AI-focused companies.
Goldman Sachs technology analyst Eric Sheridan maintains overweight ratings
on Microsoft, Alphabet, and Amazon, citing cloud computing and AI tailwinds.
JPMorgan's semiconductor team, led by analyst Harlan Sur, upgraded NVIDIA
to overweight following strong data center demand. The firm also initiated
coverage on AMD with a buy rating.
BlackRock, the world's largest asset manager, increased technology sector
allocation in their model portfolios. CEO Larry Fink cited AI as a
"defining technology trend" in the recent quarterly letter.
Regional focus:
- Silicon Valley remains the primary hub for AI innovation
- Austin, Texas emerging as secondary tech hub
- Seattle maintaining cloud computing leadership
- New York strengthening fintech presence
""",
"client_meeting_notes.txt": """
Client Meeting Notes - Acme Investment Holdings
Date: January 15, 2024
Attendees: Sarah Johnson (Acme), Michael Chen (Acme), John Smith (Client)
Discussion Summary:
John Smith, portfolio manager at Riverside Capital, discussed rebalancing
their $50 million technology allocation. Current holdings include Apple,
Microsoft, and NVIDIA representing 60% of the portfolio.
Key points discussed:
1. Reduce concentration in NVIDIA given valuation concerns
2. Add exposure to cloud infrastructure through Amazon AWS
3. Consider Alphabet for AI/advertising diversification
4. Maintain Apple position for dividend income
Sarah Johnson recommended a phased rebalancing over Q1 2024 to minimize
market impact. Michael Chen will prepare detailed trade recommendations.
Action items:
- Michael Chen to send trade proposal by January 20
- John Smith to review with Riverside's risk committee
- Follow-up call scheduled for January 25
Next meeting: February 15, 2024 for Q1 review
"""
}
for filename, content in DOCUMENTS.items():
    with open(f"documents/{filename}", "w") as f:
f.write(content.strip())
print(f"Created {len(DOCUMENTS)} sample documents in ./documents/")
Run it:
python create_sample_docs.py
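In the next step, the extraction script stores each file as a DOCUMENT entity, deriving a human-readable name from the filename. That transformation is plain standard-library string handling:

```python
from pathlib import Path

def doc_entity_name(path: Path) -> str:
    """Derive a display name from a filename: drop the extension,
    replace underscores with spaces, and title-case the words."""
    return path.stem.replace("_", " ").title()

print(doc_entity_name(Path("documents/company_profile_acme.txt")))
# Company Profile Acme
```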
Step 3: Configure Domain-Specific Extraction
Create a custom financial services schema:
# extract.py
import asyncio
import os
from pathlib import Path
from dotenv import load_dotenv
from neo4j_agent_memory import MemoryClient
from neo4j_agent_memory.extraction import (
GLiNEREntityExtractor,
GLiNERWithRelationsExtractor,
)
from neo4j_agent_memory.schema import EntitySchemaConfig, EntityTypeConfig
load_dotenv()
def create_financial_schema() -> EntitySchemaConfig:
"""Create a schema optimized for financial documents."""
return EntitySchemaConfig(
name="financial_services",
version="1.0",
description="Schema for financial services knowledge graph",
entity_types=[
EntityTypeConfig(
name="PERSON",
description="Individual people including executives, analysts, clients",
examples=["Tim Cook", "Sarah Johnson", "Brian Nowak"],
),
EntityTypeConfig(
name="COMPANY",
description="Companies, corporations, firms",
examples=["Apple Inc.", "Acme Investment Holdings", "Goldman Sachs"],
),
EntityTypeConfig(
name="SECURITY",
description="Stocks, bonds, ETFs with ticker symbols",
examples=["Apple (AAPL)", "Microsoft stock", "S&P 500"],
),
EntityTypeConfig(
name="FUND",
description="Investment funds, ETFs, mutual funds",
examples=["Acme Growth Fund", "Vanguard 500 Index"],
),
EntityTypeConfig(
name="LOCATION",
description="Cities, regions, offices",
examples=["New York City", "Silicon Valley", "Austin, Texas"],
),
EntityTypeConfig(
name="FINANCIAL_METRIC",
description="Revenue, earnings, amounts, percentages",
examples=["$119.6 billion revenue", "28% growth", "$50 million"],
),
EntityTypeConfig(
name="DATE",
description="Dates, quarters, years",
examples=["Q4 2024", "January 15, 2024", "2024"],
),
EntityTypeConfig(
name="SECTOR",
description="Industry sectors and categories",
examples=["technology sector", "healthcare", "AI"],
),
],
)
async def extract_from_documents():
"""Extract entities and relationships from all documents."""
# Initialize
client = MemoryClient(
neo4j_uri=os.getenv("NEO4J_URI"),
neo4j_user=os.getenv("NEO4J_USER"),
neo4j_password=os.getenv("NEO4J_PASSWORD"),
)
await client.initialize()
# Create extractor with custom schema
schema = create_financial_schema()
extractor = GLiNERWithRelationsExtractor.for_schema(schema)
    print("✅ Extractor initialized with financial schema")
print(f" Entity types: {[t.name for t in schema.entity_types]}")
# Process each document
doc_path = Path("documents")
documents = list(doc_path.glob("*.txt"))
    print(f"\n📄 Processing {len(documents)} documents...")
all_entities = []
all_relations = []
for doc_file in documents:
print(f"\n Processing: {doc_file.name}")
# Read document
content = doc_file.read_text()
# Extract entities and relations
result = await extractor.extract(content)
print(f" Entities: {len(result.entities)}")
print(f" Relations: {len(result.relations)}")
# Store document as entity
doc_entity = await client.long_term.add_entity(
name=doc_file.stem.replace("_", " ").title(),
entity_type="DOCUMENT",
properties={
"filename": doc_file.name,
"content_preview": content[:200] + "...",
},
)
# Store entities with deduplication
for entity in result.entities:
            stored = await client.long_term.add_entity(
name=entity.name,
entity_type=entity.type,
properties={
"confidence": entity.confidence,
"source_doc": doc_file.name,
},
)
# Link entity to source document
await client.long_term.add_relationship(
from_entity=stored.id,
to_entity=doc_entity.id,
relationship_type="MENTIONED_IN",
)
all_entities.append(stored)
            if entity.type == "PERSON":
                print(f" 👤 {entity.name}")
            elif entity.type == "COMPANY":
                print(f" 🏢 {entity.name}")
            elif entity.type == "SECURITY":
                print(f" 📈 {entity.name}")
            elif entity.type == "LOCATION":
                print(f" 📍 {entity.name}")
# Store relationships
for relation in result.relations:
# Find source and target entities
source_entities = await client.long_term.search_entities(
query=relation.source,
limit=1,
)
target_entities = await client.long_term.search_entities(
query=relation.target,
limit=1,
)
if source_entities and target_entities:
await client.long_term.add_relationship(
from_entity=source_entities[0].id,
to_entity=target_entities[0].id,
relationship_type=relation.type.upper().replace(" ", "_"),
properties={
"confidence": relation.confidence,
"source_doc": doc_file.name,
},
)
all_relations.append(relation)
                print(f" 🔗 {relation.source} --[{relation.type}]--> {relation.target}")
# Summary
print("\n" + "="*60)
    print("📊 Knowledge Graph Summary")
print("="*60)
print(f"Documents processed: {len(documents)}")
print(f"Entities extracted: {len(all_entities)}")
print(f"Relationships discovered: {len(all_relations)}")
# Count by type
from collections import Counter
type_counts = Counter(e.type for e in all_entities)
print("\nEntities by type:")
for entity_type, count in type_counts.most_common():
print(f" {entity_type}: {count}")
await client.close()
return all_entities, all_relations
if __name__ == "__main__":
asyncio.run(extract_from_documents())
Run the extraction:
python extract.py
You should see output like:
✅ Extractor initialized with financial schema
Entity types: ['PERSON', 'COMPANY', 'SECURITY', 'FUND', 'LOCATION', ...]
📄 Processing 4 documents...

 Processing: company_profile_acme.txt
 Entities: 12
 Relations: 5
 🏢 Acme Investment Holdings LLC
 📍 New York City
 👤 Sarah Johnson
 👤 Michael Chen
 ...
============================================================
📊 Knowledge Graph Summary
============================================================
Documents processed: 4
Entities extracted: 45
Relationships discovered: 18
Entities by type:
PERSON: 15
COMPANY: 12
LOCATION: 8
SECURITY: 6
...
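One detail worth calling out from extract.py: Neo4j relationship types are conventionally UPPER_SNAKE_CASE, so the script normalizes whatever free-text label the extractor returns before storing it. The transformation on its own:

```python
def normalize_relation_type(label: str) -> str:
    """Turn a free-text relation label ("works at") into a
    Neo4j-style relationship type ("WORKS_AT")."""
    return label.upper().replace(" ", "_")

print(normalize_relation_type("works at"))     # WORKS_AT
print(normalize_relation_type("invested in"))  # INVESTED_IN
```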
Step 4: Query the Knowledge Graph
Create a query interface:
# query.py
import asyncio
import os
from dotenv import load_dotenv
from neo4j_agent_memory import MemoryClient
load_dotenv()
async def main():
client = MemoryClient(
neo4j_uri=os.getenv("NEO4J_URI"),
neo4j_user=os.getenv("NEO4J_USER"),
neo4j_password=os.getenv("NEO4J_PASSWORD"),
)
await client.initialize()
    print("🔍 Knowledge Graph Query Interface")
print("="*50)
# Query 1: Find all companies
print("\n1. All Companies in the Knowledge Graph:")
companies = await client.long_term.search_entities(
query="",
entity_type="COMPANY",
limit=20,
)
for company in companies:
        print(f" 🏢 {company.name}")
# Query 2: Find people and their roles
print("\n2. Key People Mentioned:")
people = await client.long_term.search_entities(
query="CEO executive analyst",
entity_type="PERSON",
limit=10,
)
for person in people:
        print(f" 👤 {person.name}")
# Query 3: Semantic search - AI companies
print("\n3. Entities Related to AI:")
ai_entities = await client.long_term.search_entities(
query="artificial intelligence machine learning AI",
limit=10,
)
for entity in ai_entities:
print(f" [{entity.type}] {entity.name} (score: {entity.score:.2f})")
# Query 4: Find relationships using Cypher
print("\n4. Person-Company Relationships:")
results = await client.long_term.execute_query(
"""
MATCH (p:Entity {type: 'PERSON'})-[r]->(c:Entity {type: 'COMPANY'})
RETURN p.name as person, type(r) as relationship, c.name as company
LIMIT 10
""",
)
for row in results:
print(f" {row['person']} --[{row['relationship']}]--> {row['company']}")
# Query 5: Find entities mentioned in multiple documents
print("\n5. Cross-Document Entities (mentioned in 2+ docs):")
results = await client.long_term.execute_query(
"""
MATCH (e:Entity)-[:MENTIONED_IN]->(d:Entity {type: 'DOCUMENT'})
WITH e, count(DISTINCT d) as doc_count
WHERE doc_count >= 2
RETURN e.name as entity, e.type as type, doc_count
ORDER BY doc_count DESC
LIMIT 10
""",
)
for row in results:
print(f" [{row['type']}] {row['entity']} - {row['doc_count']} documents")
# Query 6: Find path between two entities
print("\n6. Connection Path: Sarah Johnson to Apple:")
results = await client.long_term.execute_query(
"""
MATCH path = shortestPath(
(a:Entity {name: 'Sarah Johnson'})-[*..5]-(b:Entity)
)
WHERE b.name CONTAINS 'Apple'
RETURN [n in nodes(path) | n.name] as path_nodes
LIMIT 1
""",
)
for row in results:
        print(f" Path: {' → '.join(row['path_nodes'])}")
await client.close()
if __name__ == "__main__":
asyncio.run(main())
Run the queries:
python query.py
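To build intuition for query 5 (cross-document entities), here is the same `count(DISTINCT d)` aggregation expressed in plain Python over a few illustrative (entity, document) mention pairs, standing in for the graph's MENTIONED_IN edges:

```python
from collections import Counter

# illustrative (entity, document) pairs, standing in for MENTIONED_IN edges
mentions = [
    ("Apple Inc.", "earnings_report_q4.txt"),
    ("Apple Inc.", "client_meeting_notes.txt"),
    ("Sarah Johnson", "company_profile_acme.txt"),
    ("Sarah Johnson", "client_meeting_notes.txt"),
    ("Tim Cook", "earnings_report_q4.txt"),
]

# count DISTINCT documents per entity, like `WITH e, count(DISTINCT d)`
doc_counts = Counter(entity for entity, doc in sorted(set(mentions)))

# keep entities mentioned in 2+ documents, like `WHERE doc_count >= 2`
cross_doc = {e: n for e, n in doc_counts.items() if n >= 2}
print(cross_doc)  # {'Apple Inc.': 2, 'Sarah Johnson': 2}
```

The deduplication via `set(mentions)` mirrors the DISTINCT: an entity mentioned three times in one document still counts that document once.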
Step 5: Visualize in Neo4j Browser
Open Neo4j Browser at http://localhost:7474 and run these queries:
View Full Knowledge Graph
MATCH (n:Entity)
OPTIONAL MATCH (n)-[r]->(m:Entity)
RETURN n, r, m
LIMIT 100
View Person-Company Network
MATCH (p:Entity {type: 'PERSON'})-[r]->(c:Entity {type: 'COMPANY'})
RETURN p, r, c
View Document-Entity Connections
MATCH (e:Entity)-[r:MENTIONED_IN]->(d:Entity {type: 'DOCUMENT'})
RETURN e, r, d
LIMIT 50
[Screenshot placeholder: Neo4j Browser showing the financial knowledge graph with PERSON nodes (blue) connected to COMPANY nodes (green) via WORKS_AT relationships, and SECURITY nodes (yellow) connected to companies.]
Step 6: Use the Knowledge Graph with an Agent
Create an agent that queries the knowledge graph:
# agent_with_kg.py
import asyncio
import os
import json
from dotenv import load_dotenv
from openai import AsyncOpenAI
from neo4j_agent_memory import MemoryClient
load_dotenv()
openai_client = AsyncOpenAI()
memory_client = None
async def initialize():
global memory_client
memory_client = MemoryClient(
neo4j_uri=os.getenv("NEO4J_URI"),
neo4j_user=os.getenv("NEO4J_USER"),
neo4j_password=os.getenv("NEO4J_PASSWORD"),
)
await memory_client.initialize()
async def search_knowledge_graph(query: str, entity_type: str | None = None) -> str:
"""Search the knowledge graph for relevant entities."""
entities = await memory_client.long_term.search_entities(
query=query,
entity_type=entity_type,
limit=10,
)
return json.dumps([
{"name": e.name, "type": e.type, "score": round(e.score, 2)}
for e in entities
])
async def find_relationships(entity_name: str) -> str:
"""Find relationships for an entity."""
results = await memory_client.long_term.execute_query(
"""
MATCH (e:Entity)-[r]-(related:Entity)
WHERE e.name CONTAINS $name
RETURN e.name as entity, type(r) as relation, related.name as related_entity, related.type as related_type
LIMIT 10
""",
parameters={"name": entity_name},
)
return json.dumps(results)
async def answer_question(question: str) -> str:
"""Answer a question using the knowledge graph."""
# First, search for relevant entities
entities = await search_knowledge_graph(question)
# Build context from knowledge graph
context = f"""
Knowledge Graph Results for: "{question}"
Relevant Entities:
{entities}
"""
# If question mentions a specific entity, find its relationships
keywords = question.lower().split()
for keyword in keywords:
if len(keyword) > 3:
relationships = await find_relationships(keyword)
if relationships and relationships != "[]":
context += f"\nRelationships for '{keyword}':\n{relationships}"
# Generate answer using LLM with knowledge graph context
response = await openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": """You are a financial research assistant with access to a knowledge graph.
Use the provided knowledge graph data to answer questions accurately.
If the knowledge graph doesn't have the information, say so clearly."""
},
{
"role": "user",
"content": f"{context}\n\nQuestion: {question}"
}
],
temperature=0.3,
)
return response.choices[0].message.content
async def main():
await initialize()
    print("🤖 Knowledge Graph Q&A Agent")
print("="*50)
print("Ask questions about the financial documents.")
print("Type 'quit' to exit.\n")
questions = [
"Who is the CEO of Acme Investment Holdings?",
"Which companies are mentioned in the earnings report?",
"What is the relationship between Sarah Johnson and Michael Chen?",
"What locations are mentioned in the documents?",
"Tell me about NVIDIA's performance.",
]
print("Sample questions you can ask:")
for q in questions:
        print(f" • {q}")
print()
while True:
question = input("You: ").strip()
if question.lower() == "quit":
break
if not question:
continue
answer = await answer_question(question)
        print(f"\n🤖 Agent: {answer}\n")
await memory_client.close()
if __name__ == "__main__":
asyncio.run(main())
Run the agent:
python agent_with_kg.py
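Note that answer_question uses a deliberately naive heuristic to decide which terms get a relationship lookup: any whitespace-separated token longer than three characters. One cheap improvement is stripping punctuation so "Holdings?" matches the stored entity name. A sketch of the refined heuristic:

```python
import string

def candidate_keywords(question: str) -> list[str]:
    """Mirror the agent's keyword heuristic (tokens longer than three
    characters), with punctuation stripped so "Holdings?" still matches."""
    words = question.lower().split()
    return [w.strip(string.punctuation) for w in words
            if len(w.strip(string.punctuation)) > 3]

print(candidate_keywords("Who is the CEO of Acme Investment Holdings?"))
# ['acme', 'investment', 'holdings']
```

A production agent would more likely use the LLM itself (or the entity extractor) to identify lookup targets, but the heuristic keeps the tutorial self-contained.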
What You’ve Built
You now have a complete knowledge graph system that:
- Extracts entities using a domain-specific schema
- Discovers relationships between entities automatically
- Links to source documents for provenance tracking
- Enables semantic search across all knowledge
- Powers intelligent Q&A with LLM integration
Extending the Knowledge Graph
Ideas for enhancement:
- Add more document types: PDFs, web pages, emails
- Enrich with external data: Wikipedia, company databases
- Build custom relationship types: competitor_of, invested_in, partner_with
- Add temporal tracking: when relationships were valid
- Create domain dashboards: visualize sector trends
Next Steps
- Process Documents in Batch - Scale to thousands of documents
- Configure Entity Extraction - Advanced extraction settings
- Handle Duplicate Entities - Clean up your graph