# Configure Entity Extraction

How to configure entity extraction pipelines to automatically build your context graph from conversations and documents.
## Overview

Entity extraction transforms unstructured text into structured knowledge in your context graph. The extracted entities, relationships, and facts become the foundation for personalized agent interactions.
*Figure: Context Graph Construction*
## Prerequisites

- `neo4j-agent-memory` installed
- For GLiNER extraction: `pip install neo4j-agent-memory[gliner]`
- For LLM extraction: OpenAI API key or compatible LLM
## Quick Start

### Default Extraction

Use the built-in POLE+O schema for general-purpose extraction:
```python
from neo4j_agent_memory import MemoryClient
from neo4j_agent_memory.extraction import GLiNEREntityExtractor

client = MemoryClient(
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password",
)

# Create extractor with POLE+O schema
extractor = GLiNEREntityExtractor.for_poleo()

# Extract entities from text
text = """
Customer Jane Smith called about her order #12345 from Nike.
She purchased Air Max 90 shoes last week from our Manhattan store.
She mentioned she prefers next-day delivery for future orders.
"""

result = await extractor.extract(text)

print("Extracted entities:")
for entity in result.entities:
    print(f"  {entity.name} ({entity.type}) - confidence: {entity.confidence:.2f}")
```
## Domain-Specific Schemas

Use pre-built schemas optimized for specific industries to build domain-relevant context graphs.

### Financial Services Schema
```python
from neo4j_agent_memory.extraction import GLiNEREntityExtractor

# Load the financial services schema
extractor = GLiNEREntityExtractor.for_schema("financial")

text = """
Client meeting with Acme Investment Holdings regarding their Q4 portfolio review.
They currently hold 10,000 shares of Apple (AAPL) and 5,000 shares of Microsoft (MSFT).
The client expressed interest in increasing exposure to the AI sector, specifically
mentioning NVIDIA and AMD as potential additions. Risk tolerance remains moderate-growth.
Advisor Sarah Johnson recommended a 15% allocation to technology, balanced with
fixed income through the Vanguard Total Bond ETF (BND).
"""

result = await extractor.extract(text)

# Entities automatically typed for the financial domain
for entity in result.entities:
    print(f"{entity.name}: {entity.type}")

# Output:
# Acme Investment Holdings: ORGANIZATION
# Apple: SECURITY
# AAPL: TICKER
# Microsoft: SECURITY
# NVIDIA: SECURITY
# Sarah Johnson: PERSON
# Vanguard Total Bond ETF: SECURITY
# BND: TICKER
```
| Type | Description | Examples |
|---|---|---|
| PERSON | Clients, advisors, contacts | "John Smith", "Sarah Johnson" |
| ORGANIZATION | Companies, funds, institutions | "Acme Holdings", "BlackRock" |
| SECURITY | Stocks, bonds, ETFs, funds | "Apple Inc.", "Treasury Bond" |
| TICKER | Stock/fund symbols | "AAPL", "BND", "SPY" |
| ACCOUNT | Account types and numbers | "IRA", "401k", "Account #12345" |
| AMOUNT | Dollar amounts, percentages | "$50,000", "15%", "10,000 shares" |
| DATE | Dates and time periods | "Q4 2024", "next quarter" |
| SECTOR | Industry sectors | "Technology", "Healthcare" |
| RISK_PROFILE | Risk classifications | "moderate-growth", "conservative" |
### Ecommerce Retail Schema

```python
extractor = GLiNEREntityExtractor.for_schema("ecommerce")

text = """
Customer inquiry from Jane Doe (Gold member) about order #ORD-98765.
She ordered Nike Air Max 90 in size 9 (SKU: NKE-AM90-WHT-9) from our
mobile app. The package was shipped via FedEx (tracking: 1234567890)
to her address in Brooklyn, NY. She's asking about the return policy
for the Adidas Ultraboost she's considering. Her preferred payment
method is Apple Pay and she mentioned she has a 20% off coupon code.
"""

result = await extractor.extract(text)

# Entities typed for the ecommerce context graph
for entity in result.entities:
    print(f"{entity.name}: {entity.type}")

# Output:
# Jane Doe: CUSTOMER
# ORD-98765: ORDER_ID
# Nike Air Max 90: PRODUCT
# NKE-AM90-WHT-9: SKU
# FedEx: CARRIER
# Brooklyn, NY: LOCATION
# Adidas Ultraboost: PRODUCT
# Apple Pay: PAYMENT_METHOD
# 20% off: PROMOTION
```
| Type | Description | Examples |
|---|---|---|
| CUSTOMER | Customer names and IDs | "Jane Doe", "CUST-12345" |
| PRODUCT | Product names | "Nike Air Max 90", "iPhone 15" |
| SKU | Product identifiers | "NKE-AM90-001", "APL-IPH15-256" |
| BRAND | Brand names | "Nike", "Apple", "Samsung" |
| CATEGORY | Product categories | "Footwear", "Electronics" |
| ORDER_ID | Order identifiers | "ORD-98765", "#12345" |
| CARRIER | Shipping carriers | "FedEx", "UPS", "USPS" |
| LOCATION | Addresses, stores, warehouses | "Brooklyn, NY", "Store #42" |
| PAYMENT_METHOD | Payment types | "Apple Pay", "Visa **1234" |
| PROMOTION | Coupons, discounts, sales | "20% off", "SUMMER2024" |
## Custom Domain Schemas

Create custom schemas to extract domain-specific entities for your context graph.

### Define Custom Entity Types
```python
from neo4j_agent_memory.schema import EntitySchemaConfig, EntityTypeConfig

# Define a custom schema for the insurance domain
insurance_schema = EntitySchemaConfig(
    name="insurance",
    version="1.0",
    description="Schema for insurance industry context graphs",
    entity_types=[
        EntityTypeConfig(
            name="POLICYHOLDER",
            description="Insurance policy holder or applicant",
            examples=["John Smith", "Acme Corporation"],
        ),
        EntityTypeConfig(
            name="POLICY",
            description="Insurance policy with number",
            examples=["Policy #INS-2024-001", "Auto Policy 12345"],
        ),
        EntityTypeConfig(
            name="COVERAGE",
            description="Type of insurance coverage",
            examples=["liability coverage", "comprehensive", "collision"],
        ),
        EntityTypeConfig(
            name="CLAIM",
            description="Insurance claim reference",
            examples=["Claim #CLM-98765", "accident claim"],
        ),
        EntityTypeConfig(
            name="PREMIUM",
            description="Insurance premium amount",
            examples=["$500/month", "annual premium of $6,000"],
        ),
        EntityTypeConfig(
            name="DEDUCTIBLE",
            description="Policy deductible amount",
            examples=["$1,000 deductible", "$500 collision deductible"],
        ),
        EntityTypeConfig(
            name="VEHICLE",
            description="Insured vehicle",
            examples=["2024 Toyota Camry", "Honda Accord"],
        ),
        EntityTypeConfig(
            name="PROPERTY",
            description="Insured property",
            examples=["123 Main St home", "commercial building"],
        ),
    ],
)

# Create extractor with the custom schema
extractor = GLiNEREntityExtractor.for_schema(insurance_schema)
```
### Save Schema to Neo4j

Persist schemas for reuse across sessions and applications:

```python
from neo4j_agent_memory.schema import SchemaManager

manager = SchemaManager(client)

# Save schema to Neo4j
stored = await manager.save_schema(
    insurance_schema,
    created_by="admin",
    set_active=True,
)
print(f"Schema saved with ID: {stored.id}")

# Later, load the schema
loaded = await manager.load_schema("insurance")
extractor = GLiNEREntityExtractor.for_schema(loaded.config)
```
### Extend Built-in Schemas

Add custom types to existing schemas:

```python
from neo4j_agent_memory.schema import get_schema, EntityTypeConfig

# Start with the ecommerce schema
base_schema = get_schema("ecommerce")

# Add custom types for your business
custom_types = [
    EntityTypeConfig(
        name="LOYALTY_TIER",
        description="Customer loyalty program tier",
        examples=["Gold member", "Platinum status", "VIP"],
    ),
    EntityTypeConfig(
        name="SUBSCRIPTION",
        description="Subscription service",
        examples=["Prime membership", "monthly box subscription"],
    ),
    EntityTypeConfig(
        name="GIFT_CARD",
        description="Gift card or store credit",
        examples=["$50 gift card", "store credit balance"],
    ),
]

# Extend the schema
extended_schema = base_schema.extend(
    name="ecommerce_extended",
    additional_types=custom_types,
)

extractor = GLiNEREntityExtractor.for_schema(extended_schema)
```
## Multi-Stage Extraction Pipelines

Combine multiple extractors for comprehensive context graph construction.

### GLiNER + LLM Pipeline

Use fast local extraction followed by LLM refinement:
```python
from neo4j_agent_memory.extraction import (
    ExtractionPipeline,
    GLiNEREntityExtractor,
    LLMEntityExtractor,
)

# Stage 1: Fast local extraction with GLiNER
gliner = GLiNEREntityExtractor.for_schema("financial")

# Stage 2: LLM for relationship extraction and refinement
llm = LLMEntityExtractor(
    model="gpt-4o-mini",
    extract_relations=True,
    schema="financial",
)

# Build the pipeline
pipeline = ExtractionPipeline(
    stages=[gliner, llm],
    merge_strategy="confidence",  # Keep the highest-confidence extraction
)

# Extract with the full pipeline
result = await pipeline.extract(text)

print(f"Entities: {len(result.entities)}")
print(f"Relations: {len(result.relations)}")

# Relations show how entities connect in the context graph
for rel in result.relations:
    print(f"  {rel.source} --[{rel.type}]--> {rel.target}")
```
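The `"confidence"` merge strategy resolves conflicts by keeping, for each entity, the stage extraction with the highest confidence. A minimal sketch of that resolution logic, using plain tuples in place of the library's entity objects (the function name and tuple shape here are illustrative, not the library's internals):

```python
def merge_by_confidence(stage_results):
    """Keep the highest-confidence extraction per entity name.

    stage_results: list of per-stage entity lists, each entity a
    (name, type, confidence) tuple. Names are matched case-insensitively.
    """
    best = {}
    for entities in stage_results:
        for name, etype, conf in entities:
            key = name.lower()
            if key not in best or conf > best[key][2]:
                best[key] = (name, etype, conf)
    return list(best.values())

# GLiNER and the LLM disagree on "Apple"; the LLM's label wins on confidence
gliner_out = [("Apple", "ORGANIZATION", 0.72), ("Sarah Johnson", "PERSON", 0.91)]
llm_out = [("Apple", "SECURITY", 0.95)]
merged = merge_by_confidence([gliner_out, llm_out])
# merged contains ("Apple", "SECURITY", 0.95) and ("Sarah Johnson", "PERSON", 0.91)
```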
### Conditional Pipeline

Apply different extractors based on content:

```python
from neo4j_agent_memory.extraction import ConditionalPipeline

pipeline = ConditionalPipeline(
    conditions=[
        # Use the financial schema for investment discussions
        {
            "keywords": ["portfolio", "investment", "stock", "bond", "dividend"],
            "extractor": GLiNEREntityExtractor.for_schema("financial"),
        },
        # Use the ecommerce schema for order and shipping questions
        {
            "keywords": ["order", "shipping", "product", "return", "delivery"],
            "extractor": GLiNEREntityExtractor.for_schema("ecommerce"),
        },
    ],
    # Default to general POLE+O
    default=GLiNEREntityExtractor.for_poleo(),
)

result = await pipeline.extract(text)
```
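The exact matching rules of `ConditionalPipeline` aren't specified here; one plausible first-match routing strategy can be sketched in plain Python, with strings standing in for extractor objects (everything in this snippet is illustrative):

```python
def route_extractor(text, conditions, default):
    """Return the first extractor whose keywords appear in the text.

    Matching is case-insensitive substring search, so a keyword like
    "order" would also match inside "border" -- a real router might
    tokenize first.
    """
    lowered = text.lower()
    for cond in conditions:
        if any(kw in lowered for kw in cond["keywords"]):
            return cond["extractor"]
    return default

conditions = [
    {"keywords": ["portfolio", "stock", "bond"], "extractor": "financial"},
    {"keywords": ["order", "shipping", "return"], "extractor": "ecommerce"},
]

route_extractor("Please review my stock portfolio", conditions, "poleo")
# -> "financial"
route_extractor("Where is my order?", conditions, "poleo")
# -> "ecommerce"
```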
## Relationship Extraction

Extract relationships to build connected context graphs.

### GLiREL for Relationships

Use GLiREL alongside GLiNER for local, zero-shot relationship extraction with no API cost:
```python
from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

extractor = GLiNERWithRelationsExtractor.for_schema("ecommerce")

text = """
Jane Doe purchased Nike Air Max 90 from our Manhattan store.
The product was manufactured by Nike and shipped via FedEx.
"""

result = await extractor.extract(text)

print("Context Graph Edges:")
for rel in result.relations:
    print(f"  ({rel.source}) -[:{rel.type}]-> ({rel.target})")

# Output:
# (Jane Doe) -[:PURCHASED]-> (Nike Air Max 90)
# (Nike Air Max 90) -[:SOLD_AT]-> (Manhattan store)
# (Nike Air Max 90) -[:MANUFACTURED_BY]-> (Nike)
# (Nike Air Max 90) -[:SHIPPED_VIA]-> (FedEx)
```
### Custom Relationship Types

Define domain-specific relationships:

```python
from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

# Custom relationship types for the financial domain
financial_relations = [
    # Person relationships
    {"name": "ADVISES", "description": "Financial advisor advises client"},
    {"name": "MANAGES", "description": "Manager manages account or fund"},
    {"name": "AUTHORIZED_ON", "description": "Person authorized on account"},
    # Security relationships
    {"name": "HOLDS", "description": "Account holds security position"},
    {"name": "TRADED", "description": "Executed trade in security"},
    {"name": "BENCHMARKED_TO", "description": "Portfolio benchmarked to index"},
    # Organization relationships
    {"name": "SUBSIDIARY_OF", "description": "Company is subsidiary of parent"},
    {"name": "CUSTODIED_AT", "description": "Assets custodied at institution"},
]

extractor = GLiNERWithRelationsExtractor(
    entity_schema="financial",
    relation_types=financial_relations,
)
```
## Store Extracted Entities

Add extracted entities to your context graph in Neo4j.

### Basic Storage
```python
from datetime import datetime

# Extract entities
result = await extractor.extract(text)

# Store each entity in the context graph
for entity in result.entities:
    stored = await client.long_term.add_entity(
        name=entity.name,
        entity_type=entity.type,
        properties={
            "confidence": entity.confidence,
            "source_text": text[:200],
            "extracted_at": datetime.now().isoformat(),
        },
    )
    print(f"Stored: {stored.name} ({stored.id})")
```
### With Provenance Tracking

Track where entities came from for auditing and quality control:

```python
# Store entity with provenance
entity = await client.long_term.add_entity(
    name=extracted.name,
    entity_type=extracted.type,
)

# Link to the source message
await client.long_term.link_entity_to_message(
    entity=entity,
    message_id=message.id,
    confidence=extracted.confidence,
    start_pos=extracted.start,
    end_pos=extracted.end,
    context=extracted.context,
)

# Link to the extractor for debugging
await client.long_term.link_entity_to_extractor(
    entity=entity,
    extractor_name="GLiNEREntityExtractor",
    extractor_version="1.0",
    confidence=extracted.confidence,
)
```
### With Deduplication

Prevent duplicate nodes in your context graph:

```python
from neo4j_agent_memory.memory import DeduplicationConfig

# Configure deduplication thresholds
dedup_config = DeduplicationConfig(
    auto_merge_threshold=0.95,  # Auto-merge highly similar entities
    flag_threshold=0.85,        # Flag for review between 0.85 and 0.95
    use_fuzzy_matching=True,    # Also use string similarity
    match_same_type_only=True,  # Only match within the same type
)

# Store with deduplication
for extracted in result.entities:
    entity, dedup_result = await client.long_term.add_entity(
        name=extracted.name,
        entity_type=extracted.type,
        deduplication=dedup_config,
    )
    if dedup_result.action == "merged":
        print(f"Merged '{extracted.name}' with existing '{dedup_result.matched_entity_name}'")
    elif dedup_result.action == "flagged":
        print(f"Flagged '{extracted.name}' for review against '{dedup_result.matched_entity_name}'")
    else:
        print(f"Created new entity: {entity.name}")
```
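The two thresholds partition the similarity score into three actions. A minimal sketch of that decision rule (illustrative only; the library computes similarity internally from embeddings and, optionally, fuzzy string matching):

```python
def dedup_action(similarity, auto_merge_threshold=0.95, flag_threshold=0.85):
    """Map a similarity score in [0, 1] to a deduplication action.

    >= auto_merge_threshold : merge automatically
    >= flag_threshold       : flag for human review
    otherwise               : create a new entity
    """
    if similarity >= auto_merge_threshold:
        return "merged"
    if similarity >= flag_threshold:
        return "flagged"
    return "created"

dedup_action(0.97)  # -> "merged"
dedup_action(0.90)  # -> "flagged"
dedup_action(0.50)  # -> "created"
```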
## Batch Extraction

Process multiple documents efficiently:

```python
# List of documents to process
documents = [
    {"id": "doc-1", "text": "Customer John ordered iPhone 15..."},
    {"id": "doc-2", "text": "Jane returned the Nike shoes..."},
    {"id": "doc-3", "text": "Order #12345 shipped via FedEx..."},
    # ... hundreds more
]

# Batch extraction
texts = [doc["text"] for doc in documents]
result = await extractor.extract_batch(
    texts=texts,
    batch_size=10,
    max_concurrency=5,
    on_progress=lambda done, total: print(f"Progress: {done}/{total}"),
)

print(f"Processed: {result.successful_items}/{result.total_items}")
print(f"Total entities extracted: {result.total_entities}")
print(f"Total relations extracted: {result.total_relations}")
```
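The `max_concurrency` parameter bounds how many extractions run at once. How `extract_batch` implements this is not shown here, but the standard pattern is an `asyncio.Semaphore`; a self-contained sketch with a stand-in extraction function (all names here are illustrative):

```python
import asyncio

async def extract_with_limit(texts, extract_fn, max_concurrency=5):
    """Run extract_fn over texts with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(text):
        async with sem:
            return await extract_fn(text)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(t) for t in texts))

async def fake_extract(text):
    # Stand-in for a real extractor: "extracts" the word count
    await asyncio.sleep(0)
    return len(text.split())

results = asyncio.run(extract_with_limit(["a b", "c d e", "f"], fake_extract, 2))
# results == [2, 3, 1]
```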
## Streaming Extraction

Process long documents in chunks:

```python
from neo4j_agent_memory.extraction import StreamingExtractor

# Wrap the extractor for streaming
streamer = StreamingExtractor(
    extractor=extractor,
    chunk_size=4000,  # Characters per chunk
    overlap=200,      # Overlap to avoid splitting entities
)

# Load a long document (100K+ characters)
with open("annual_report.txt") as f:
    long_document = f.read()

# Stream results as they're extracted
async for chunk_result in streamer.extract_streaming(long_document):
    print(f"Chunk {chunk_result.chunk.index}: {chunk_result.entity_count} entities")

    # Store entities as they're extracted
    for entity in chunk_result.entities:
        await client.long_term.add_entity(
            name=entity.name,
            entity_type=entity.type,
        )

# Or get a complete, deduplicated result
result = await streamer.extract(long_document, deduplicate=True)
print(f"Total unique entities: {result.stats.deduplicated_entities}")
```
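The overlap exists so that an entity mention straddling a chunk boundary appears whole in at least one chunk. A minimal sketch of overlapping chunking (illustrative; `StreamingExtractor` handles this internally and may split on token or sentence boundaries rather than raw characters):

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into chunks of chunk_size characters, where each chunk
    repeats the last `overlap` characters of the previous one."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    step = chunk_size - overlap  # advance less than a full chunk
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

parts = chunk_text("x" * 10000, chunk_size=4000, overlap=200)
# 3 chunks covering [0:4000], [3800:7800], [7600:10000]
```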
## Performance Optimization

### Choose the Right Extractor

| Extractor | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| spaCy | Very Fast | Basic | Free | High-volume, standard entities |
| GLiNER | Fast | Good | Free | Domain-specific, local deployment |
| GLiNER + GLiREL | Fast | Good | Free | Entities + relationships |
| LLM (GPT-4o-mini) | Slow | Excellent | $$ | Complex text, high accuracy needs |
| Hybrid Pipeline | Medium | Excellent | $ | Production systems |
### Optimize for Your Use Case

```python
# Fast local extraction for real-time chat
extractor = GLiNEREntityExtractor.for_schema("ecommerce")

# Process in batches for bulk imports
result = await extractor.extract_batch(
    texts=product_descriptions,
    batch_size=50,      # Larger batches for throughput
    max_concurrency=10,
)

# Multi-stage pipeline for accuracy
pipeline = ExtractionPipeline(
    stages=[
        GLiNEREntityExtractor.for_schema("financial"),
        LLMEntityExtractor(
            model="gpt-4o",          # Higher-quality model
            extract_relations=True,
            temperature=0,           # Deterministic output
        ),
    ],
    merge_strategy="union",  # Keep all extractions
)

# Verify extractions before adding to the context graph
result = await pipeline.extract(compliance_document)
for entity in result.entities:
    if entity.confidence < 0.8:
        # Flag for human review
        await flag_for_review(entity)
```
## Best Practices

### 1. Match Schema to Domain

Use domain-specific schemas for better extraction quality:

```python
# Good: domain-specific schema
extractor = GLiNEREntityExtractor.for_schema("financial")

# Less effective: generic schema for specialized content
extractor = GLiNEREntityExtractor.for_poleo()
```
### 2. Include Entity Examples

Examples improve extraction accuracy:

```python
EntityTypeConfig(
    name="TICKER",
    description="Stock ticker symbol",
    # Good examples help the model
    examples=["AAPL", "MSFT", "GOOGL", "NVDA", "BND", "SPY"],
)
```
### 3. Validate Before Storing

Check extraction quality before building the context graph:

```python
# Example stopword set; use a fuller list in production
stopwords = {"the", "a", "an", "and", "or", "it"}

for entity in result.entities:
    # Skip low-confidence extractions
    if entity.confidence < 0.6:
        continue
    # Skip very short entities (likely noise)
    if len(entity.name) < 2:
        continue
    # Skip stopwords extracted as entities
    if entity.name.lower() in stopwords:
        continue
    await client.long_term.add_entity(
        name=entity.name,
        entity_type=entity.type,
    )
```
### 4. Track Extraction Quality

Monitor extraction performance over time:

```python
# Log extraction metrics
metrics = {
    "document_id": doc_id,
    "text_length": len(text),
    "entities_extracted": len(result.entities),
    "relations_extracted": len(result.relations),
    # Guard against division by zero when nothing was extracted
    "avg_confidence": (
        sum(e.confidence for e in result.entities) / len(result.entities)
        if result.entities
        else 0.0
    ),
    "extraction_time_ms": result.extraction_time_ms,
    "extractor": extractor.__class__.__name__,
}
await log_metrics(metrics)
```