# How Entity Extraction Works

Understanding the multi-stage extraction pipeline, trade-offs between speed and accuracy, and how extractors work together.
## The Challenge

Entity extraction from natural language is hard. Consider this text:

> "Marc Andreessen and Ben Horowitz discussed their investment in OpenAI on the a16z podcast. The San Francisco-based fund focuses on AI companies."

Humans easily identify: Marc Andreessen (person), Ben Horowitz (person), OpenAI (organization), a16z (organization), San Francisco (location), AI (technology/concept).

Machines need help.
## Three Extraction Approaches

neo4j-agent-memory supports three fundamentally different extraction approaches, each with trade-offs:

[DIAGRAM PLACEHOLDER: Extraction Pipeline]
### spaCy: Statistical NER

spaCy uses pre-trained statistical models to recognize entities.

| Pros | Cons |
|---|---|
| Very fast (native code) | Fixed entity types (PERSON, ORG, GPE, etc.) |
| No external API calls | May miss domain-specific entities |
| Free to run | Lower accuracy on specialized text |
| Runs offline | No confidence scores |

**Best for:** Fast initial extraction, common entity types, offline use.
### GLiNER: Zero-Shot NER

GLiNER uses transformer models to extract entities based on type descriptions, not training data.

| Pros | Cons |
|---|---|
| Custom entity types via descriptions | Slower than spaCy (neural network) |
| Good accuracy on diverse text | Requires model download (~400MB) |
| Confidence scores | May hallucinate on ambiguous text |
| Works with domain schemas | GPU recommended for speed |

**Best for:** Domain-specific extraction, custom types, balancing speed and accuracy.
### LLM: Context-Aware Extraction

LLM extractors use large language models (GPT-4, etc.) with structured output.

| Pros | Cons |
|---|---|
| Best understanding of context | Highest latency |
| Can handle ambiguous text | Most expensive (API costs) |
| Explains reasoning | Rate limits |
| Most flexible | May be inconsistent |

**Best for:** High-value extraction, complex text, when accuracy matters most.
## Multi-Stage Pipelines

The real power comes from combining extractors in a pipeline:

```python
from neo4j_agent_memory.extraction import (
    ExtractionPipeline,
    GLiNEREntityExtractor,
    MergeStrategy,
    SpacyEntityExtractor,
)

pipeline = ExtractionPipeline(
    stages=[
        SpacyEntityExtractor(),                       # Fast first pass
        GLiNEREntityExtractor.for_schema("podcast"),  # Domain refinement
    ],
    merge_strategy=MergeStrategy.CONFIDENCE,
)
```
### Why Pipelines?

- **Speed vs. cost:** spaCy catches obvious entities instantly; GLiNER refines with domain knowledge; the LLM handles the hard cases.
- **Redundancy:** Multiple extractors provide confirmation. An entity found by both spaCy and GLiNER is more reliable.
- **Coverage:** Different extractors have different strengths. spaCy finds common entities; domain schemas find specialized ones.
## Merge Strategies

When multiple extractors find entities, how do we combine the results?

### CONFIDENCE (Default)

Keep the entity with the highest confidence score:

```python
# If spaCy finds "John" (no confidence) and GLiNER finds "John Smith" (0.95),
# the result is "John Smith" with confidence 0.95.
```

**Best for:** General use, when you trust confidence scores.
### UNION

Keep all unique entities from all stages:

```python
# spaCy:  ["John", "Acme"]
# GLiNER: ["John Smith", "Acme Corp", "NYC"]
# Result: ["John", "John Smith", "Acme", "Acme Corp", "NYC"]
```

**Best for:** Maximum recall, when you'll filter or merge later.
### INTERSECTION

Keep only entities found by multiple stages:

```python
# spaCy:  ["John", "Acme"]
# GLiNER: ["John Smith", "Acme", "NYC"]
# Result: ["Acme"]  # Only "Acme" was found by both
```

**Best for:** High precision, when you only want confirmed entities.
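The three strategies can be sketched as plain functions over `(text, confidence)` pairs. This tuple shape, and grouping spans by their first token, are illustrative simplifications rather than the library's internals:

```python
from collections import Counter

def merge_confidence(stages):
    """Per overlapping span, keep the highest-confidence candidate.

    A missing confidence counts as 0.0, so scored candidates win; grouping
    by a shared first token is a crude stand-in for real span overlap."""
    best = {}
    for entities in stages:
        for text, conf in entities:
            key = text.split()[0].lower()
            if key not in best or (conf or 0.0) > (best[key][1] or 0.0):
                best[key] = (text, conf)
    return [text for text, _ in best.values()]

def merge_union(stages):
    """Keep every unique entity text from every stage."""
    seen, out = set(), []
    for entities in stages:
        for text, _ in entities:
            if text not in seen:
                seen.add(text)
                out.append(text)
    return out

def merge_intersection(stages):
    """Keep only texts found by more than one stage."""
    counts = Counter()
    for entities in stages:
        counts.update({text for text, _ in entities})  # once per stage
    return [text for text, n in counts.items() if n >= 2]

spacy_out = [("John", None), ("Acme", None)]
gliner_out = [("John Smith", 0.95), ("Acme", 0.90), ("NYC", 0.88)]
```

With these inputs, UNION keeps all four distinct surface forms, INTERSECTION keeps only `"Acme"`, and CONFIDENCE prefers the scored `"John Smith"` over the unscored `"John"`.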
Domain Schemas
GLiNER’s zero-shot capability becomes powerful with domain schemas - descriptions of entity types for your specific use case.
# The "podcast" schema
{
"PERSON": "A person mentioned or interviewed",
"COMPANY": "A company, startup, or business",
"PRODUCT": "A product, service, or platform",
"CONCEPT": "A concept, methodology, or framework",
}
# GLiNER uses these descriptions to guide extraction
extractor = GLiNEREntityExtractor.for_schema("podcast")
### Why Descriptions Matter

GLiNER doesn't just match labels; it uses the semantic meaning of the descriptions:

- `"PERSON"` → generic person detection
- `"A person mentioned or interviewed"` → focuses on people in an interview context, not passing mentions

This is why domain schemas significantly improve extraction quality.
## Relationship Extraction with GLiREL

Beyond entities, GLiREL extracts relationships:

```python
from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

extractor = GLiNERWithRelationsExtractor.for_poleo()
result = await extractor.extract("Marc works at a16z in San Francisco")

# Entities:  [Marc, a16z, San Francisco]
# Relations: [Marc -[WORKS_AT]-> a16z, a16z -[LOCATED_IN]-> San Francisco]
```

Relationship extraction finds typed connections between entities, building richer knowledge graphs.
### Automatic Relationship Storage

When adding messages with entity extraction enabled, extracted relationships are automatically stored as RELATED_TO relationships in Neo4j:

```python
# Relationships are stored automatically when adding messages
await memory.short_term.add_message(
    "session-1",
    "user",
    "Brian Chesky founded Airbnb in San Francisco.",
    extract_entities=True,
    extract_relations=True,  # Default: True
)

# This creates:
# - Entity nodes: Brian Chesky (PERSON), Airbnb (ORGANIZATION), San Francisco (LOCATION)
# - MENTIONS relationships: Message -> Entity
# - RELATED_TO relationships: (Brian Chesky)-[:RELATED_TO {relation_type: "FOUNDED"}]->(Airbnb)
```
The relationship storage works for:

- `add_message()` - extract and store from individual messages (default: `extract_relations=True`)
- `add_messages_batch()` - batch operations (default: `extract_relations=True`; only applies when `extract_entities=True`)
- `extract_entities_from_session()` - post-hoc extraction from existing messages (default: `extract_relations=True`)

```python
# Post-hoc extraction with relationship storage
result = await memory.short_term.extract_entities_from_session(
    "session-1",
    extract_relations=True,  # Default: True
)
print(f"Extracted {result['relations_extracted']} relationships")
```
### Relationship Storage Strategies

Relationships can reference entities in two ways:

- **Same-message entities:** When both the source and target entity are extracted from the same message, relationships are created using entity IDs (most efficient).
- **Cross-message entities:** When a relationship references an entity from a previous message, the system looks up entities by name. This enables building a connected knowledge graph across a conversation.
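A sketch of those two lookup paths, using plain dicts as hypothetical stand-ins for the Neo4j-backed entity store:

```python
def resolve_endpoints(source_name, target_name, message_entities, session_name_index):
    """Resolve a relation's endpoints to stored entity IDs.

    message_entities maps name -> ID for entities extracted from the
    current message (fast path); session_name_index maps names seen
    earlier in the session to IDs (cross-message fallback)."""
    def lookup(name):
        if name in message_entities:         # same-message: ID already in hand
            return message_entities[name]
        return session_name_index.get(name)  # cross-message: look up by name

    source_id, target_id = lookup(source_name), lookup(target_name)
    if source_id is None or target_id is None:
        return None                          # unknown endpoint: skip the edge
    return (source_id, target_id)

# "Airbnb" was first mentioned in an earlier message in the session
edge = resolve_endpoints(
    "Brian Chesky", "Airbnb",
    message_entities={"Brian Chesky": "ent-12"},
    session_name_index={"Airbnb": "ent-7"},
)
```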
## Performance Considerations

### Latency

Typical extraction times for a 500-word document:

| Extractor | CPU | GPU |
|---|---|---|
| spaCy | 10-50ms | N/A |
| GLiNER | 200-500ms | 50-100ms |
| LLM (GPT-4o-mini) | 500-2000ms | N/A |
### Batch Processing

For many documents, use batch extraction:

```python
result = await pipeline.extract_batch(
    texts,
    batch_size=32,       # GLiNER batch size
    max_concurrency=5,   # Parallel extractions
)
```
GLiNER specifically benefits from batching on GPU.
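A cap like `max_concurrency` is typically a semaphore around per-batch work. A minimal stdlib sketch, where the `extract_one` coroutine is a stand-in for a real extractor call:

```python
import asyncio

async def extract_batch(texts, extract_one, batch_size=32, max_concurrency=5):
    """Split texts into batches and run at most max_concurrency batches at once."""
    sem = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch):
        async with sem:                          # cap concurrent batches
            return [await extract_one(t) for t in batch]

    per_batch = await asyncio.gather(*(run(b) for b in batches))
    return [item for batch in per_batch for item in batch]  # flatten, in order

async def fake_extract(text):
    await asyncio.sleep(0)                       # pretend to call a model
    return text.upper()

out = asyncio.run(extract_batch(["ai", "nlp", "ner"], fake_extract, batch_size=2))
```

`asyncio.gather` preserves batch order, so results line up with the input texts even though batches run concurrently.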
### Streaming for Long Documents

For very long documents (>100K tokens), use streaming:

```python
from neo4j_agent_memory.extraction import StreamingExtractor

streamer = StreamingExtractor(
    extractor,
    chunk_size=4000,
    overlap=200,
)

async for chunk_result in streamer.extract_streaming(long_document):
    print(f"Chunk {chunk_result.chunk.index}: {chunk_result.entity_count} entities")
```
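The chunking behind streaming is a sliding window: advance by `chunk_size - overlap` so anything spanning a chunk boundary appears whole in at least one chunk. A character-based sketch (a real splitter would count tokens):

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Yield overlapping chunks; each step advances by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text

chunks = list(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# Consecutive chunks share 2 characters, so no boundary entity is split in both
```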
## Choosing the Right Approach

| Scenario | Recommendation | Why |
|---|---|---|
| Fast indexing, common entities | spaCy only | Speed, simplicity |
| Domain-specific text | GLiNER with schema | Custom types, good accuracy |
| High-value documents | Full pipeline | Maximum accuracy |
| Mixed workload | Tiered pipeline | Balance speed/accuracy |
| Relations needed | GLiNER + GLiREL | Entity + relation extraction |
## See Also

- Configure Entity Extraction - practical setup
- Extractor Classes Reference - API details
- Domain Schemas Reference - available schemas
- Process Documents in Batch - batch and streaming