How Entity Extraction Works

Understanding the multi-stage extraction pipeline, trade-offs between speed and accuracy, and how extractors work together.

The Challenge

Entity extraction from natural language is hard. Consider this text:

"Marc Andreessen and Ben Horowitz discussed their investment in OpenAI on the a]6z podcast. The San Francisco-based fund focuses on AI companies."

Humans easily identify: Marc Andreessen (person), Ben Horowitz (person), OpenAI (organization), a16z (organization), San Francisco (location), AI (technology/concept).

Machines need help.
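A naive baseline makes the gap concrete. The snippet below (illustrative only, not part of the library) treats runs of capitalized words as entities:

```python
import re

TEXT = ("Marc Andreessen and Ben Horowitz discussed their investment in "
        "OpenAI on the a16z podcast. The San Francisco-based fund focuses "
        "on AI companies.")

# Naive heuristic: any run of capitalized words is an "entity".
naive = re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", TEXT)
print(naive)
# → ['Marc Andreessen', 'Ben Horowitz', 'Open', 'The San Francisco']
```

It misses a16z and AI entirely, truncates OpenAI, and swallows the sentence-initial "The": exactly the failure modes that NER models exist to avoid.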

Three Extraction Approaches

neo4j-agent-memory supports three fundamentally different extraction approaches, each with trade-offs:

Table 1. Extraction Pipeline
Input Text
    ↓
┌─────────────────────────────────────┐
│ Stage 1: spaCy (Statistical NER)    │
│   Fast, free, limited types         │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Stage 2: GLiNER (Zero-shot NER)     │
│   Custom types, good accuracy       │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Stage 3: LLM (Context-aware)        │
│   Best accuracy, highest cost       │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Merge & Filter                      │
│   Deduplicate, resolve conflicts    │
└─────────────────────────────────────┘
    ↓
Knowledge Graph

spaCy: Statistical NER

spaCy uses pre-trained statistical models to recognize entities.

Pros:

  • Very fast (native code)

  • No external API calls

  • Free to run

  • Runs offline

Cons:

  • Fixed entity types (PERSON, ORG, GPE, etc.)

  • May miss domain-specific entities

  • Lower accuracy on specialized text

  • No confidence scores

Best for: Fast initial extraction, common entity types, offline use.

GLiNER: Zero-Shot NER

GLiNER uses transformer models to extract entities based on type descriptions, not training data.

Pros:

  • Custom entity types via descriptions

  • Good accuracy on diverse text

  • Confidence scores

  • Works with domain schemas

Cons:

  • Slower than spaCy (neural network)

  • Requires model download (~400MB)

  • May hallucinate on ambiguous text

  • GPU recommended for speed

Best for: Domain-specific extraction, custom types, balancing speed and accuracy.

LLM: Context-Aware Extraction

LLM extractors use large language models (GPT-4, etc.) with structured output.

Pros:

  • Best understanding of context

  • Can handle ambiguous text

  • Explains reasoning

  • Most flexible

Cons:

  • Highest latency

  • Most expensive (API costs)

  • Rate limits

  • May be inconsistent

Best for: High-value extraction, complex text, when accuracy matters most.

Multi-Stage Pipelines

The real power comes from combining extractors in a pipeline:

from neo4j_agent_memory.extraction import (
    ExtractionPipeline,
    GLiNEREntityExtractor,
    MergeStrategy,
    SpacyEntityExtractor,
)

pipeline = ExtractionPipeline(
    stages=[
        SpacyEntityExtractor(),                       # Fast first pass
        GLiNEREntityExtractor.for_schema("podcast"),  # Domain refinement
    ],
    merge_strategy=MergeStrategy.CONFIDENCE,
)

Why Pipelines?

  1. Speed vs. Cost: spaCy catches obvious entities instantly; GLiNER refines with domain knowledge; LLM handles the hard cases.

  2. Redundancy: Multiple extractors provide confirmation. An entity found by both spaCy and GLiNER is more reliable.

  3. Coverage: Different extractors have different strengths. spaCy finds common entities; domain schemas find specialized ones.

Merge Strategies

When multiple extractors find entities, how do we combine results?

CONFIDENCE (Default)

Keep the entity with the highest confidence score:

# If spaCy finds "John" (no confidence) and GLiNER finds "John Smith" (0.95)
# Result: "John Smith" with confidence 0.95

Best for: General use, when you trust confidence scores.

UNION

Keep all unique entities from all stages:

# spaCy: ["John", "Acme"]
# GLiNER: ["John Smith", "Acme Corp", "NYC"]
# Result: ["John", "John Smith", "Acme", "Acme Corp", "NYC"]

Best for: Maximum recall, when you’ll filter/merge later.

INTERSECTION

Keep only entities found by multiple stages:

# spaCy: ["John", "Acme"]
# GLiNER: ["John Smith", "Acme", "NYC"]
# Result: ["Acme"]  # Only "Acme" found by both

Best for: High precision, when you only want confirmed entities.

FIRST

Use the first stage’s results, falling back to later stages on failure:

# If spaCy finds entities, use them
# Otherwise, try GLiNER
# Otherwise, try LLM

Best for: Fast-first extraction with fallback.

LAST

Later stages override earlier:

# GLiNER results override spaCy results
# LLM results override both

Best for: Trust the "smarter" extractor.

Domain Schemas

GLiNER’s zero-shot capability becomes powerful with domain schemas - descriptions of entity types for your specific use case.

# The "podcast" schema
{
    "PERSON": "A person mentioned or interviewed",
    "COMPANY": "A company, startup, or business",
    "PRODUCT": "A product, service, or platform",
    "CONCEPT": "A concept, methodology, or framework",
}

# GLiNER uses these descriptions to guide extraction
extractor = GLiNEREntityExtractor.for_schema("podcast")

Why Descriptions Matter

GLiNER doesn’t just match labels - it uses the semantic meaning of descriptions:

  • "PERSON" → Generic person detection

  • "A person mentioned or interviewed" → Focuses on people in interview context, not passing mentions

This is why domain schemas significantly improve extraction quality.

Relationship Extraction with GLiREL

Beyond entities, GLiREL extracts relationships:

from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

extractor = GLiNERWithRelationsExtractor.for_poleo()
result = await extractor.extract("Marc works at a16z in San Francisco")

# Entities: [Marc, a16z, San Francisco]
# Relations: [Marc -[WORKS_AT]-> a16z, a16z -[LOCATED_IN]-> San Francisco]

Relationship extraction finds typed connections between entities, building richer knowledge graphs.

Automatic Relationship Storage

When adding messages with entity extraction enabled, extracted relationships are automatically stored as RELATED_TO relationships in Neo4j:

# Relationships are stored automatically when adding messages
await memory.short_term.add_message(
    "session-1",
    "user",
    "Brian Chesky founded Airbnb in San Francisco.",
    extract_entities=True,
    extract_relations=True,  # Default: True
)

# This creates:
# - Entity nodes: Brian Chesky (PERSON), Airbnb (ORGANIZATION), San Francisco (LOCATION)
# - MENTIONS relationships: Message -> Entity
# - RELATED_TO relationships: (Brian Chesky)-[:RELATED_TO {relation_type: "FOUNDED"}]->(Airbnb)

The relationship storage works for:

  • add_message() - Extract and store from individual messages (default: extract_relations=True)

  • add_messages_batch() - Batch operations (default: extract_relations=True, only applies when extract_entities=True)

  • extract_entities_from_session() - Post-hoc extraction from existing messages (default: extract_relations=True)

# Post-hoc extraction with relationship storage
result = await memory.short_term.extract_entities_from_session(
    "session-1",
    extract_relations=True,  # Default: True
)
print(f"Extracted {result['relations_extracted']} relationships")

Relationship Storage Strategies

Relationships can reference entities in two ways:

  1. Same-message entities: When both the source and target entity are extracted from the same message, relationships are created using entity IDs (most efficient).

  2. Cross-message entities: When a relationship references an entity from a previous message, the system looks up entities by name. This enables building a connected knowledge graph across a conversation.
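The two lookup paths can be sketched as follows. The names here (resolve_relation, name_index) are hypothetical and only illustrate the resolution order, not the library's actual code:

```python
def resolve_relation(relation, message_entities, name_index):
    """Resolve a (source_name, rel_type, target_name) triple to entity IDs.

    message_entities: {name: entity_id} for entities in the current message.
    name_index: {name: entity_id} for entities stored by earlier messages.
    Returns (source_id, rel_type, target_id), or None if an endpoint is unknown.
    """
    source, rel_type, target = relation

    def lookup(name):
        # Prefer the same-message entity ID (the cheap path), then fall
        # back to a by-name lookup across the whole conversation.
        if name in message_entities:
            return message_entities[name]
        return name_index.get(name)

    src_id, tgt_id = lookup(source), lookup(target)
    if src_id is None or tgt_id is None:
        return None  # endpoint not yet known; skip rather than guess
    return (src_id, rel_type, tgt_id)

# Same-message: both endpoints extracted from the current message.
msg = {"Brian Chesky": "e1", "Airbnb": "e2"}
print(resolve_relation(("Brian Chesky", "FOUNDED", "Airbnb"), msg, {}))
# → ('e1', 'FOUNDED', 'e2')

# Cross-message: "San Francisco" was stored by an earlier message.
session = {"San Francisco": "e3"}
print(resolve_relation(("Airbnb", "LOCATED_IN", "San Francisco"), msg, session))
# → ('e2', 'LOCATED_IN', 'e3')
```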

Performance Considerations

Latency

Typical extraction times for a 500-word document:

Extractor            CPU           GPU
spaCy                10-50ms       N/A
GLiNER               200-500ms     50-100ms
LLM (GPT-4o-mini)    500-2000ms    N/A

Batch Processing

For many documents, use batch extraction:

# texts: a list of documents to process
result = await pipeline.extract_batch(
    texts,
    batch_size=32,      # GLiNER batch size
    max_concurrency=5,  # Parallel extractions
)

GLiNER specifically benefits from batching on GPU.
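The batching pattern can be sketched with asyncio.Semaphore. This illustrates the technique (split into batches, cap in-flight work), not the library's actual implementation; extract_one stands in for a real extractor call:

```python
import asyncio

async def extract_batch(texts, extract_one, batch_size=32, max_concurrency=5):
    """Split texts into batches and run at most max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch):
        async with sem:  # cap the number of in-flight extractions
            return [await extract_one(t) for t in batch]

    results = await asyncio.gather(*(run(b) for b in batches))
    return [entities for batch in results for entities in batch]  # flatten

async def fake_extract(text):  # stand-in for a real extractor
    await asyncio.sleep(0)
    return text.split()

texts = [f"doc {i}" for i in range(10)]
out = asyncio.run(extract_batch(texts, fake_extract, batch_size=4))
print(len(out))  # → 10
```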

Streaming for Long Documents

For very long documents (>100K tokens), use streaming:

from neo4j_agent_memory.extraction import StreamingExtractor

streamer = StreamingExtractor(
    extractor,        # any configured extractor or pipeline
    chunk_size=4000,
    overlap=200,
)

async for chunk_result in streamer.extract_streaming(long_document):
    print(f"Chunk {chunk_result.chunk.index}: {chunk_result.entity_count} entities")

Choosing the Right Approach

Scenario                         Recommendation       Pipeline                  Why
Fast indexing, common entities   spaCy only           [spaCy]                   Speed, simplicity
Domain-specific text             GLiNER with schema   [GLiNER]                  Custom types, good accuracy
High-value documents             Full pipeline        [spaCy → GLiNER → LLM]    Maximum accuracy
Mixed workload                   Tiered pipeline      [spaCy + GLiNER]          Balance speed/accuracy
Relations needed                 GLiNER + GLiREL      [GLiNERWithRelations]     Entity + relation extraction

See Also