How Entity Extraction Works

Understanding the multi-stage extraction pipeline, trade-offs between speed and accuracy, and how extractors work together.

The Challenge

Entity extraction from natural language is hard. Consider this text:

"Marc Andreessen and Ben Horowitz discussed their investment in OpenAI on the a]6z podcast. The San Francisco-based fund focuses on AI companies."

Humans easily identify: Marc Andreessen (person), Ben Horowitz (person), OpenAI (organization), a16z (organization), San Francisco (location), AI (technology/concept).

Machines need help.
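A naive baseline makes the gap concrete. The snippet below (illustrative only, not part of the library) treats runs of capitalized words as entities:

```python
import re

TEXT = ("Marc Andreessen and Ben Horowitz discussed their investment in "
        "OpenAI on the a16z podcast. The San Francisco-based fund focuses "
        "on AI companies.")

# Naive heuristic: any run of capitalized words is an "entity".
naive = re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", TEXT)
print(naive)
# → ['Marc Andreessen', 'Ben Horowitz', 'Open', 'The San Francisco']
```

It misses a16z and AI entirely, truncates OpenAI, and swallows the sentence-initial "The": exactly the failure modes that NER models exist to avoid.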

Three Extraction Approaches

neo4j-agent-memory supports three fundamentally different extraction approaches, each with trade-offs:

Table 1. Extraction Pipeline
Input Text
    ↓
┌─────────────────────────────────────┐
│ Stage 1: spaCy (Statistical NER)    │
│   Fast, free, limited types         │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Stage 2: GLiNER (Zero-shot NER)     │
│   Custom types, good accuracy       │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Stage 3: LLM (Context-aware)        │
│   Best accuracy, highest cost       │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Merge & Filter                      │
│   Deduplicate, resolve conflicts    │
└─────────────────────────────────────┘
    ↓
Knowledge Graph

spaCy: Statistical NER

spaCy uses pre-trained statistical models to recognize entities.

Pros:

  • Very fast (native code)

  • No external API calls

  • Free to run

  • Runs offline

Cons:

  • Fixed entity types (PERSON, ORG, GPE, etc.)

  • May miss domain-specific entities

  • Lower accuracy on specialized text

  • No confidence scores

Best for: Fast initial extraction, common entity types, offline use.

GLiNER: Zero-Shot NER

GLiNER uses transformer models to extract entities based on type descriptions, not training data.

Pros:

  • Custom entity types via descriptions

  • Good accuracy on diverse text

  • Confidence scores

  • Works with domain schemas

Cons:

  • Slower than spaCy (neural network)

  • Requires model download (~400MB)

  • May hallucinate on ambiguous text

  • GPU recommended for speed

Best for: Domain-specific extraction, custom types, balancing speed and accuracy.

LLM: Context-Aware Extraction

LLM extractors use large language models (GPT-4, etc.) with structured output.

Pros:

  • Best understanding of context

  • Can handle ambiguous text

  • Explains reasoning

  • Most flexible

Cons:

  • Highest latency

  • Most expensive (API costs)

  • Rate limits

  • May be inconsistent

Best for: High-value extraction, complex text, when accuracy matters most.

Multi-Stage Pipelines

The real power comes from combining extractors in a pipeline:

from neo4j_agent_memory.extraction import (
    ExtractionPipeline,
    GLiNEREntityExtractor,
    MergeStrategy,
    SpacyEntityExtractor,
)

pipeline = ExtractionPipeline(
    stages=[
        SpacyEntityExtractor(),                       # Fast first pass
        GLiNEREntityExtractor.for_schema("podcast"),  # Domain refinement
    ],
    merge_strategy=MergeStrategy.CONFIDENCE,
)

Why Pipelines?

  1. Speed vs. Cost: spaCy catches obvious entities instantly; GLiNER refines with domain knowledge; LLM handles the hard cases.

  2. Redundancy: Multiple extractors provide confirmation. An entity found by both spaCy and GLiNER is more reliable.

  3. Coverage: Different extractors have different strengths. spaCy finds common entities; domain schemas find specialized ones.

Merge Strategies

When multiple extractors find entities, how do we combine results?

CONFIDENCE (Default)

Keep the entity with the highest confidence score:

# If spaCy finds "John" (no confidence) and GLiNER finds "John Smith" (0.95)
# Result: "John Smith" with confidence 0.95

Best for: General use, when you trust confidence scores.

UNION

Keep all unique entities from all stages:

# spaCy: ["John", "Acme"]
# GLiNER: ["John Smith", "Acme Corp", "NYC"]
# Result: ["John", "John Smith", "Acme", "Acme Corp", "NYC"]

Best for: Maximum recall, when you’ll filter/merge later.

INTERSECTION

Keep only entities found by multiple stages:

# spaCy: ["John", "Acme"]
# GLiNER: ["John Smith", "Acme", "NYC"]
# Result: ["Acme"]  # Only "Acme" found by both

Best for: High precision, when you only want confirmed entities.

FIRST

Use the first stage’s results, falling back to later stages on failure:

# If spaCy finds entities, use them
# Otherwise, try GLiNER
# Otherwise, try LLM

Best for: Fast-first extraction with fallback.

LAST

Later stages override earlier:

# GLiNER results override spaCy results
# LLM results override both

Best for: Trust the "smarter" extractor.

Domain Schemas

GLiNER’s zero-shot capability becomes powerful with domain schemas - descriptions of entity types for your specific use case.

# The "podcast" schema
{
    "PERSON": "A person mentioned or interviewed",
    "COMPANY": "A company, startup, or business",
    "PRODUCT": "A product, service, or platform",
    "CONCEPT": "A concept, methodology, or framework",
}

# GLiNER uses these descriptions to guide extraction
extractor = GLiNEREntityExtractor.for_schema("podcast")

Why Descriptions Matter

GLiNER doesn’t just match labels - it uses the semantic meaning of descriptions:

  • "PERSON" → Generic person detection

  • "A person mentioned or interviewed" → Focuses on people in interview context, not passing mentions

This is why domain schemas significantly improve extraction quality.

Relationship Extraction with GLiREL

Beyond entities, GLiREL extracts relationships:

from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

extractor = GLiNERWithRelationsExtractor.for_poleo()
result = await extractor.extract("Marc works at a16z in San Francisco")

# Entities: [Marc, a16z, San Francisco]
# Relations: [Marc -[WORKS_AT]-> a16z, a16z -[LOCATED_IN]-> San Francisco]

Relationship extraction finds typed connections between entities, building richer knowledge graphs.

Automatic Relationship Storage

When adding messages with entity extraction enabled, extracted relationships are automatically stored as RELATED_TO relationships in Neo4j:

# Relationships are stored automatically when adding messages
await memory.short_term.add_message(
    "session-1",
    "user",
    "Brian Chesky founded Airbnb in San Francisco.",
    extract_entities=True,
    extract_relations=True,  # Default: True
)

# This creates:
# - Entity nodes: Brian Chesky (PERSON), Airbnb (ORGANIZATION), San Francisco (LOCATION)
# - MENTIONS relationships: Message -> Entity
# - RELATED_TO relationships: (Brian Chesky)-[:RELATED_TO {relation_type: "FOUNDED"}]->(Airbnb)

The relationship storage works for:

  • add_message() - Extract and store from individual messages (default: extract_relations=True)

  • add_messages_batch() - Batch operations (default: extract_relations=True, only applies when extract_entities=True)

  • extract_entities_from_session() - Post-hoc extraction from existing messages (default: extract_relations=True)

# Post-hoc extraction with relationship storage
result = await memory.short_term.extract_entities_from_session(
    "session-1",
    extract_relations=True,  # Default: True
)
print(f"Extracted {result['relations_extracted']} relationships")

Relationship Storage Strategies

Relationships can reference entities in two ways:

  1. Same-message entities: When both the source and target entity are extracted from the same message, relationships are created using entity IDs (most efficient).

  2. Cross-message entities: When a relationship references an entity from a previous message, the system looks up entities by name. This enables building a connected knowledge graph across a conversation.
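The two lookup paths can be sketched as follows. The names here (resolve_relation, name_index) are hypothetical and only illustrate the resolution order, not the library's actual code:

```python
def resolve_relation(relation, message_entities, name_index):
    """Resolve a (source_name, rel_type, target_name) triple to entity IDs.

    message_entities: {name: entity_id} for entities in the current message.
    name_index: {name: entity_id} for entities stored by earlier messages.
    Returns (source_id, rel_type, target_id), or None if an endpoint is unknown.
    """
    source, rel_type, target = relation

    def lookup(name):
        # Prefer the same-message entity ID (the cheap path), then fall
        # back to a by-name lookup across the whole conversation.
        if name in message_entities:
            return message_entities[name]
        return name_index.get(name)

    src_id, tgt_id = lookup(source), lookup(target)
    if src_id is None or tgt_id is None:
        return None  # endpoint not yet known; skip rather than guess
    return (src_id, rel_type, tgt_id)

# Same-message: both endpoints extracted from the current message.
msg = {"Brian Chesky": "e1", "Airbnb": "e2"}
print(resolve_relation(("Brian Chesky", "FOUNDED", "Airbnb"), msg, {}))
# → ('e1', 'FOUNDED', 'e2')

# Cross-message: "San Francisco" was stored by an earlier message.
session = {"San Francisco": "e3"}
print(resolve_relation(("Airbnb", "LOCATED_IN", "San Francisco"), msg, session))
# → ('e2', 'LOCATED_IN', 'e3')
```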

Performance Considerations

Latency

Typical extraction times for a 500-word document:

Extractor            CPU           GPU
spaCy                10-50ms       N/A
GLiNER               200-500ms     50-100ms
LLM (GPT-4o-mini)    500-2000ms    N/A

Batch Processing

For many documents, use batch extraction:

# texts: a list of documents to process
result = await pipeline.extract_batch(
    texts,
    batch_size=32,      # GLiNER batch size
    max_concurrency=5,  # Parallel extractions
)

GLiNER specifically benefits from batching on GPU.
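The batching pattern can be sketched with asyncio.Semaphore. This illustrates the technique (split into batches, cap in-flight work), not the library's actual implementation; extract_one stands in for a real extractor call:

```python
import asyncio

async def extract_batch(texts, extract_one, batch_size=32, max_concurrency=5):
    """Split texts into batches and run at most max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch):
        async with sem:  # cap the number of in-flight extractions
            return [await extract_one(t) for t in batch]

    results = await asyncio.gather(*(run(b) for b in batches))
    return [entities for batch in results for entities in batch]  # flatten

async def fake_extract(text):  # stand-in for a real extractor
    await asyncio.sleep(0)
    return text.split()

texts = [f"doc {i}" for i in range(10)]
out = asyncio.run(extract_batch(texts, fake_extract, batch_size=4))
print(len(out))  # → 10
```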

Streaming for Long Documents

For very long documents (>100K tokens), use streaming:

from neo4j_agent_memory.extraction import StreamingExtractor

streamer = StreamingExtractor(
    extractor,        # any configured extractor or pipeline
    chunk_size=4000,
    overlap=200,
)

async for chunk_result in streamer.extract_streaming(long_document):
    print(f"Chunk {chunk_result.chunk.index}: {chunk_result.entity_count} entities")

Choosing the Right Approach

Scenario                         Recommendation       Pipeline                  Why
Fast indexing, common entities   spaCy only           [spaCy]                   Speed, simplicity
Domain-specific text             GLiNER with schema   [GLiNER]                  Custom types, good accuracy
High-value documents             Full pipeline        [spaCy → GLiNER → LLM]    Maximum accuracy
Mixed workload                   Tiered pipeline      [spaCy + GLiNER]          Balance speed/accuracy
Relations needed                 GLiNER + GLiREL      [GLiNERWithRelations]     Entity + relation extraction

See Also