# Extractor Classes Reference

Reference for entity extraction classes and the extraction pipeline.
## EntityExtractor Protocol

All extractors implement the `EntityExtractor` protocol:

```python
from typing import Protocol

from neo4j_agent_memory.extraction import BatchExtractionResult, ExtractionResult


class EntityExtractor(Protocol):
    async def extract(self, text: str) -> ExtractionResult:
        """Extract entities from text."""
        ...

    async def extract_batch(
        self,
        texts: list[str],
        batch_size: int = 10,
        max_concurrency: int = 5,
    ) -> BatchExtractionResult:
        """Extract entities from multiple texts."""
        ...
```
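The batching contract (`max_concurrency` capping parallel work) can be sketched with asyncio primitives. This is a standalone illustration with a stand-in extractor, not the library's implementation:

```python
import asyncio


async def extract_one(text: str) -> list[str]:
    # Stand-in for a real extractor call (here: naive whitespace tokens).
    await asyncio.sleep(0)
    return text.split()


async def extract_batch(texts: list[str], max_concurrency: int = 5) -> list[list[str]]:
    """Run extractions concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list[str]:
        async with semaphore:
            return await extract_one(text)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(t) for t in texts))


results = asyncio.run(extract_batch(["John works", "Acme hires Ann"]))
print(results)  # [['John', 'works'], ['Acme', 'hires', 'Ann']]
```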
## ExtractionResult

Result of entity extraction:

```python
@dataclass
class ExtractionResult:
    entities: list[ExtractedEntity]
    source_text: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def entity_count(self) -> int: ...

    def filter_by_type(self, entity_type: str) -> ExtractionResult: ...
    def filter_by_confidence(self, min_confidence: float) -> ExtractionResult: ...
    def filter_invalid_entities(self) -> ExtractionResult: ...
```
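The semantics of confidence filtering can be illustrated with a minimal standalone sketch (a simplified entity class, not the library's implementation):

```python
from dataclasses import dataclass


@dataclass
class ExtractedEntity:
    name: str
    entity_type: str
    confidence: float


def filter_by_confidence(
    entities: list[ExtractedEntity], min_confidence: float
) -> list[ExtractedEntity]:
    """Keep only entities whose confidence meets the threshold."""
    return [e for e in entities if e.confidence >= min_confidence]


entities = [
    ExtractedEntity("John", "PERSON", 0.92),
    ExtractedEntity("Acme Corp", "ORGANIZATION", 0.45),
]
high_confidence = filter_by_confidence(entities, 0.5)
print([e.name for e in high_confidence])  # ['John']
```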
## ExtractedEntity

Individual extracted entity:

```python
@dataclass
class ExtractedEntity:
    name: str
    entity_type: str
    confidence: float
    subtype: str | None = None
    start_pos: int | None = None
    end_pos: int | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def full_type(self) -> str:
        """Returns 'TYPE:SUBTYPE' or just 'TYPE'."""
        ...
```
## SpacyEntityExtractor

Statistical NER using spaCy models.

### Constructor

```python
SpacyEntityExtractor(
    model: str = "en_core_web_sm",
    type_mapping: dict[str, str] | None = None,
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `model` | `"en_core_web_sm"` | spaCy model name (requires installation) |
| `type_mapping` | `None` (default mapping) | Map spaCy labels to POLE+O types |
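How a `type_mapping` translates spaCy labels can be shown with a plain-dictionary sketch. The spaCy labels below are from its standard NER scheme; the POLE+O target names are illustrative assumptions, not the library's defaults:

```python
# Hypothetical mapping from spaCy NER labels to POLE+O-style types.
type_mapping = {
    "PERSON": "PERSON",
    "ORG": "ORGANIZATION",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "EVENT": "EVENT",
}


def map_label(spacy_label: str) -> str:
    # Fall back to the raw spaCy label when no mapping is defined.
    return type_mapping.get(spacy_label, spacy_label)


print(map_label("GPE"))    # LOCATION
print(map_label("MONEY"))  # MONEY (unmapped, passed through)
```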
## GLiNEREntityExtractor

Zero-shot NER using GLiNER models with domain schemas.

### Constructor

```python
GLiNEREntityExtractor(
    model: str = "gliner-community/gliner_medium-v2.5",
    schema: DomainSchema | str | None = None,
    threshold: float = 0.5,
    device: str = "cpu",
)
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `model` | `"gliner-community/gliner_medium-v2.5"` | GLiNER model name |
| `schema` | `None` | Domain schema (name or object) |
| `threshold` | `0.5` | Confidence threshold (0.0-1.0) |
| `device` | `"cpu"` | Device for inference (`"cpu"` or `"cuda"`) |
## LLMEntityExtractor

LLM-based extraction using OpenAI models.

### Constructor

```python
LLMEntityExtractor(
    model: str = "gpt-4o-mini",
    api_key: str | None = None,
    entity_types: list[str] | None = None,
    temperature: float = 0.0,
)
```
## ExtractionPipeline

Multi-stage extraction combining multiple extractors.

### Constructor

```python
ExtractionPipeline(
    stages: list[EntityExtractor],
    merge_strategy: MergeStrategy = MergeStrategy.CONFIDENCE,
)
```

### Merge Strategies

| Strategy | Description |
|---|---|
| `CONFIDENCE` | Keep entity with highest confidence |
| `UNION` | Keep all unique entities from all stages |
| `INTERSECTION` | Keep only entities found by multiple stages |
| `FIRST` | Use first stage's result, fall back to later stages |
| `LAST` | Use last stage's result, override earlier |
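The confidence strategy amounts to keeping the highest-confidence entity per (name, type) key across stages. A simplified standalone sketch of that merge, not the library's exact algorithm:

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    entity_type: str
    confidence: float


def merge_by_confidence(stages: list[list[Entity]]) -> list[Entity]:
    """Keep the highest-confidence entity per (name, type) across all stages."""
    best: dict[tuple[str, str], Entity] = {}
    for stage in stages:
        for entity in stage:
            key = (entity.name, entity.entity_type)
            if key not in best or entity.confidence > best[key].confidence:
                best[key] = entity
    return list(best.values())


spacy_stage = [Entity("John", "PERSON", 0.80)]
gliner_stage = [Entity("John", "PERSON", 0.95), Entity("Acme", "ORGANIZATION", 0.70)]
merged = merge_by_confidence([spacy_stage, gliner_stage])
print(sorted((e.name, e.confidence) for e in merged))
# [('Acme', 0.7), ('John', 0.95)]
```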
### Example

```python
from neo4j_agent_memory.extraction import (
    ExtractionPipeline,
    SpacyEntityExtractor,
    GLiNEREntityExtractor,
    MergeStrategy,
)

pipeline = ExtractionPipeline(
    stages=[
        SpacyEntityExtractor(),
        GLiNEREntityExtractor.for_schema("podcast"),
    ],
    merge_strategy=MergeStrategy.CONFIDENCE,
)

result = await pipeline.extract(text)
```
## ExtractorBuilder

Fluent builder for creating extraction pipelines.

```python
from neo4j_agent_memory.extraction import ExtractorBuilder

extractor = (
    ExtractorBuilder()
    .with_spacy("en_core_web_sm")
    .with_gliner_schema("podcast", threshold=0.5)
    .with_llm_fallback("gpt-4o-mini")
    .merge_by_confidence()
    .build()
)
```

### Builder Methods

| Method | Description |
|---|---|
| `with_spacy()` | Add spaCy extractor |
| `with_gliner()` | Add GLiNER with default schema |
| `with_gliner_schema()` | Add GLiNER with domain schema |
| `with_llm()` | Add LLM extractor |
| `with_llm_fallback()` | Add LLM as fallback stage |
| `merge_by_confidence()` | Use confidence merge strategy |
| `merge_by_union()` | Use union merge strategy |
| `build()` | Create extractor/pipeline |
## GLiRELExtractor

Relation extraction using GLiREL (requires GLiNER entities first).

### Constructor

```python
GLiRELExtractor(
    model: str = "jackboyla/glirel_base",
    relation_types: dict[str, str] | None = None,
    threshold: float = 0.5,
    device: str = "cpu",
)
```

### Example

```python
from neo4j_agent_memory.extraction import (
    GLiNEREntityExtractor,
    GLiRELExtractor,
)

# Extract entities first
entity_extractor = GLiNEREntityExtractor.for_poleo()
entity_result = await entity_extractor.extract(text)

# Then extract relations
relation_extractor = GLiRELExtractor()
relations = await relation_extractor.extract_relations(
    text,
    entities=entity_result.entities,
)

for rel in relations:
    print(f"{rel.source} -[{rel.relation_type}]-> {rel.target}")
```
## GLiNERWithRelationsExtractor

Combined entity and relation extraction.

```python
from neo4j_agent_memory.extraction import GLiNERWithRelationsExtractor

extractor = GLiNERWithRelationsExtractor.for_poleo()
result = await extractor.extract("John works at Acme Corp")

print(result.entities)   # [John, Acme Corp]
print(result.relations)  # [John -[WORKS_AT]-> Acme Corp]
```
## StreamingExtractor

Process long documents in chunks.

```python
from neo4j_agent_memory.extraction import StreamingExtractor

streamer = StreamingExtractor(
    base_extractor,
    chunk_size=4000,
    overlap=200,
    split_on_sentences=True,
)

# Stream results
async for chunk_result in streamer.extract_streaming(long_document):
    print(f"Chunk {chunk_result.chunk.index}: {chunk_result.entity_count} entities")

# Or get the complete result with deduplication
result = await streamer.extract(long_document, deduplicate=True)
```
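The chunking behavior implied by `chunk_size` and `overlap` can be sketched with character-based windows. This is a simplified illustration; the real extractor can also split on sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each window starts this far after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("a" * 10000, chunk_size=4000, overlap=200)
print(len(chunks))  # 3
```

Overlapping windows mean entities spanning a chunk boundary still appear whole in at least one chunk, which is why the complete-result path deduplicates.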
## See Also

- Domain Schemas Reference - Available schemas
- Configure Entity Extraction - Pipeline setup
- Process Documents in Batch - Batch extraction