Under the Covers With LightRAG: Extraction

Bryan Lee

Solutions Engineer, Neo4j

Remember the days when naive retrieval-augmented generation (RAG) felt like magic? Upload a few documents to a vector store, hook it up to an LLM, and voilà! Instant answers! Fast-forward to today, and that naive RAG setup is the new “Hello, World!” of the field. The bar? It’s been raised — repeatedly. There are now an estimated 30+ RAG techniques floating around (thanks to the awesome RAG_Techniques repo from NirDiamant), all chasing the same holy grail: more accurate, reliable, and richly context-aware answers, ideally with the least amount of work.

In this blog series, I’ll break down yet another promising new RAG technique that has been gaining traction lately: LightRAG. It leans into the power of relationships and graphs to push the boundaries of what modern RAG systems can do.

TL;DR

  • Better answers for different question types: The dual-level keyword extraction and hybrid retrieval architecture handles specific, entity-focused questions and broader, conceptual inquiries. This means users get good answers whether they ask about detailed facts or big-picture concepts.
  • Smarter focus on what matters: Entities and relationships are ranked based on node and edge degree, retrieving information of the highest structural significance. This ensures that responses are not only semantically relevant but also anchored in what matters most within the knowledge graph.
  • Easy to update with new information: Powered by a schema-flexible knowledge graph, new entities, facts, and relationships can be easily added. This reduces the need for retraining and re-indexing, making it ideal for organizations with frequent or evolving data.

Introduction

Overall architecture of LightRAG framework — image from LightRAG’s repo

So what’s LightRAG? It shares architectural DNA with frameworks like GraphRAG, leveraging knowledge graphs (e.g., Neo4j) to enrich retrieval with structured, contextual information. But where many approaches treat graphs as an optional add-on, LightRAG makes them central to the retrieval process. Traditional RAG pipelines often rely too heavily on vector similarity over flat chunks. That works well for shallow lookups, but falls short on questions that depend on understanding how things connect. Real-world data is relational by nature, and that’s where graphs shine.

Extraction Pipeline

At its core, LightRAG builds on the idea that structured knowledge (entities and relationships) extracted from raw documents can enhance retrieval quality, but without requiring a community summarization layer like Microsoft’s version of GraphRAG. The latter is a technique also known as query-focused summarization: given a user question and a community level, the relevant community summaries are retrieved and given to the LLM as an enrichment step. You can read more about the different techniques on graphrag.com.

Extraction pipeline workflow in LightRAG – image by author

LightRAG’s pipeline takes a more streamlined path. It begins by cleaning and chunking raw documents, then uses LLMs to extract structured knowledge in the form of entities, relationships, and keywords. This extracted structure is saved into a knowledge graph and, in parallel, indexed into vector databases for fast semantic search. The extraction process is similar to that of GraphRAG but with some proposed enhancements, which are covered below.

Where query focused summarization leverages the power of community summaries to handle ‘global’ questions, LightRAG’s capabilities lie in how it aligns vector search and graph queries at retrieval time to provide grounded, explainable answers.

The key insight here is that LightRAG builds a multi-layered retrieval surface:

  • Semantic chunks for vector search
  • Graph entities and relationships for reasoning and relevance tracking
  • Metadata-enriched relationships for deeper filtering or traversal

The original relationships between concepts are preserved, and entity- and relationship-level embeddings are used for retrieval in conjunction with chunk-level vector embeddings. This layered retrieval strategy provides traceability and a richer context to the LLM, enabling a more accurate and grounded response.

At a high level, LightRAG processes documents in the following way.

Document Preparation Phase

This stage handles raw document ingestion, cleaning, de-duplication, and metadata preparation.

Document Ingestion and Text Cleaning

  • Remove null bytes
  • Strip whitespace

The clean_text function removes null bytes and leading/trailing whitespace from the text, ensuring that it’s ready for processing.

# lightrag/utils.py
def clean_text(text: str) -> str:
    """Clean text by removing null bytes (0x00) and whitespace

    Args:
        text: Input text to clean

    Returns:
        Cleaned text
    """
    return text.strip().replace("\x00", "")

De-Duplication

  • Check for duplicate content
  • Remove duplicates
  • Reconstruct unique content

The extracted contents are iterated over to identify duplicates and prevent the same content from being ingested twice. The contents dictionary is then reconstructed to include only unique entries.

# lightrag/lightrag.py
unique_contents = {}
for id_, content_data in contents.items():
    content = content_data["content"]
    file_path = content_data["file_path"]
    if content not in unique_contents:
        unique_contents[content] = (id_, file_path)

contents = {
    id_: {"content": content, "file_path": file_path}
    for content, (id_, file_path) in unique_contents.items()
}

Content Summarization, Filtering, and Metadata Preparation

  • Generate content summary
  • Truncate if it exceeds max length

The get_content_summary function simply truncates the content to a specified maximum length, appending an ellipsis if the text exceeds that limit. The function effectively creates a brief excerpt, but labeling it as a “summary” might be misleading.

In my opinion, this snippet serves other practical purposes, such as quick previews and more efficient storage when handling large volumes of documents.

# lightrag/utils.py
def get_content_summary(content: str, max_length: int = 250) -> str:
    """Get summary of document content

    Args:
        content: Original document content
        max_length: Maximum length of summary

    Returns:
        Truncated content with ellipsis if needed
    """
    content = content.strip()
    if len(content) <= max_length:
        return content
    return content[:max_length] + "..."

  • Prepare document metadata
  • Filter already ingested
  • Upsert document status/metadata

Each document gets enriched with metadata that includes the earlier truncated preview of the content, along with timestamps for auditability and the original file path for traceability.

# lightrag/lightrag.py
# 3. Generate document initial status
new_docs: dict[str, Any] = {
    id_: {
        "status": DocStatus.PENDING,
        "content": content_data["content"],
        "content_summary": get_content_summary(content_data["content"]),
        "content_length": len(content_data["content"]),
        "created_at": datetime.now().isoformat(),
        "updated_at": datetime.now().isoformat(),
        "file_path": content_data["file_path"],  # Store file path in document status
    }
    for id_, content_data in contents.items()
}

# 4. Filter out already processed documents
# Get docs ids
all_new_doc_ids = set(new_docs.keys())
# Exclude IDs of documents that are already in progress
unique_new_doc_ids = await self.doc_status.filter_keys(all_new_doc_ids)

A database lookup is performed using the MD5 hash-based document ID to verify the uniqueness of the document.

# for entities
compute_mdhash_id(dp["entity_name"], prefix="ent-")

# for relationships
compute_mdhash_id(dp["src_id"] + dp["tgt_id"], prefix="rel-")
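
For context, compute_mdhash_id is a small helper in lightrag/utils.py. Conceptually, it looks like this minimal sketch:

# Conceptual sketch of the ID helper (see lightrag/utils.py)
from hashlib import md5

def compute_mdhash_id(content: str, prefix: str = "") -> str:
    """Build a stable, deterministic ID: a prefix plus the MD5 hex digest."""
    return prefix + md5(content.encode()).hexdigest()

# e.g., for documents
doc_id = compute_mdhash_id(cleaned_content, prefix="doc-")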

If a document’s ID already exists, the filter_keys function removes it from further processing. Below is a sample input for new_docs. Assuming doc-a1b2c3 has already been processed and exists in the database, the pipeline excludes it from further processing, and the metadata records for the remaining documents are then ingested into the key-value store.

# From
new_docs = {
    "doc-a1b2c3": {
        "status": "PENDING",
        "content_summary": "Sample content 1",
        "content_length": 29,
        "file_path": "file1.txt",
        ...
    },
    "doc-123abc": {
        "status": "PENDING",
        "content_summary": "Sample content 2",
        "content_length": 24,
        "file_path": "file2.txt",
        ...
    }
}

# To
new_docs = {
    "doc-123abc": {
        "status": "PENDING",
        "content_summary": "Sample content 2",
        "content_length": 24,
        "file_path": "file2.txt",
        ...
    }
}
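
Conceptually, filter_keys simply subtracts the IDs already tracked by the status store from the incoming set. A minimal sketch, assuming the default JSON-backed doc status storage (the internal field name is illustrative):

# Illustrative sketch of key filtering in a JSON-backed doc status store
async def filter_keys(self, keys: set[str]) -> set[str]:
    """Return only the document IDs not already present in storage."""
    return {key for key in keys if key not in self._data}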

Semantic Enrichment Phase

The following stage focuses on converting cleaned text into usable semantic and graph structures. Once documents are cleaned, de-duplicated, and filtered in the earlier phase, LightRAG moves into its second pre-processing phase, where the real transformation begins. Here, we take unstructured text and chunk, embed, extract, and structure it into vectors and a graph. This is the bedrock of semantic search and graph-powered reasoning.

Chunking and Embedding

  • Chunking by token size

An overlap chunking strategy is used to retain semantic context across adjacent windows, with a default overlap of 128 tokens. We won’t debate the “perfect” chunk size here — that’s a nuanced topic for another post — but the chunking function is entirely configurable via LightRAG.chunking_func.

# lightrag/operate.py
def chunking_by_token_size(
    content: str,
    split_by_character: str | None = None,
    overlap_token_size: int = 128,
    max_token_size: int = 1024,
    ...
) -> list[dict[str, Any]]:
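
To make the overlap concrete, here’s a minimal, self-contained sketch of sliding-window chunking by token count. It uses tiktoken as an assumed tokenizer and skips the metadata (chunk order, token counts) that the real function attaches:

# Simplified sliding-window chunking sketch (assumes tiktoken is installed)
import tiktoken

def simple_chunk(content: str, max_token_size: int = 1024,
                 overlap_token_size: int = 128) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(content)
    chunks = []
    step = max_token_size - overlap_token_size  # advance 896 tokens per window
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_token_size]))
        if start + max_token_size >= len(tokens):
            break  # last window already reached the end
    return chunks
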
  • Generate embeddings

Once the content is chunked, each chunk is passed through your configured embedding function, which could use OpenAI, Claude, or a local embedding model.

The embedding results are stored along with metadata like full_doc_id, file_path, and content into a vector index or database, enabling traceable and explainable semantic search.

# lightrag/lightrag.py
self.chunks_vdb: BaseVectorStorage = self.vector_db_storage_cls(  # type: ignore
    namespace=make_namespace(
        self.namespace_prefix, NameSpace.VECTOR_STORE_CHUNKS
    ),
    embedding_func=self.embedding_func,
    meta_fields={"full_doc_id", "content", "file_path"},
)

chunks_vdb_task = asyncio.create_task(
    self.chunks_vdb.upsert(chunks)
)
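
As a point of reference, a custom embedding function can be as simple as the following sketch, here using the OpenAI SDK with an assumed model name:

# Hypothetical embedding function (model name is an example, not a LightRAG default)
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def my_embedding_func(texts: list[str]) -> list[list[float]]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model works here
        input=texts,
    )
    return [item.embedding for item in response.data]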

Entity and Relationship Extraction

  • Process chunks for extraction
  • LLM extraction prompt

Before entities and relationships can be extracted from the text chunks, LightRAG prepares a highly structured prompt for the LLM. The prompt is composed using configurable entity types and few-shot examples, which gives the user more control.

The following entity_types are configured to be extracted by default. The defaults can be overridden by passing in custom entity types to tailor the extraction process to your domain.

# lightrag/prompt.py
PROMPTS["DEFAULT_ENTITY_TYPES"] = ["organization", "person", "geo", "event", "category"]

# lightrag/operate.py
entity_types = global_config["addon_params"].get(
    "entity_types", PROMPTS["DEFAULT_ENTITY_TYPES"]
)
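
For example, a hypothetical configuration for a legal-domain corpus might override the defaults like this (other constructor arguments elided):

# Hypothetical override of the default entity types for a legal domain
rag = LightRAG(
    working_dir="./rag_storage",
    addon_params={
        "entity_types": ["case", "statute", "court", "judge", "party"],
    },
    ...
)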

If few-shot examples are not explicitly provided, it will default to using a set of predefined few-shot examples in the prompt template located in:

# lightrag/prompt.py
PROMPTS["entity_extraction_examples"]

Each example follows the same format that we expect the LLM to return.

Sample text:

while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths

Sample output:

# lightrag/prompt.py
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character...")
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Power dynamic..."{tuple_delimiter}"conflict"{tuple_delimiter}7)
("content_keywords"{tuple_delimiter}"discovery, control, rebellion")

The prompt uses placeholder delimiters (like “{tuple_delimiter}”), which are dynamically injected with defaults like:

# lightrag/prompt.py
PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|>"
PROMPTS["DEFAULT_RECORD_DELIMITER"] = "##"
PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>"

A full extraction prompt is formed by combining the elements detailed above.

# lightrag/prompt.py
PROMPTS["entity_extraction"] = """---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use {language} as output language.

---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

5. When finished, output {completion_delimiter}

######################
---Examples---
######################
{examples}

#############################
---Real Data---
######################
Entity_types: [{entity_types}]
Text:
{input_text}
######################
Output:"""

Parse extraction results:

The raw final_result string from the LLM is further processed into two dictionaries:

  • maybe_nodes: entity name → list of entity dicts
  • maybe_edges: (source, target) → list of relationship dicts

# lightrag/operate.py
maybe_nodes, maybe_edges = await _process_extraction_result(
    final_result, chunk_key, file_path
)

final_result is subjected to further transformation in subsequent steps. Using the delimiters defined, LightRAG splits the full result into individual records. But what’s probably more interesting is that, unlike traditional entity and relationship extraction, LightRAG goes a step further in quantifying each relationship and characterizing them semantically.

Specifically, each extracted relationship includes:

  • relationship_strength — A numeric score indicating how strong or important the relationship is between the source and target entities. This allows us to model not just that two things are related, but how tightly, how frequently, or how significantly they co-occur or interact.

It’s worth noting that this relationship_strength score comes from the LLM’s interpretation, not hard data. It’s like asking someone, “How close do you think these two people are?” rather than counting actual interactions. The LLM is making an educated guess based on context.

  • relationship_keyword — One or more high-level keywords that summarize the nature of the relationship, capturing themes or concepts (e.g., “conflict”, “collaboration”, “influence”). These act as compact semantic tags that can be used for filtering, clustering, or graph visualization.

Together, these enrich the knowledge graph with contextual metadata that goes beyond just “Entity A is related to Entity B.” During retrieval, you can prioritize relationships based on weight (e.g., “only show me strong connections”), or discard noisy edges with low scores. The result is smarter graph traversal, more focused retrieval, and a better foundation for downstream reasoning.

Additionally, the central themes or topics present in the chunk are summarized as content_keywords. These reflect what the text is about overall and are not tied to any specific entity or relationship.

# Before
("entity"<|>"Alex"<|>"person"<|>"Alex is a character...")##
("relationship"<|>"Alex"<|>"Taylor"<|>"Power dynamic..."<|>"conflict"<|>"7")##
("content_keywords"<|>"discovery, control, rebellion")<|COMPLETE|>

# In Between
[
    '("entity"<|>"Alex"<|>"person"<|>"Alex is a character...")',
    '("relationship"<|>"Alex"<|>"Taylor"<|>"Power dynamic..."<|>"conflict"<|>"7")',
    '("content_keywords"<|>"discovery, control, rebellion")'
]

# After (Above: Entities, Below: Relationships)
["entity", "Alex", "person", "Alex is a character..."]

["relationship", "Alex", "Taylor", "Power dynamic...", "conflict", "7"]
  • Transform entities
  • Transform relationships

Each processed record is then transformed separately, depending on whether it is an entity or a relationship, and metadata is attached to it, as shown below.

# maybe_nodes
{
    "entity_name": "Alex",
    "entity_type": "person",
    "description": "Alex is a character...",
    "source_id": "chunk-123",
    "file_path": "file1.txt"
}

# maybe_edges
{
    "src_id": "Alex",
    "tgt_id": "Taylor",
    "description": "Power dynamic...",
    "keywords": "conflict",
    "weight": 7.0,
    "source_id": "chunk-345",
    "file_path": "file1.txt"
}
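
A minimal illustration of that transformation for an entity record (the real code adds validation and normalization):

# Illustrative conversion of a parsed record into an entity dict
def record_to_entity(record: list[str], chunk_key: str,
                     file_path: str) -> dict | None:
    if len(record) < 4 or record[0] != "entity":
        return None  # not a well-formed entity record
    return {
        "entity_name": record[1],
        "entity_type": record[2],
        "description": record[3],
        "source_id": chunk_key,   # provenance: which chunk it came from
        "file_path": file_path,   # provenance: original source file
    }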

Gleaning Loop (Optional Retry for Low-Confidence Chunks)

  • Gleaning and retry extraction
  • Merge new entities and relationships
  • Check extraction loop
  • Combine all results

There might be cases where the LLM misses entities or relationships in its first pass, especially in dense text chunks. LightRAG includes a simple retry mechanism called gleaning, where it prompts the LLM again (up to a configurable limit) to extract anything that might have been previously overlooked. If new entities or relationships are found, they’ll be appended to the overall results. Notice that the prompt used in the gleaning phase is more aggressive and assertive than the previous extraction prompt in order to force it to cover all ground.

# lightrag/prompt.py
PROMPTS["entity_continue_extraction"] = """
MANY entities and relationships were missed in the last extraction.

---Remember Steps---

1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

5. When finished, output {completion_delimiter}

---Output---

Add them below using the same format:\n
""".strip()

Final Merge, De-Duplication

  • Merge and summarize entities
  • Merge and summarize relationships

After the gleaning phase, LightRAG shifts gears into aggregation mode. This is where it combines the pieces, cleans them up, and saves them into the knowledge graph and the vector database.

All occurrences of the same entities are combined and grouped across chunks. For example, if “Alex” appears in three different chunks with different descriptions, all three entries will now be grouped under all_nodes["Alex"].

Relationships are consolidated across chunks under a consistent, canonical key, regardless of the order in which their endpoints were extracted. This effectively treats relationships as bidirectional.

# lightrag/operate.py
for edge_key, edges in maybe_edges.items():
    sorted_edge_key = tuple(sorted(edge_key))
    all_edges[sorted_edge_key].extend(edges)

It makes sure that similar edges are treated as the same relationship.

# From
("Alex", "Taylor")
("Taylor", "Alex")

# To
sorted(("Alex", "Taylor")) → ("Alex", "Taylor")
sorted(("Taylor", "Alex")) → ("Alex", "Taylor")

Ingestion to Knowledge Graph and Vector Database

Once all chunk-level entity and relationship extractions are complete, LightRAG proceeds to merge, de-duplicate, and upsert this information into the knowledge graph and vector database. Here’s how that process unfolds.

Processing Nodes (Entities)

Each entity is retrieved using the entity_name attribute from the knowledge graph. If a matching node already exists, its existing metadata is retrieved.

  • Picking the dominant entity_type

The entity_type with the highest frequency count is picked using a frequency counter; in the example below, “person” occurs most often.

# example 
entity_types = ["person", "character", "person", "persona", "person", "protagonist", "character"]

# lightrag/operate.py
entity_type = sorted(
    Counter(
        [dp["entity_type"] for dp in nodes_data] + already_entity_types
    ).items(),
    key=lambda x: x[1],
    reverse=True,
)[0][0]

# result
entity_type = "person"
  • Combining description

description values are combined using ||| as a separator to preserve context. If the number of fragments merged exceeds the default threshold of 6, an LLM is used to summarize them into one concise description.

merged_description = (
    "A highly observant character involved in power dynamics.|||Alex is a character..."
)

This balances completeness with readability, avoiding bloat in the knowledge graph.
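
A sketch of that merge-then-summarize logic, with the separator and threshold written out explicitly (the names here are illustrative, not LightRAG’s exact identifiers):

# Illustrative merge-then-summarize logic for entity descriptions
FIELD_SEP = "|||"       # fragment separator, as shown above
SUMMARY_THRESHOLD = 6   # default fragment count before an LLM summary kicks in

async def merge_descriptions(fragments: list[str], use_llm_func) -> str:
    unique = sorted(set(fragments))
    if len(unique) <= SUMMARY_THRESHOLD:
        return FIELD_SEP.join(unique)
    # Too many fragments: compress them into one concise description
    prompt = ("Summarize these descriptions into one concise paragraph:\n"
              + "\n".join(unique))
    return await use_llm_func(prompt)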

  • Combining source_id and file_path

source_id and file_path are combined; these fields aren’t summarized because they’re used for traceability and provenance.

source_id = "chunk-123|||chunk-456"
file_path = "file1.txt|||file2.txt"

Processing Relationships

Relationships are processed similarly: each edge (src_id, tgt_id) pair is looked up in the graph, and if a matching relationship already exists, its existing metadata is retrieved.

  • Combining descriptions and keywords

Descriptions and keywords from all edge mentions are concatenated. Similarly, if the number of fragments merged exceeds the threshold defined, an LLM is used to summarize them into one concise description.

{
    "description": "Power dynamic...|||Mentorship dynamic between Taylor and Alex.",
    "keywords": "conflict, guidance, imbalance"
}
  • Aggregating weight

Each relationship has a weight, representing how strong the connection is or how confidently it was extracted. When the same relationship appears multiple times, the weights are added up.

weight = 7.0 + 5.0 = 12.0
  • Combining source_id and file_path

Again, all source_id and file_path values are preserved for full traceability.

Upserting Nodes/Relationships Into the Knowledge Graph

LightRAG supports Neo4j as a graph store and uses Cypher queries to insert or update entities and relationships. This makes the graph queryable and explainable.

By default, LightRAG uses a generic base label for all nodes. All extracted attributes (like entity_type, description, etc.) are stored as key-value properties on that node.

// Upsert of nodes
MERGE (n:base {entity_id: $entity_id})
SET n += $properties
SET n:`%s`

In my opinion, this approach makes it easier to search and index. However, semantic clarity could be improved by using entity_type as the label instead of just base (e.g., :Person, :Technology, or :Organization makes the graph easier to query contextually in Cypher).
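
For illustration, a hypothetical upsert that promotes the entity type to the label position could look like this, with the label injected the same way the dynamic `%s` label is above:

// Hypothetical alternative: entity_type as the node label
MERGE (n:Person {entity_id: $entity_id})
SET n += $properties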

// Upsert of relationships
MATCH (source:base {entity_id: $source_entity_id})
WITH source
MATCH (target:base {entity_id: $target_entity_id})
MERGE (source)-[r:DIRECTED]-(target)
SET r += $properties
RETURN r, source, target

Upserting Into Vector Database

To enable semantic search, LightRAG also embeds each entity and relationship into a vector and stores it in a vector database.

For each entity and relationship, it builds a content string:

# For entities
content = f"{entity_name}\n{description}"
# e.g., "Alex\nAlex is a character..."

# For relationship
content = f"{src_id}\t{tgt_id}\n{keywords}\n{description}"
# e.g., "Alex\tTaylor\nconflict, guidance\nPower dynamic between them"

This content field is what gets embedded into a high-dimensional vector via an embedding model.

Each of these is stored in a separate vector document.

# entities
{
    "_id": "ent-8b14c7...",
    "entity_name": "Alex",
    "entity_type": "person",
    "content": "Alex\nAlex is a character who is highly observant of power dynamics.",
    "source_id": "chunk-123|||chunk-456",
    "file_path": "file1.txt|||file2.txt",
    "vector": [0.015, -0.782, 0.431, ...]
}

# relationships
{
    "_id": "rel-7f91de...",
    "src_id": "Alex",
    "tgt_id": "Taylor",
    "keywords": "conflict, guidance",
    "content": "...",
    "source_id": "chunk-345|||chunk-567",
    "file_path": "file1.txt|||file2.txt",
    "vector": [0.73, -0.46, 0.2, ...]
}

By persisting knowledge in both a graph database and a vector database, LightRAG unlocks a powerful hybrid retrieval model similar to GraphRAG’s, which we’ll discuss in more detail in the next installment.

# lightrag/operate.py

if entity_vdb is not None and entities_data:
    data_for_vdb = {
        compute_mdhash_id(dp["entity_name"], prefix="ent-"): {
            "entity_name": dp["entity_name"],
            "entity_type": dp["entity_type"],
            "content": f"{dp['entity_name']}\n{dp['description']}",
            "source_id": dp["source_id"],
            "file_path": dp.get("file_path", "unknown_source"),
        }
        for dp in entities_data
    }
    await entity_vdb.upsert(data_for_vdb)

if relationships_vdb is not None and relationships_data:
    data_for_vdb = {
        compute_mdhash_id(dp["src_id"] + dp["tgt_id"], prefix="rel-"): {
            "src_id": dp["src_id"],
            "tgt_id": dp["tgt_id"],
            "keywords": dp["keywords"],
            "content": f"{dp['src_id']}\t{dp['tgt_id']}\n{dp['keywords']}\n{dp['description']}",
            "source_id": dp["source_id"],
            "file_path": dp.get("file_path", "unknown_source"),
        }
        for dp in relationships_data
    }
    await relationships_vdb.upsert(data_for_vdb)

Neo4j offers the capabilities of both a knowledge graph and a vector database, enabling hybrid retrieval within a single platform.

Summary

The authors of LightRAG have introduced a smart step forward in how we connect siloed, unstructured documents and transform them into a rich web of connected knowledge.

Key takeaways: The extraction happens in three stages: cleaning up the mess, breaking content into digestible chunks, and letting an LLM identify the important entities and how they connect. By storing this information in both a knowledge graph and a vector database, the system captures meaning and can explain its reasoning.

Having beautifully structured knowledge is akin to having a Ferrari in your garage. It’s impressive, but useless if you don’t have the keys (retrieval) to drive it effectively.

Next Steps

Don’t miss the second installment of Under the Covers With LightRAG, where we’ll dive into the retrieval process. You’ll discover how LightRAG combines the best of both worlds — graph intelligence and semantic search — to deliver responses that are actually trustworthy and traceable back to their sources.

Ready to take LightRAG for a spin? Check it out on GitHub, drop me a comment below, or connect with me via email or on LinkedIn.

See you next time!

