Entity Linking and Relationship Extraction With Relik in LlamaIndex

Graph ML and GenAI Research, Neo4j


Environment Setup
I suggest you use a separate Python environment like Google Colab, as we will have to play around with dependencies a bit. The models are faster on GPU, so use a GPU-powered runtime if you have the Pro version. Additionally, we need to set up Neo4j, a native graph database, to store the extracted information. There are many ways to set up your database instance; however, I recommend Neo4j Aura, which provides a free cloud instance that can easily be accessed from a Google Colab notebook. After the database has been created, we can define a connection using LlamaIndex:

```python
from llama_index.graph_stores.neo4j import Neo4jPGStore

username = "neo4j"
password = "rubber-cuffs-radiator"
url = "bolt://54.89.19.156:7687"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
    refresh_schema=False,
)
```
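Before moving on, it can help to verify that the connection actually works. Here is a minimal, hypothetical smoke test, assuming the store's structured_query helper behaves as in current LlamaIndex versions:

```python
# Hypothetical smoke test: count the nodes currently in the database.
result = graph_store.structured_query("MATCH (n) RETURN count(n) AS nodes")
print(result)  # e.g., [{'nodes': 0}] on a fresh Aura instance
```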
Dataset
We will use a news dataset I obtained via the Diffbot API some time ago. The dataset is conveniently available on GitHub for us to reuse:

```python
import pandas as pd

NUMBER_OF_ARTICLES = 100

news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)
news = news.head(NUMBER_OF_ARTICLES)
```
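Optionally, you can take a quick look at the data to confirm the columns we rely on later ("title" and "text") are present:

```python
# Optional sanity check on the dataset shape and columns.
print(news.shape)
print(news.columns.tolist())  # expecting at least "title" and "text"
print(news["text"].str.len().describe())  # rough article lengths
```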
Coreference Resolution
The first step in the pipeline is a coreference resolution model. Coreference resolution is the task of identifying all expressions in a text that refer to the same entity. To my knowledge, there aren't many open-source models available for coreference resolution. I tried maverick-coref, but in my tests Coreferee from spaCy worked better, so we will use that. The only disadvantage of using Coreferee is that we have to deal with dependency hell, which is solved in the notebook, but we won't go through it here. You can load the coreference model in spaCy with the following code:

```python
import spacy
import coreferee

coref_nlp = spacy.load("en_core_web_lg")
coref_nlp.add_pipe("coreferee")
```

The Coreferee model detects clusters of expressions that refer to the same entity or entities. To rewrite the text based on these clusters, we have to implement our own function:
```python
def coref_text(text):
    coref_doc = coref_nlp(text)
    resolved_text = ""

    for token in coref_doc:
        # Resolve the token to the representative mention(s) of its
        # coreference chain, if it belongs to one.
        repres = coref_doc._.coref_chains.resolve(token)
        if repres:
            resolved_text += " " + " and ".join(
                [
                    t.text
                    if t.ent_type_ == ""
                    # If the representative token is part of a named entity,
                    # use the full entity text instead of the single token.
                    else [e.text for e in coref_doc.ents if t in e][0]
                    for t in repres
                ]
            )
        else:
            resolved_text += " " + token.text

    return resolved_text
```

Let's test the function to make sure the models and dependencies are set up properly:
```python
print(
    coref_text("Tomaz is so cool. He can solve various Python dependencies and not cry")
)
# Tomaz is so cool . Tomaz can solve various Python dependencies and not cry
```

In this example, the model identified that "Tomaz" and "He" refer to the same entity. Using the coref_text function, we replace "He" with "Tomaz." Note that the rewriting doesn't always produce grammatically correct sentences, because of the simple replace logic it uses for entities within a cluster. However, it should be good enough for most scenarios. Now we apply the coreference resolution to our news dataset and wrap the results as LlamaIndex documents:
```python
from llama_index.core import Document

news["coref_text"] = news["text"].apply(coref_text)

documents = [
    Document(text=f"{row['title']}: {row['coref_text']}")
    for i, row in news.iterrows()
]
```
Entity Linking and Relationship Extraction
Relik is a library with models for entity linking (EL) and relationship extraction (RE), and it also supports models that combine the two. In entity linking, Wikipedia is used as the target knowledge base to map entities in text to their corresponding entries in the encyclopedia.
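To get a feel for what the model produces before wiring it into LlamaIndex, you can call Relik directly. This is a minimal sketch based on the relik library's documented from_pretrained API; the example sentence and the exact output attributes are illustrative and may vary by version:

```python
from relik import Relik

# Load the same small relation-extraction model we use below.
relik_model = Relik.from_pretrained("relik-ie/relik-relation-extraction-small")

out = relik_model("Michael Jordan was a player for the Chicago Bulls.")
print(out.spans)     # detected entity spans (the cie variants also link them to Wikipedia)
print(out.triplets)  # (head, relation, tail) triples between those entities
```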

We instantiate Relik through LlamaIndex's RelikPathExtractor wrapper:

```python
from llama_index.extractors.relik.base import RelikPathExtractor

relik = RelikPathExtractor(
    model="relik-ie/relik-relation-extraction-small"
)

# Use on Colab Pro with a GPU
# relik = RelikPathExtractor(
#     model="relik-ie/relik-cie-small",
#     model_config={"skip_metadata": True, "device": "cuda"},
# )
```

Additionally, we have to define the embedding model that will be used to embed entities, as well as the LLM for the question-answering flow:
```python
import os

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-"

llm = OpenAI(model="gpt-4o", temperature=0.0)
embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
```

Note that the LLM will not be used during graph construction. Now that we have everything in place, we can instantiate a PropertyGraphIndex and use the news documents as input to a knowledge graph. Additionally, we need to pass the relik model as the kg_extractors value to extract the relationships:
```python
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[relik],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)
```

After constructing the graph, you can open Neo4j Browser to validate the imported graph. You should get a similar visualization by running the following Cypher statement:
```cypher
MATCH p=(:__Entity__)--(:__Entity__)
RETURN p LIMIT 250
```

Results: a visualization of the extracted entity graph in Neo4j Browser.
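You can also spot-check the import programmatically from the notebook by reusing the graph_store connection. A hypothetical example of the kind of counts you might run:

```python
# Count the extracted entities.
print(graph_store.structured_query(
    "MATCH (e:__Entity__) RETURN count(e) AS entities"
))

# Most frequent relationship types between entities.
print(graph_store.structured_query(
    """MATCH (:__Entity__)-[r]->(:__Entity__)
       RETURN type(r) AS rel, count(*) AS cnt
       ORDER BY cnt DESC LIMIT 5"""
))
```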

Question Answering
Using LlamaIndex, it is now easy to perform question answering. With the default graph retrievers, you can ask questions as straightforward as:

```python
query_engine = index.as_query_engine(include_text=True)

response = query_engine.query("What happened at Ryanair?")
print(str(response))
```

This is where the defined LLM and embedding model come into play. Of course, you can also implement custom retrievers for potentially better accuracy, as sketched below.
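As one example, here is a sketch that swaps in LlamaIndex's VectorContextRetriever with explicit parameters. The class and its arguments come from the property-graph retriever API, but treat the exact knobs as assumptions that may differ across versions:

```python
from llama_index.core.indices.property_graph import VectorContextRetriever

# Find seed entities by vector similarity, then expand along graph paths.
vector_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=embed_model,
    include_text=True,    # attach source text chunks to the retrieved paths
    similarity_top_k=4,   # number of seed entities to start from
    path_depth=1,         # how far to expand from each seed entity
)

query_engine = index.as_query_engine(sub_retrievers=[vector_retriever], llm=llm)
print(query_engine.query("What happened at Ryanair?"))
```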