Nodes2024

Dev Conference by Neo4j

From Image to Graph: Leveraging Tesseract OCR Engine for Document Chunking and GraphRAG

Session Track: AI

Session description

In this session, you will learn how to translate the content of images into a graph representation using the Tesseract OCR Engine. Using the locations of the words Tesseract identifies, you will learn how to create a hierarchy of document chunk nodes: level, block, paragraph, and line. With a hierarchy of chunks, you can easily traverse chunks of different sizes that relate to the same information. This can benefit RAG, where smaller chunks tend to work better for vector similarity search and larger chunks tend to provide better context for question answering.
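
As a rough illustration of the chunk hierarchy described above, here is a minimal Python sketch. It is not the speaker's implementation. It assumes word-level OCR output shaped like the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which includes `text`, `page_num`, `block_num`, `par_num`, and `line_num` lists. The `build_chunk_hierarchy` function, the chunk key tuples, and the sample data are hypothetical names and values chosen for this example, and only the block, paragraph, and line levels are built.

```python
from collections import defaultdict


def build_chunk_hierarchy(data):
    """Group Tesseract-style word rows into block, paragraph, and line chunks.

    Returns (chunks, edges): chunks maps a chunk key to its text, and edges
    is a sorted list of (parent key, child key) pairs.
    """
    words = defaultdict(list)  # chunk key -> words in reading order
    edges = set()              # (parent key, child key) pairs
    rows = zip(data["text"], data["page_num"], data["block_num"],
               data["par_num"], data["line_num"])
    for text, page, block, par, line in rows:
        if not text.strip():
            continue  # skip rows with empty text (no recognized word)
        block_key = ("block", page, block)
        par_key = ("paragraph", page, block, par)
        line_key = ("line", page, block, par, line)
        for key in (block_key, par_key, line_key):
            words[key].append(text)
        edges.add((block_key, par_key))
        edges.add((par_key, line_key))
    chunks = {key: " ".join(ws) for key, ws in words.items()}
    return chunks, sorted(edges)


# Hand-written sample standing in for real OCR output (illustrative only).
sample = {
    "text":      ["", "Graphs", "are", "everywhere.", "They", "connect", "data."],
    "page_num":  [1, 1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1, 1],
    "par_num":   [0, 1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1, 2, 2, 2],
}

chunks, edges = build_chunk_hierarchy(sample)

# Walk from a small chunk (a line) up to its larger parent (the paragraph).
parent = {child: par for par, child in edges}
small = ("line", 1, 1, 1, 2)
print(chunks[small])          # They connect data.
print(chunks[parent[small]])  # Graphs are everywhere. They connect data.
```

In a GraphRAG setting, each key could become a chunk node and each edge a parent-child relationship, so that a vector search can match on a line and then traverse upward to the paragraph or block for fuller context.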

Speaker

Kim Adler

Manager, Pfizer

Kim Adler is currently a Data Translator Manager within the Digital Manufacturing Operations & Insights (O&I) Group at Pfizer. In this role, she serves as product owner and data scientist for an internal search engine for operations data that lives within the O&I Knowledge Graph database. Prior to joining Pfizer full-time, Kim completed a community detection capstone project with the O&I group as part of the Masters in Business Analytics program at MIT Sloan School of Management. When not in deep thought about graphs, Kim can be found going on long walks with her dog, Sophie.