1 of 15

RAG ASSISTANT

A Personal Assistant

SACHIN JADHAV

RUTHWIK DASYAM

ZAHIR MAHAMMAD

2 of 15

CONTENTS

    • Project Scope
    • Problem
    • Related Work
    • Approach
    • Results & Findings

3 of 15

PROJECT SCOPE

    • Nobody wants to dig through documents by hand; most people wish they had a tool that keeps track of their documents for them.
    • This tool, the Assistant, reads all your documents and answers whenever you need something from them.

Rather than giving general answers, a personalized assistant that:

    • has access to your data
    • stays on your system
    • shares no information externally, which protects your data
    • can answer queries from the documents
    • can read and extract information even from images in the documents

4 of 15

PROBLEM

Goal: Retrieval-Augmented Generation

Develop a multi-modal foundation model that can retrieve and understand data from documents stored on local systems, regardless of document type (image, text, chart, or table). The assistant should:

    • Accurately read and answer queries based on the content of the retrieved documents.
    • Retrieve relevant embeddings and provide answers solely based on the documents.
    • Respond with “no information found” if the requested information is not present in the documents.
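The "answer only from the documents, otherwise say so" requirement can be sketched as a similarity threshold on the retrieved content. This is a minimal illustration, not the project's implementation: the bag-of-words `embed` function, the `min_score` value, and the function names are all assumptions, and a real system would use a learned embedding model and pass the retrieved chunk to a language model.

```python
import math
from collections import Counter

NO_INFO = "no information found"

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer(query, chunks, min_score=0.2):
    """Return the best-matching chunk, or the fallback when nothing is similar enough."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    best_score, best_chunk = max(scored, default=(0.0, None))
    if best_chunk is None or best_score < min_score:
        return NO_INFO  # hypothetical threshold: tune on real queries
    return best_chunk
```

The threshold is the design choice that matters here: set too low, the assistant answers from unrelated text; set too high, it refuses questions the documents do cover.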

The solution should include a UI-based chatbot that runs locally, allowing users to interact via speech or text:

    • Speak to receive spoken answers.
    • Type queries to receive text-based responses.

5 of 15

RELATED WORK

1. Traditional OCR-based approaches

      • Limited to text extraction without context preservation
      • Struggled with complex layouts and mixed content formats

Reference: Smith, R. (2007). An Overview of the Tesseract OCR Engine. ICDAR 2007.

2. Poppler Library

      • Advanced PDF rendering and processing capabilities
      • Enables high-fidelity document conversion with visual element preservation
      • Foundation for modern PDF processing tools like pdf2image

Reference: Poppler Development Team. (2021). Poppler: PDF rendering library. freedesktop.org.

3. FAISS for textual content indexing

      • Implements clustering-based (inverted file) indexing and product quantization for efficient retrieval
      • Enables billion-scale similarity search with GPU acceleration, optimized for large document collections

Reference: Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
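The clustering idea behind this kind of index can be sketched in a few lines. FAISS itself is a C++ library with Python bindings; the pure-Python code below does not use its API and only illustrates coarse clustering: vectors are grouped into cells, and a query scans only the closest cells. The function names, the hand-picked centroids, and the `nprobe` parameter name are assumptions for illustration (a real index would learn centroids, typically with k-means).

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_cells(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))
        cells[nearest].append(vid)
    return cells

def search(query, vectors, centroids, cells, k=1, nprobe=1):
    """Scan only the nprobe cells closest to the query, then rank their members exactly."""
    probe = sorted(range(len(centroids)), key=lambda c: dist2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    return sorted(candidates, key=lambda vid: dist2(query, vectors[vid]))[:k]
```

Scanning fewer cells is faster but can miss a true nearest neighbor that sits just across a cell boundary, which is the usual speed versus recall trade-off of approximate search.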

4. Donut (Hugging Face model)

      • OCR-free document understanding transformer, trained end to end
      • Combines visual and textual understanding in a single model

Reference: Kim, G., Hong, S., et al. (2022). OCR-free Document Understanding Transformer. European Conference on Computer Vision (ECCV 2022).

6 of 15

Text-Based RAG

APPROACH

    • Data Preparation
      • Load PDF documents [text only]
    • Embedding and Indexing
      • Creates and maintains a FAISS index for fast similarity search
    • Query Processing
      • Retrieves relevant context from the vector database
    • Response Generation
      • Generates a response using the pre-trained Mistral 7B model
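The four stages above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: word overlap stands in for embedding similarity, a plain list stands in for the FAISS index, and `generate` is a hypothetical callable that would wrap the Mistral 7B model. All function names and the prompt wording are illustrative, not taken from the project.

```python
def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping word windows so context is not cut at chunk edges."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def score(query, chunk):
    """Toy relevance: number of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, k=2):
    """Return the k chunks that score highest against the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not there, reply 'no information found'.\n"
        f"Context:\n{context}\nQuestion: {query}\nAnswer:"
    )

def rag_answer(query, document_text, generate, k=2):
    """Chunk, retrieve, build the prompt, then hand it to the (hypothetical) generator."""
    chunks = chunk_text(document_text)
    return generate(build_prompt(query, retrieve(query, chunks, k)))
```

Passing `generate` in as a parameter keeps the retrieval logic testable without loading a model; any function that maps a prompt string to a response string can be plugged in.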

7 of 15

Image-Based RAG

APPROACH

    • PDF to Image Conversion
      • Load PDF documents and render each page as an image
    • ColPali Multimodal Document Retrieval
      • Indexes and retrieves relevant document pages based on text queries
    • Qwen2 Visual Language Processing
      • Qwen2 model analyzes both the query and retrieved document images
    • Response Generation
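Page retrieval in ColPali is described in its paper as ColBERT-style late interaction: every query token is matched against its best page patch, and those maxima are summed. The sketch below shows only that scoring rule on hand-made toy vectors. The function names are assumptions, and real embeddings would come from the model rather than being typed in.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, page_vecs):
    """Late-interaction score: best patch match per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def rank_pages(query_vecs, pages):
    """pages maps a page id to its list of patch vectors; returns page ids, best first."""
    return sorted(pages, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)
```

The top-ranked page images would then be passed, together with the query, to the vision-language model for answer generation.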


8 of 15

Combined Approach

APPROACH

    • PDF to Image Conversion
      • Load PDF documents and render each page as an image
    • ColPali Multimodal Document Retrieval
      • Indexes and retrieves relevant document pages based on text queries
    • Qwen2 Visual Language Processing
      • Qwen2 model analyzes both the query and retrieved document images
    • Response Generation

[Figure: diagram labels "Text in doc" and "Image in doc"]
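One way the "Text in doc" and "Image in doc" cases could be combined is to route each page to the pipeline that suits its content. The slides do not state how the two pipelines are joined, so everything below is an assumption: the `Page` fields, the routing rule, and the choice to send mixed pages to both pipelines are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Page:
    doc: str               # source file name
    number: int            # page number within the document
    text: str = ""         # extractable text, if any
    has_image: bool = False

def route_pages(pages):
    """Split pages between the text pipeline and the image pipeline (assumed rule)."""
    text_pages, image_pages = [], []
    for page in pages:
        if page.has_image:
            image_pages.append(page)  # figures, charts, scans: visual retrieval
        if page.text.strip():
            text_pages.append(page)   # extractable text: text retrieval
    return text_pages, image_pages
```

Under this rule a page with both text and figures is indexed twice, which costs storage but lets either kind of query find it.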

9 of 15

RESULTS

Input: a folder containing PDFs

    • TECHicago_Magazine_Final.pdf
      • Magazine related to Chicago
    • strategic_plan.pdf
      • Magazine related to University of Maryland

The chatbot extracts the image and text data from the PDFs and stores the indices.

It then retrieves the relevant context based on the user query.
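Collecting the input files is the first step before any indexing. A minimal sketch, assuming the folder may contain subfolders and non-PDF files; the function name `find_pdfs` is hypothetical.

```python
from pathlib import Path

def find_pdfs(folder):
    """Return all PDF files under folder (recursively), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"  # accept .pdf and .PDF alike
    )
```

Each returned path would then be handed to the text and image indexing steps.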

10 of 15

RESULTS

User Query - What does the image represent in the quantum section of tech mag

11 of 15

RESULTS

User Query - What percentage of women owned startups in the world does chicago have

12 of 15

RESULTS

User Query - Which is in top 10 univ of Computer Science for undergrads

13 of 15

RESULTS AND FINDINGS

GUI for the Chatbot

14 of 15

RESULTS AND FINDINGS

Audio Input and Audio Output for the Chatbot

15 of 15

Thank you.