Retrieval from Large Document Image Collections

People involved: Riya Gupta, C.V. Jawahar; Dataset & Server Handlers: Varun Bhargavan, Aradhana Vinod

Pipeline of our System

Dataset Snapshot

Demo Video

An IRE system that uses basic NLP techniques to extract information from a collection of ancient documents in Indic languages.
Documents are scanned, segmented with a line-level segmentation algorithm, and processed by the Indic-OCR; the recognized text is then combined to form this IRE system.
Languages supported: Hindi, Tamil, Telugu, Sanskrit, Bangla
Links to -
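
As a rough illustration of the retrieval step described above, the sketch below builds a small inverted index over OCR output at the line level and answers keyword queries. It is a minimal example under stated assumptions, not the system's actual implementation: the names `tokenize`, `build_index`, `search`, and the `(document, line)` keys are hypothetical, the sample lines are made up, and tokenization is a simple whitespace split with punctuation (including the danda) trimmed from token edges.

```python
import unicodedata
from collections import defaultdict

# Punctuation trimmed from token edges; includes the danda and double danda.
PUNCT = ".,;:!?()[]\"'-।॥"


def tokenize(text):
    """Split OCR text on whitespace and trim edge punctuation.

    NFC normalization makes visually identical strings compare equal.
    Whitespace splitting keeps vowel signs attached to their words.
    """
    text = unicodedata.normalize("NFC", text)
    tokens = []
    for raw in text.split():
        token = raw.strip(PUNCT)
        if token:
            tokens.append(token)
    return tokens


def build_index(ocr_lines):
    """Map each token to the set of (document, line) locations containing it."""
    index = defaultdict(set)
    for location, text in ocr_lines.items():
        for token in tokenize(text):
            index[token].add(location)
    return index


def search(index, query):
    """Return sorted locations that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)


# Hypothetical OCR output keyed by (document id, line number).
ocr_lines = {
    ("doc1", 1): "राम वन गए ।",
    ("doc1", 2): "सीता वन में रहीं ।",
    ("doc2", 1): "தமிழ் நூல்",
}
index = build_index(ocr_lines)
matches = search(index, "राम वन")
```

Keeping line-level locations in the index means a hit can point back to the segmented line image it came from, which fits a pipeline that segments pages into lines before OCR.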

People involved: Riya Gupta*, Siddhant Bansal*, C.V. Jawahar

Dataset Statistics

Word Level Segmentations

Future Work:

Improving OCR accuracy through image preprocessing, fine-tuning, and self-training.
Generating word-level segmentations on the large dataset without manual supervision and extracting the resulting visualizations from them.
Enhancing the IRE system semantically and contextually.
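
To make the idea of word-level segmentation without manual supervision more concrete, here is a minimal sketch of a classical baseline: splitting a binarized line image into words wherever a run of empty columns appears in the vertical projection profile. This is an assumption for illustration only and is not the method planned or used in this project; the function name `segment_words` and the `min_gap` parameter are hypothetical.

```python
def segment_words(line_img, min_gap=2):
    """Split a binarized text line into word column ranges.

    line_img: list of rows, each a list of 0/1 values (1 = ink pixel).
    min_gap:  number of consecutive empty columns that separates two words.
    Returns a list of (start_col, end_col) pairs, end exclusive.
    """
    if not line_img:
        return []
    width = len(line_img[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[c] for row in line_img) for c in range(width)]

    words = []
    start = None  # first column of the word currently being tracked
    gap = 0       # length of the current run of empty columns
    for c, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = c
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The word ended just before this run of empty columns.
                words.append((start, c - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last word, dropping any trailing empty columns.
        words.append((start, width - gap))
    return words


# Tiny synthetic line: two ink blobs separated by three empty columns.
line = [
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
]
boxes = segment_words(line, min_gap=2)
```

A gap threshold is the main design choice here: a single empty column is treated as spacing inside a word, while a longer run marks a word boundary. On real scans the threshold would need tuning per script and resolution.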

Word Segmentation

Model