Retrieval from Large Document Image Collections

People involved: Riya Gupta, C.V. Jawahar; Dataset & Server Handlers: Varun Bhargavan, Aradhana Vinod

Pipeline of our System

Dataset Snapshot

Demo Video

  • An IRE system that uses basic NLP techniques to extract information from collections of ancient documents in Indic languages.
  • Documents are scanned, segmented with a line-level segmentation algorithm, and processed by the Indic-OCR; the recognized text is then combined to form this IRE system.
  • Languages supported: Hindi, Tamil, Telugu, Sanskrit, Bangla
  • Links:
    • Code
    • Project website
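The line-level segmentation step mentioned above can be illustrated with a horizontal projection profile, a common baseline for splitting a scanned page into text lines. The slides do not say which algorithm the system actually uses, so this is only a minimal sketch under that assumption. The function name `segment_lines`, the `min_ink` parameter, and the toy page are hypothetical, and the page is assumed to be already binarized (1 for ink, 0 for background).

```python
from typing import List, Tuple


def segment_lines(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized page into text lines with a horizontal projection profile.

    `page` is a 2D list of pixels (1 = ink, 0 = background).
    Returns (top, bottom) row ranges, with `bottom` exclusive.
    This is an illustrative baseline, not the system's actual algorithm.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being tracked
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink  # projection value for this row
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(page)))  # line touching the bottom edge
    return lines


# Toy 5x4 page with two text lines separated by one blank row.
toy_page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(toy_page))  # [(1, 3), (4, 5)]
```

Each returned row range could then be cropped and passed to the OCR stage. Real scans would typically need deskewing and noise removal first, which this sketch leaves out.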


People involved: Riya Gupta*, Siddhant Bansal*, C.V. Jawahar

Dataset Statistics

Word Level Segmentations

Future Work:

  • Improving OCR accuracy through image preprocessing, fine-tuning, and self-training.
  • Generating word-level segmentations on the large dataset without manual supervision, and using them to produce the resulting visualizations.
  • Enhancing the IRE system semantically and contextually.

  • Dataset size: >1.5 million document images
  • Collaborators: NDLI (IIT KGP) & British Library
  • Current Features:
    • Query Level Search
    • Language Level Search
    • Relevance and Visualization
    • Filtering
    • Transliteration
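As a rough illustration of how query-level search, language-level filtering, and relevance ranking might fit together over OCR output, the sketch below builds a small inverted index and scores matches with TF-IDF. This is not the system's implementation: the `Doc` and `SearchIndex` names, the whitespace tokenization, the TF-IDF scoring, and the sample texts are all assumptions made for the example, and transliteration is not covered.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Doc:
    doc_id: str
    language: str
    text: str  # OCR output for one document image (hypothetical sample data)


class SearchIndex:
    """Toy inverted index with TF-IDF ranking and an optional language filter."""

    def __init__(self, docs: List[Doc]) -> None:
        self.docs: Dict[str, Doc] = {d.doc_id: d for d in docs}
        # term -> {doc_id: term frequency}
        self.postings: Dict[str, Dict[str, int]] = defaultdict(dict)
        for d in docs:
            # Whitespace tokenization is an assumption made for brevity.
            for term, tf in Counter(d.text.split()).items():
                self.postings[term][d.doc_id] = tf

    def search(self, query: str, language: Optional[str] = None) -> List[Tuple[str, float]]:
        """Return (doc_id, score) pairs, best match first."""
        n_docs = len(self.docs)
        scores: Dict[str, float] = defaultdict(float)
        for term in query.split():
            hits = self.postings.get(term, {})
            if not hits:
                continue  # term does not occur in any document
            idf = math.log(1 + n_docs / len(hits))
            for doc_id, tf in hits.items():
                if language and self.docs[doc_id].language != language:
                    continue  # language-level filter
                scores[doc_id] += tf * idf
        # Sort by descending score, breaking ties by document id.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = SearchIndex([
    Doc("h1", "Hindi", "राम वन गया"),
    Doc("h2", "Hindi", "राम राम सीता"),
    Doc("t1", "Tamil", "ராமன் காடு"),
])
print([doc_id for doc_id, _ in index.search("राम")])  # ['h2', 'h1']
print(index.search("राम", language="Tamil"))  # []
```

A production system over more than a million pages would more likely rely on a dedicated search engine and language-aware tokenization. The point here is only how a language filter and a relevance score can share one index.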

Word Segmentation

Model