1 of 20

Data Transformation Tool

A flexible tool for collecting archival data.

2 of 20

Presentation Outline:

  1. Problem statement
  2. Use case: 1951 Census data reclamation
  3. Overview of the tool
  4. Deep dive into the three tool components: function and customizability
  5. Potential next steps

3 of 20

Problem:

Many archival records are digitized (in PDF or PNG format), but the digitized text must still be transformed into meaningful data.

4 of 20

Problem:

OCR information (per word):

text, x0, y0, x1, y1,

block no., line no., word no.

Need:

the column and row of each piece of text
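For concreteness, here is a made-up example of one OCR record in this shape, plus the fields the tool must infer (all names and values are illustrative only):

    # One OCR'd word: text plus bounding-box coordinates and OCR indices.
    word = {
        "text": "Khandala",              # made-up village name
        "x0": 112.0, "y0": 640.5,        # top-left corner of bounding box
        "x1": 171.4, "y1": 652.0,        # bottom-right corner of bounding box
        "block_no": 3, "line_no": 7, "word_no": 1,
    }

    # What OCR does not provide, and what the tool must infer:
    needed = {"col": 2, "row": 14}       # the table cell this word belongs to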

6 of 20

Problem:

This transformation must happen at a large scale. For example, the 1951 Census contains 285 District Handbooks × ~38 full pages per District × ~50 villages per page ≈ 541,500 villages, with 26 to 32 data points per village, i.e. ~14 million data points.

7 of 20

Problem:

8 of 20

Problem:

NARA has committed to digitizing 500 million pages of records and making them available online to the public through the National Archives Catalog by October 1, 2024. Indian Railways will digitize 25 million pages of its records. This dramatic increase in digitized records makes the problem even more of a priority.

9 of 20

Use Case: 1951 Census Data Reclamation

Digitized text → meaningful data.

Extracting village Census data from the 1951 Census to create a baseline test for the Canals paper, which studies the long-run effects of India's canal network.

10 of 20

11 of 20

12 of 20

Use Case: 1951 Census Data Reclamation

Highlights:

  • Collected data for ~75k villages in 48 Districts in the states of MH and MP; matched ~12k of these villages (now that we have tehsils, we anticipate ~24k) (thanks Kritarth!!)
  • Effective at collecting data across a range of layout types
  • Graphing population data reveals mostly expected values, with some extreme outliers

Room for Growth:

  • Variable assignment is thrown off by extraneous columns
  • Could scrape more data (e.g. tehsil names had to be manually entered) (thanks JP and Kishan!!)

13 of 20

Data Collection Tool:

Digitized text → meaningful data.

Three components, each transforming the data one step further:

PDF or PNG → Layout Detection → cropped PDF or PNG → Column Detection → text assigned to columns → Row Detection → text assigned to rows and columns

The tool works effectively across a range of layouts, making it feasible for large-scale projects.

Each component is customizable, which further improves the accuracy of the collected data and expands the number of potential use cases.
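The pipeline's end product is word-level data tagged with row and column indices, which pivots directly into a table. A minimal pandas illustration with made-up values:

    import pandas as pd

    # Made-up pipeline output: each OCR'd word tagged with row/col.
    words = pd.DataFrame({
        "text": ["Khandala", "312", "57", "Lonavala", "1204", "233"],
        "row":  [0, 0, 0, 1, 1, 1],
        "col":  [0, 1, 2, 0, 1, 2],
    })

    # Pivot into the final table: one row per village, one column per variable.
    table = words.pivot(index="row", columns="col", values="text")
    print(table)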

14 of 20

Layout Detection:

Tool: Layout Parser

  • A wrapper that offers easy access to deep learning models for Document Image Analysis
  • Pre-trained models and custom models (training set size varies; TableBank is trained on ~400,000 labeled documents)
  • All models interface with detectron2, Facebook’s object detection library
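As a sketch of this step, a pre-trained TableBank model can be loaded and run through Layout Parser roughly as follows (the 0.8 score threshold is an illustrative choice, not a default):

    import layoutparser as lp
    import cv2

    # Load a page image; the models expect RGB while OpenCV loads BGR.
    image = cv2.imread("page.png")[..., ::-1]

    # Pre-trained TableBank table-detection model via Layout Parser.
    model = lp.Detectron2LayoutModel(
        "lp://TableBank/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
        label_map={0: "Table"},
    )

    layout = model.detect(image)
    tables = [block for block in layout if block.type == "Table"]
    # Each detected block carries coordinates used to crop the page.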

15 of 20

Layout Detection:

16 of 20

Column Detection:

Tool: scikit-learn

  1. Use sklearn's AgglomerativeClustering to cluster the x-coordinates from the input dataframe.
  2. Sort the resulting clusters by x-coordinate.
  3. For each cluster in the sorted list, get the indices of the x-coordinates in that cluster; at the same indices in the input dataframe, write the cluster number into a new 'col' column (see the sketch below).
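A minimal sketch of these three steps, assuming a word-level dataframe with an x0 column; the threshold, metric, and linkage shown are illustrative choices, and scikit-learn >= 1.2 is assumed for the metric argument:

    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering

    def assign_columns(df, coord="x0", threshold=50.0):
        """Add a 'col' column assigning each word to a table column."""
        X = df[[coord]].to_numpy()

        # Step 1: cluster x-coordinates; n_clusters=None lets the
        # distance threshold decide how many columns there are.
        labels = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=threshold,
            metric="manhattan",
            linkage="average",
        ).fit_predict(X)

        # Step 2: order clusters left to right by mean x-coordinate.
        order = pd.Series(X.ravel()).groupby(labels).mean().sort_values().index

        # Step 3: write the ordered cluster number back into the dataframe.
        remap = {old: new for new, old in enumerate(order)}
        out = df.copy()
        out["col"] = [remap[l] for l in labels]
        return out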

[Figure: input dataframe and output dataframe with 'col' assignments]

17 of 20

Row Detection:

Tool: Ali’s ocr tooling w/ Bayes

  1. From the columns, id the column to use as a key for rows
  2. Create and fill a dataframe with just information about rows
  3. For each piece of text, identify its y coord extent and find the row which matches this extent. When the new text is assigned to the identified row, update the row information (extent, y0, y1, ymid) based on the new text)
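A minimal sketch of steps 2 and 3 under one plausible reading of the update rule (this is not the actual tooling): each row is tracked as a running y-extent, and weight controls how much a newly assigned word moves that extent, matching the weight 1 vs. 0.1 comparison on the next slide:

    import pandas as pd

    def assign_rows(df, weight=0.1, tol=5.0):
        """Add a 'row' column assigning each word to a table row."""
        rows = []    # one dict per detected row: y0, y1, ymid
        labels = []
        ordered = df.sort_values("y0")
        for _, word in ordered.iterrows():
            ymid = (word["y0"] + word["y1"]) / 2
            match = next(
                (i for i, r in enumerate(rows)
                 if r["y0"] - tol <= ymid <= r["y1"] + tol),
                None,
            )
            if match is None:
                # No existing row matches this extent: start a new row.
                rows.append({"y0": word["y0"], "y1": word["y1"], "ymid": ymid})
                labels.append(len(rows) - 1)
            else:
                # Weighted update of the row's extent with the new word;
                # weight=1 means the newest word fully overwrites it.
                r = rows[match]
                r["y0"] = (1 - weight) * r["y0"] + weight * word["y0"]
                r["y1"] = (1 - weight) * r["y1"] + weight * word["y1"]
                r["ymid"] = (r["y0"] + r["y1"]) / 2
                labels.append(match)
        out = ordered.copy()
        out["row"] = labels
        return out.sort_index()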

18 of 20

Row Detection:

[Figure: row assignment with update weight = 1 vs. weight = 0.1]

19 of 20

Concept: Mix and Match Design

3 Key Components and their customizable parameters:

Layout Detection

- Identify different layout objects (e.g. table, list, text region) using existing Layout Parser models
- Customize Layout Parser models to identify additional layout objects (e.g. a specific portion of a page)

Column Detection

- Data point used for column identification (e.g. x0 or x1 coordinate of OCR'd text)
- Distance to use between observations (e.g. avg, min, max)
- Distance metric (e.g. Manhattan, Euclidean)
- Distance threshold for distinct columns

Row Detection

- Key indicating rows and their extent
- Weight given to new information when updating the row extent
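Counting these gives the eight parameters referenced on the next slide. As an illustration only, they might be collected into a single configuration; every name and value below is hypothetical:

    # Hypothetical configuration gathering the eight tunable parameters.
    config = {
        # Layout Detection
        "layout_model": "lp://TableBank/faster_rcnn_R_50_FPN_3x/config",
        "layout_labels": {0: "Table"},   # layout objects to identify
        # Column Detection
        "column_coord": "x0",            # x0 or x1 of OCR'd text
        "column_linkage": "average",     # avg / min / max distance between observations
        "column_metric": "manhattan",    # or "euclidean"
        "column_threshold": 50.0,        # pixels separating distinct columns
        # Row Detection
        "row_key": 0,                    # column used as the key for rows
        "row_update_weight": 0.1,        # weight on new info when updating row extent
    }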

20 of 20

Potential Next Steps:

  • Identify and resolve row component integration issues (update: completed)
  • Further develop methods of validating collected data
  • Automate parameter adjustment for each component (8 parameters total) to create the most accurate extraction tool for a given use case