Data Transformation Tool
A flexible tool for collecting archival data.
Presentation Outline:
Problem:
Many archival records are digitized (in PDF or PNG format), but the digitized text must still be transformed into meaningful data.
Problem:
OCR information:
Text, x0, y0, x1, y1, block no., line no., word no.
Need:
Column and row of each piece of text
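The gap between what OCR provides and what is needed can be pictured as a single record type. The sketch below is illustrative only: the class name `OcrWord`, the field types, and the sample values are assumptions, not part of the tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrWord:
    """One OCR'd word (hypothetical record type, for illustration only)."""
    # What OCR provides: the text and its bounding box on the page.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    # What OCR provides: reading-order indices.
    block_no: int
    line_no: int
    word_no: int
    # What is needed: table position, unknown until detection has run.
    column: Optional[int] = None
    row: Optional[int] = None


# Made-up sample values; the table position starts out unknown.
word = OcrWord("Rampur", 72.0, 100.0, 118.5, 112.0, block_no=0, line_no=3, word_no=0)
print(word.column, word.row)  # None None
```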
Problem:
This transformation must happen at a large scale (e.g. the 1951 Census contains 285 District Handbooks * ~38 full pgs per District * ~50 villages per pg = ~541,500 villages, and 26 to 32 data pts per village, so ~14 to 17 million data pts).
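The scale estimate can be checked with a few lines of arithmetic. The variable names are illustrative, and the inputs are the approximate counts quoted on the slide.

```python
# Approximate counts quoted on the slide.
districts = 285
pages_per_district = 38
villages_per_page = 50
points_per_village = (26, 32)  # low and high end of the quoted range

villages = districts * pages_per_district * villages_per_page
low = villages * points_per_village[0]
high = villages * points_per_village[1]

print(f"{villages:,} villages")               # 541,500 villages
print(f"{low:,} to {high:,} data points")     # 14,079,000 to 17,328,000 data points
```

The 26-point lower bound gives the ~14 million figure; the 32-point upper bound gives about 17 million.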
Problem:
NARA has committed to digitizing 500 million pages of records and making them available online to the public through the National Archives Catalog by October 1, 2024. Indian Railways will digitize 25 million pages of its records. This dramatic increase in digitized records makes the problem even more of a priority.
Use Case: 1951 Census Data Reclamation
Digitized text → meaningful data.
Extracting village-level data from the 1951 Census to create a baseline test for the Canals paper, which studies the long-run effects of India’s canal network.
Use Case: 1951 Census Data Reclamation
Highlights:
Room for Growth:
Data Collection Tool:
Layout Detection
Column Detection
Row Detection
Digitized text → meaningful data.
The tool works effectively across a range of layouts, making it feasible for large-scale projects.
Each component is customizable, which improves the accuracy of the collected data and expands the number of potential use cases.
PDF or PNG
Cropped PDF or PNG
Text assigned to columns
Text assigned to rows, columns
Layout Detection:
Tool: Layout Parser
Layout Detection:
Column Detection:
Tool: Scikit-Learn
Input
Output
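One way to picture column detection is as one-dimensional clustering of word x-coordinates: words whose x0 values sit close together share a column. The sketch below is a plain-Python stand-in for illustration, not the tool's implementation; the function name `cluster_columns`, its parameters, and the sample coordinates are assumptions. In recent versions of scikit-learn, `AgglomerativeClustering` exposes comparable settings (`linkage`, `metric`, `distance_threshold`), though whether the tool uses that class is also an assumption.

```python
from itertools import combinations


def cluster_columns(xs, threshold, linkage="average"):
    """Group x-coordinates into columns by 1-D agglomerative clustering.

    Returns one column label per input coordinate, numbered left to right.
    """
    combine = {
        "min": min,                              # closest pair of members
        "max": max,                              # farthest pair of members
        "average": lambda d: sum(d) / len(d),    # mean over all pairs
    }[linkage]
    clusters = [[i] for i in range(len(xs))]     # start with one cluster per word
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dists = [abs(xs[i] - xs[j]) for i in clusters[a] for j in clusters[b]]
            d = combine(dists)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d >= threshold:                       # remaining clusters are distinct columns
            break
        clusters[a] += clusters[b]               # a < b, so index a stays valid
        del clusters[b]
    clusters.sort(key=lambda members: min(xs[i] for i in members))
    labels = [0] * len(xs)
    for col, members in enumerate(clusters):
        for i in members:
            labels[i] = col
    return labels


# Made-up x0 values for seven words in three visually distinct columns.
x0_values = [72.0, 73.5, 71.8, 200.2, 201.0, 330.4, 329.9]
print(cluster_columns(x0_values, threshold=20.0))  # [0, 0, 0, 1, 1, 2, 2]
```

In one dimension the Manhattan and Euclidean distances coincide, so the metric choice only matters when clustering on more than one coordinate.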
Row Detection:
Tool: Ali’s OCR tooling with Bayes
Row Detection:
Weight @ 1
Weight @ .1
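The effect of the weight can be illustrated with a toy model. The sketch below does not reproduce the actual tooling: it assumes rows are seeded by the vertical extents of key-column entries, that words are visited left to right, and that a plain weighted average stands in for the Bayesian update. The function name `assign_rows` and all coordinates are made up.

```python
def assign_rows(key_extents, words, weight):
    """Assign words to rows seeded by key-column entries.

    key_extents: list of (y0, y1), one per row, taken from the key column.
    words: list of (x0, y0, y1) for the remaining words.
    weight: share given to each new word when updating a row's extent.
    Returns one row index per word, in input order.
    """
    extents = [[y0, y1] for y0, y1 in key_extents]
    rows = [None] * len(words)
    # Visit words left to right so each row's extent follows the text across the page.
    for i in sorted(range(len(words)), key=lambda k: words[k][0]):
        _, y0, y1 = words[i]
        mid = (y0 + y1) / 2
        # Pick the row whose current centre is closest to the word's centre.
        r = min(range(len(extents)),
                key=lambda k: abs((extents[k][0] + extents[k][1]) / 2 - mid))
        rows[i] = r
        # Blend the word's extent into the row's extent.
        extents[r][0] = (1 - weight) * extents[r][0] + weight * y0
        extents[r][1] = (1 - weight) * extents[r][1] + weight * y1
    return rows


# Two rows on a slightly skewed scan: text drifts 4 units down per column.
keys = [(100.0, 110.0), (120.0, 130.0)]
skewed = [(100.0, 104.0, 114.0), (100.0, 124.0, 134.0),
          (200.0, 108.0, 118.0), (200.0, 128.0, 138.0),
          (300.0, 112.0, 122.0), (300.0, 132.0, 142.0),
          (400.0, 116.0, 126.0), (400.0, 136.0, 146.0)]

print(assign_rows(keys, skewed, weight=1.0))  # [0, 1, 0, 1, 0, 1, 0, 1]
print(assign_rows(keys, skewed, weight=0.1))  # [0, 1, 0, 1, 1, 1, 1, 1]
```

In this toy example, a weight of 1 lets each row follow the drift, while a weight of 0.1 updates too slowly and merges the right-hand words into one row. On noisy but unskewed pages the trade-off could run the other way, since a low weight is less sensitive to a single misplaced word.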
Concept: Mix and Match Design
3 Key Components:
Layout Detection
- ID different layout objects (e.g. table, list, text region) using existing Layout Parser models
- Customize Layout Parser models to ID additional layout objects (e.g. a specific portion of a page)
Column Detection
- Data point (e.g. x0 or x1 coordinate of OCR’d text) used for column ID
- Distance to use between observations (e.g. average, min, max)
- Distance metric (e.g. Manhattan, Euclidean)
- Distance threshold for distinct columns
Row Detection
- Key indicating rows and their extent
- Weight given to new information when updating row extent
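The mix-and-match idea can be sketched as three interchangeable functions composed into one pipeline. Everything here is hypothetical: `build_pipeline` and the toy components are placeholders for the real layout, column, and row detectors, and the sample words are made up.

```python
from typing import Callable, List, Tuple

Word = Tuple[str, float, float, float, float]  # text, x0, y0, x1, y1


def build_pipeline(detect_layout: Callable[[List[Word]], List[Word]],
                   detect_columns: Callable[[List[Word]], List[int]],
                   detect_rows: Callable[[List[Word]], List[int]]):
    """Compose three swappable components into a single callable."""
    def run(words: List[Word]) -> List[Tuple[str, int, int]]:
        region = detect_layout(words)    # keep only the words inside the table
        cols = detect_columns(region)    # one column index per kept word
        rows = detect_rows(region)       # one row index per kept word
        return [(w[0], r, c) for w, r, c in zip(region, rows, cols)]
    return run


# Toy stand-ins; a real setup would swap in model- or clustering-based versions.
def keep_below(top):
    return lambda words: [w for w in words if w[2] >= top]

def bucket_columns(width):
    return lambda words: [int(w[1] // width) for w in words]

def bucket_rows(top, height):
    return lambda words: [int((w[2] - top) // height) for w in words]


words = [("Page 12", 72.0, 40.0, 120.0, 52.0),   # page header, outside the table
         ("Rampur", 72.0, 100.0, 118.0, 112.0),
         ("412", 200.0, 100.0, 220.0, 112.0),
         ("Sitapur", 72.0, 120.0, 125.0, 132.0),
         ("380", 200.0, 121.0, 220.0, 133.0)]

pipeline = build_pipeline(keep_below(90.0), bucket_columns(150.0), bucket_rows(100.0, 20.0))
table = pipeline(words)
print(table)  # [('Rampur', 0, 0), ('412', 0, 1), ('Sitapur', 1, 0), ('380', 1, 1)]
```

Because each stage only agrees on its input and output, a component can be replaced or re-tuned for a new document layout without touching the other two.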
Potential Next Steps: