Possible Data Scraping Methods

By Selena Shew

January 16, 2023

Background Information

Overall Goal: To scrape this book for data tables on seed germination

Book Details:

1,241 pages total
Compiled from the work of many different researchers, so the formatting of the data tables is inconsistent

Data Tables to Scrape:

Main Game Plan

Option 1: The Old-Fashioned Way (AKA Manual Labour)

How this works:

We make a Google spreadsheet and give everyone access to it
Divide up the 1,241 pages equally amongst everyone in the lab
Everyone goes through their allotted sections and manually enters the data into the Google spreadsheet
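
The page split described above can be sketched in a few lines of Python. This is only an illustration: the function name split_pages and the lab size of 6 are assumptions, not details from the plan itself.

```python
def split_pages(total_pages, n_people):
    """Return one (first_page, last_page) range per person, as evenly as possible."""
    if n_people < 1:
        raise ValueError("n_people must be at least 1")
    # Everyone gets `base` pages; the first `extra` people get one more.
    base, extra = divmod(total_pages, n_people)
    ranges = []
    start = 1
    for i in range(n_people):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # 6 is a hypothetical lab size; replace it with the real head count.
    for person, (first, last) in enumerate(split_pages(1241, 6), start=1):
        print(f"Person {person}: pages {first}-{last}")
```

The resulting ranges could be pasted into the first rows of the shared spreadsheet so each person knows which section is theirs.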

Option 2: Tabulizer Package

Issues

Example Code & Output

Option 3: Pdftools Package

Issues

GitHub documentation is incomplete (had to hunt through Stack Overflow to figure out how to set it up on Windows)
Depends on rpoppler and on Poppler being installed on the local machine (requires setting up a new environment)
Struggles to identify tables vs. text
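
To illustrate the table vs. text problem, here is a rough Python sketch of one way to flag table-like lines in text that has already been extracted from a PDF. The heuristic (several cells separated by wide gaps, at least one of them numeric) is an assumption of this sketch and is not how pdftools works; the function name and the sample lines are made up for illustration and are not taken from the book.

```python
import re


def looks_like_table_row(line, min_columns=3):
    """Guess whether a line of extracted text is a table row.

    Assumed heuristic: a table row has at least `min_columns` cells
    separated by runs of two or more whitespace characters, and at
    least one cell is a plain number (optionally followed by %).
    """
    cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
    if len(cells) < min_columns:
        return False
    return any(re.fullmatch(r"[-+]?\d+(\.\d+)?%?", c) for c in cells)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    samples = [
        "Example species    5    90    30",
        "Seeds were stratified for 30 days before sowing.",
    ]
    for text in samples:
        print(looks_like_table_row(text), "|", text)
```

A rule like this will miss header rows that contain no numbers and can misfire on prose with unusual spacing, so it would only narrow down the pages that need manual checking.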

Example Code & Output

Option 4: Python Methods (Tabula, Camelot)

Tabula

Camelot

Option 5: Rvest Package

Issues

Since the HTML formatting varies from table to table, we’ll need to manually determine the CSS selector for each table before we can scrape it
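
The per-table lookup can be sketched in Python using only the standard library. This is an analogue of the idea, not the rvest workflow itself: the standard-library parser has no CSS selector support, so the sketch matches on a table's id attribute instead. In rvest, html_element() with a CSS selector followed by html_table() would play this role. The class name, function name, table id and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TableById(HTMLParser):
    """Collect the cell text of the <table> whose id equals table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []
        self._depth = 0        # 1 inside the target table, >1 inside a nested table
        self._in_cell = False
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._depth:
                self._depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self._depth = 1
        elif self._depth == 1:
            if tag == "tr":
                self.rows.append([])
            elif tag in ("td", "th"):
                self._in_cell = True
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th") and self._in_cell:
            if not self.rows:          # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def scrape_table(html, table_id):
    """Return the rows of the table with the given id, or [] if it is absent."""
    parser = TableById(table_id)
    parser.feed(html)
    return parser.rows


# Made-up HTML, for illustration only.
SAMPLE_HTML = """
<p>Intro text.</p>
<table id="germination-a">
  <tr><th>Species</th><th>Germination (%)</th></tr>
  <tr><td>Example species</td><td>90</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML, "germination-a"):
        print(row)
```

Because each table needs its own identifier, a small lookup list of identifiers, filled in by hand, would drive repeated calls to a function like scrape_table.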

Example Code & Output

THANK YOU FOR LISTENING

We would greatly appreciate your input & thoughts!