1 of 9

Possible Data Scraping Methods

By Selena Shew

January 16, 2023

2 of 9

Background Information

Overall Goal: To scrape this book for data tables on seed germination

Book Details:

  • 1,241 pages total
  • Compilation of work by many different researchers, so the formatting of the data tables is inconsistent

Data Tables to Scrape:

  • Phenology of flowering and fruiting
  • Germination test conditions and results
  • Height, seed-bearing age, and seed crop frequency (potentially)

3 of 9

Main Game Plan

  • Convert PDF version of the book to HTML

  • Scrape the HTML version for the data tables

  • Export data tables to Excel

  • Clean the data
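The last two steps of the plan (export and clean) can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a settled pipeline: the sample rows are invented for illustration, the function names are hypothetical, and CSV is used here as a stand-in for Excel because Excel opens it directly.

```python
import csv
import io
import re


def clean_cell(text):
    """Collapse runs of whitespace (including line breaks) and trim one cell."""
    return re.sub(r"\s+", " ", text).strip()


def clean_rows(rows):
    """Clean every cell, then drop rows that end up completely empty."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    return [row for row in cleaned if any(row)]


def rows_to_csv(rows):
    """Serialize rows to CSV text, a format Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped rows, invented for illustration only.
raw = [
    ["Species ", "Germination\n(%)"],
    ["", "  "],
    ["  Example species", " 85 "],
]
print(rows_to_csv(clean_rows(raw)))
```

Cleaning before export keeps the spreadsheet free of stray line breaks and blank rows, which are common after a PDF-to-HTML conversion.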

4 of 9

Option 1: The Old-Fashioned Way (AKA Manual Labour)

How this works:

  • We make a Google spreadsheet and give everyone access to it
  • Divide up the 1,241 pages equally amongst everyone in the lab
  • Everyone goes through their allotted sections and manually enters the data into the Google spreadsheet
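Dividing the pages equally is simple arithmetic, sketched below in Python. The lab size of 6 is an assumed number for illustration, and `split_pages` is a hypothetical helper name; the function hands out contiguous page ranges whose lengths differ by at most one page.

```python
def split_pages(total_pages, people):
    """Return one inclusive (first, last) page range per person."""
    if people < 1:
        raise ValueError("need at least one person")
    base, extra = divmod(total_pages, people)
    ranges = []
    start = 1
    for i in range(people):
        # The first `extra` people each take one additional page.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


# 6 people is an assumption; 1,241 is the page count of the book.
for person, (first, last) in enumerate(split_pages(1241, 6), start=1):
    print(f"Person {person}: pages {first}-{last}")
```

With 6 people, five of them get 207 pages each and the sixth gets 206.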

5 of 9

Option 2: Tabulizer Package

Issues

  • Depends on Java (it runs on the tabula-java library)
    • Does anyone in the lab know how to code in Java?
  • Needs the rJava package to provide R-to-Java bindings
  • Requires a LOT of data wrangling

Example Code & Output

  • tabulizer: an R package that can extract tables from a PDF document

6 of 9

Option 3: Pdftools Package

Issues

  • GitHub documentation is incomplete (had to hunt through Stack Overflow to figure out how to set it up on Windows)
  • Depends on rpoppler and a local Poppler installation (requires setting up a new environment)
  • Struggles to identify tables vs. text

Example Code & Output

  • pdftools: extracts text and metadata from PDF files in R
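The "tables vs. text" issue can be illustrated with a rough heuristic applied to already-extracted page text. The sketch below is in Python rather than R and is not part of pdftools: it assumes that table columns are separated by two or more spaces (or a tab) and that a table row has at least three columns. The sample text is invented for illustration.

```python
import re

# Assumed heuristic: two or more whitespace characters, or a tab, mark a column gap.
COLUMN_GAP = re.compile(r"\s{2,}|\t")


def split_columns(line):
    """Split one line of extracted text into candidate table cells."""
    return [part for part in COLUMN_GAP.split(line.strip()) if part]


def looks_like_table_row(line, min_columns=3):
    """Guess whether a line belongs to a table rather than running prose."""
    return len(split_columns(line)) >= min_columns


# Invented page text: one prose sentence followed by two table-like rows.
page_text = """Seeds were collected in autumn and stored dry.
Species A     20/30     28     85
Species B     15/25     21     60
"""

rows = [
    split_columns(line)
    for line in page_text.splitlines()
    if looks_like_table_row(line)
]
print(rows)
```

A heuristic like this breaks down on wrapped cells and justified prose, which is one reason text-only extraction needs so much manual checking.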

7 of 9

Option 4: Python Methods (Tabula, Camelot)

Tabula

Camelot

  • tabula (the tabula-py package): a simple Python wrapper for tabula-java that reads tables from a PDF
  • camelot: Python library that helps to extract tables from PDF files
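A minimal sketch of how the two libraries could be called on a page range is shown below. It assumes the tabula-py and camelot-py packages are installed (tabula-py also needs Java); the wrapper function names and the file name `book.pdf` in the usage note are hypothetical. The imports sit inside the functions so the helper can be used even when only one of the libraries is available.

```python
def pages_arg(first, last):
    """Build a page-range string such as "10-12", a form both libraries accept."""
    if first < 1 or last < first:
        raise ValueError("invalid page range")
    return f"{first}-{last}"


def tables_with_tabula(pdf_path, first, last):
    """Return a list of pandas DataFrames, one per table tabula-py detects."""
    import tabula  # third-party: pip install tabula-py (requires Java)

    return tabula.read_pdf(
        pdf_path, pages=pages_arg(first, last), multiple_tables=True
    )


def tables_with_camelot(pdf_path, first, last, flavor="lattice"):
    """Return a list of pandas DataFrames, one per table camelot detects."""
    import camelot  # third-party: pip install camelot-py

    found = camelot.read_pdf(
        pdf_path, pages=pages_arg(first, last), flavor=flavor
    )
    return [table.df for table in found]
```

For example, `tables_with_camelot("book.pdf", 10, 12, flavor="stream")` would try pages 10 to 12. In camelot, the "lattice" flavor targets tables with ruled lines, while "stream" targets tables whose columns are separated only by whitespace, so a book with mixed formatting may need both.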

8 of 9

Option 5: Rvest Package

Issues

  • Since the HTML formatting varies from table to table, we’ll need to manually determine the relevant CSS selector for each table in order to scrape it

Example Code & Output

  • rvest: helps you scrape (or harvest) data from web pages in R
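The per-table selector idea can be sketched in Python with the standard library's `html.parser`, as a stand-in for rvest. Everything specific here is an assumption: the class names `t3` and `t7`, the table names, and the sample HTML are invented, and the parser only matches tables by class and does not handle nested tables.

```python
from html.parser import HTMLParser


class TableByClass(HTMLParser):
    """Collect the rows of every <table> whose class attribute contains a name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.rows = []
        self._in_table = False
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self._in_table = self.class_name in classes
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape(html, selectors):
    """Run one parser per table name and return {name: rows}."""
    result = {}
    for name, class_name in selectors.items():
        parser = TableByClass(class_name)
        parser.feed(html)
        result[name] = parser.rows
    return result


# Hypothetical mapping, filled in by hand after inspecting the converted HTML.
SELECTORS = {"phenology": "t3", "germination": "t7"}

# Invented HTML for illustration only.
html = """
<table class="t3"><tr><th>Species</th><th>Flowering</th></tr>
<tr><td>Example species</td><td>April-May</td></tr></table>
<table class="t7 wide"><tr><th>Species</th><th>Temp</th></tr></table>
"""
tables = scrape(html, SELECTORS)
print(tables["phenology"])
```

Keeping the selectors in one mapping means the manual inspection work is recorded in a single place and can be corrected without touching the parsing code.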

9 of 9

THANK YOU FOR LISTENING

We would greatly appreciate your input & thoughts!