Beer and pdftools - a vignette

A guide to extracting tables from many PDFs using the pdftools package

Thomas Mock https://themockup.netlify.com/
2020-04-04

Table of Contents


Scraping Complex Tables from PDFs with PDF Tools

The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.

Load Libraries

We’ll use ROpenSci’s pdftools package along with several tidyverse packages: - stringr - text manipulation - dplyr - general data manipulation - tidyr - data cleaning - purrr - repeated application of a function


library(tidyverse)
library(pdftools)

PDFs

The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.


# General function for download
download_monthly_stats_pdf <- function(year){
  
  message(paste0("Downloading ", year))
  
  # The general format is yearmonth like 201101 for Jan 2011.
  month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
  
  year_vec <- rep(year, 12)
  
  url_build <- function(year_vec, month_in){
      glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
  }
  
  # output to the pdfs folder
  download_monthly_pdf <- function(year, month, url_in){
    download.file(
      url = url_in,
      destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
      )
  }
  
  # build all the input urls and attach to an input dataframe
  full_df <- tibble(year = year_vec, month = month_in) %>% 
    mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build)) 
  
  # The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
  pwalk(full_df, .f = download_monthly_pdf)
  
}

We could apply that function to all the years of interest with another purrr::walk() call. This will run download_monthly_stats_pdf() for 2011, 2012, 2013, and 2014.


walk(c(2011:2014), download_monthly_stats_pdf)

List File Names


# list all the files we have downloaded so far
all_files <- list.files("pdfs")

length(all_files)

[1] 48

We have 48 PDFs, as expected - 12 months x 4 years = 48!

Now let’s take a peek inside one of the PDFs.