Beer and pdftools - a vignette

A guide to extracting tables from many PDFs using the pdftools package

Thomas Mock

Table of Contents

Scraping Complex Tables from PDFs with PDF Tools

The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.

Load Libraries

We’ll use ROpenSci’s pdftools package along with several tidyverse packages: - stringr - text manipulation - dplyr - general data manipulation - tidyr - data cleaning - purrr - repeated application of a function



The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.

# General function for download
download_monthly_stats_pdf <- function(year){
  message(paste0("Downloading ", year))
  # The general format is yearmonth like 201101 for Jan 2011.
  month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
  year_vec <- rep(year, 12)
  url_build <- function(year_vec, month_in){
  # output to the pdfs folder
  download_monthly_pdf <- function(year, month, url_in){
      url = url_in,
      destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
  # build all the input urls and attach to an input dataframe
  full_df <- tibble(year = year_vec, month = month_in) %>% 
    mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build)) 
  # The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
  pwalk(full_df, .f = download_monthly_pdf)

We could apply that function to all the years of interest with another purrr::walk() call. This will run download_monthly_stats_pdf() for 2011, 2012, 2013, and 2014.

walk(c(2011:2014), download_monthly_stats_pdf)

List File Names

# list all the files we have downloaded so far
all_files <- list.files("pdfs")


[1] 48

We have 48 PDFs, as expected - 12 months x 4 years = 48!

Now let’s take a peek inside one of the PDFs.

Raw PDFs

When we run pdftools::pdf_text() we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between “columns” in the table. Importantly, each line of the PDF is separated by a newline \n. This is key to our strategy for pulling out individual lines.


[1] "                                                                                                                  Report Date:\n                                              DEPARTMENT OF THE TREASURY                                          30-MAR-2011\n                                    ALCOHOL AND TOBACCO TAX AND TRADE BUREAU\n                                                                                                                  Report Symbol:\n                                                    STATISTICAL REPORT - BEER                                     TTB S 5130-01-2011\n                                                       Reporting Period: January 2011                             Page: 1 of 1\n                                                                                                   Current Year         Prior Year\n                                                                                    Prior Year      Cumulative         Cumulative\nMANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date\nProduction                                                    14,981,472               15,012,331     14,981,472