Beer and pdftools - a vignette

A guide to extracting tables from many PDFs using the pdftools package

Thomas Mock https://themockup.netlify.com/
2020-04-04

Table of Contents


Scraping Complex Tables from PDFs with PDF Tools

The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.

Load Libraries

We’ll use ROpenSci’s pdftools package along with several tidyverse packages: - stringr - text manipulation - dplyr - general data manipulation - tidyr - data cleaning - purrr - repeated application of a function


library(tidyverse)
library(pdftools)

PDFs

The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.


# General function for download
download_monthly_stats_pdf <- function(year){
  
  message(paste0("Downloading ", year))
  
  # The general format is yearmonth like 201101 for Jan 2011.
  month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
  
  year_vec <- rep(year, 12)
  
  url_build <- function(year_vec, month_in){
      glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
  }
  
  # output to the pdfs folder
  download_monthly_pdf <- function(year, month, url_in){
    download.file(
      url = url_in,
      destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
      )
  }
  
  # build all the input urls and attach to an input dataframe
  full_df <- tibble(year = year_vec, month = month_in) %>% 
    mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build)) 
  
  # The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
  pwalk(full_df, .f = download_monthly_pdf)
  
}

We could apply that function to all the years of interest with another purrr::walk() call. This will run download_monthly_stats_pdf() for 2011, 2012, 2013, and 2014.


walk(c(2011:2014), download_monthly_stats_pdf)

List File Names


# list all the files we have downloaded so far
all_files <- list.files("pdfs")

length(all_files)

[1] 48

We have 48 PDFs, as expected - 12 months x 4 years = 48!

Now let’s take a peek inside one of the PDFs.

Raw PDFs

When we run pdftools::pdf_text() we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between “columns” in the table. Importantly, each line of the PDF is separated by a newline \n. This is key to our strategy for pulling out individual lines.


pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf")

[1] "                                                                                                                  Report Date:\n                                              DEPARTMENT OF THE TREASURY                                          30-MAR-2011\n                                    ALCOHOL AND TOBACCO TAX AND TRADE BUREAU\n                                                                                                                  Report Symbol:\n                                                    STATISTICAL REPORT - BEER                                     TTB S 5130-01-2011\n                                                       Reporting Period: January 2011                             Page: 1 of 1\n                                                                                                   Current Year         Prior Year\n                                                                                    Prior Year      Cumulative         Cumulative\nMANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date\nProduction                                                    14,981,472               15,012,331     14,981,472           15,012,331\nRemovals\nTaxable ($7.00/$18.00 per barrel)\n   In bottles and cans                                        11,571,819               11,908,922     11,571,819           11,908,922\n   In barrels and kegs                                          1,245,125                1,245,143      1,245,125            1,245,143\n   Tax Determined, Premises Use                                     5,989                    5,267          5,989                5,267\n       Sub Total Taxable                                      12,822,933               13,159,332     12,822,933           13,159,332\nTax-free\n   For export                                                     264,669                  224,066        264,669              224,066\n   For vessels and aircraft                                             0                        0              0                    0\n   Consumed on brewery premises                                       886                      913            886                  913\n       Sub Total Tax-Free                                         265,555                  224,979        265,555              224,979\n   Total Removals                                             13,088,488               13,384,311     13,088,488           13,384,311\nStocks On Hand end-of-month:                                    9,896,961                9,993,268      9,896,961            9,993,268\nMATERIALS USED AT BREWERIES\n   Malt and malt products                                    322,480,722             330,304,432     322,480,722          330,304,432\n   Corn and corn products                                     58,632,672               56,705,162     58,632,672           56,705,162\n   Rice and rice products                                    108,112,318               59,701,345    108,112,318           59,701,345\n   Barley and barley products                                   4,705,175                3,668,374      4,705,175            3,668,374\n   Wheat and wheat products                                     1,210,137                1,409,685      1,210,137            1,409,685\n      Total Grain products                                   495,141,024             451,788,998     495,141,024          451,788,998\n   Sugar and syrups                                           73,793,509               47,308,358     73,793,509           47,308,358\n   Hops (dry)                                                   6,059,066                4,765,924      6,059,066            4,765,924\n   Hops (used as extracts)                                        296,605                  271,405        296,605              271,405\n   Other                                                        7,972,930              10,537,742       7,972,930          10,537,742\n       Total Non-Grain products                               88,122,110               62,883,429     88,122,110           62,883,429\nTotal Used                                                   583,263,134             514,672,427     583,263,134          514,672,427\n         296,605 Pounds of hops is equivalent to           212,541   pounds of extract JAN 2011\n         271,405 Pounds of hops is equivalent to           101,087   pounds of extract JAN 2010\nNOTE: Changes in figures from prior reports could be due to amended reports being filed.\n        This data is not final and may need to be amended.\nhttp://www.ttb.gov\n"

Split by row

We can use stringr::str_split() to separate the text at each of the \n newlines. This generates a list of character strings, we call unlist() to extract to a vector. We now have a nicely separated vector of character strings, where each row is a new string.


raw_text <- pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf") %>% 
  str_split("\n") %>% 
  unlist()

raw_text

 [1] "                                                                                                                  Report Date:"        
 [2] "                                              DEPARTMENT OF THE TREASURY