A guide to extracting tables from many PDFs using the pdftools package
The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.
We’ll use ROpenSci’s pdftools
package along with several tidyverse
packages: - stringr
- text manipulation - dplyr
- general data manipulation - tidyr
- data cleaning - purrr
- repeated application of a function
library(tidyverse)
library(pdftools)
The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.
# General function for download
download_monthly_stats_pdf <- function(year){
message(paste0("Downloading ", year))
# The general format is yearmonth like 201101 for Jan 2011.
month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
year_vec <- rep(year, 12)
url_build <- function(year_vec, month_in){
glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
}
# output to the pdfs folder
download_monthly_pdf <- function(year, month, url_in){
download.file(
url = url_in,
destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
)
}
# build all the input urls and attach to an input dataframe
full_df <- tibble(year = year_vec, month = month_in) %>%
mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build))
# The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
pwalk(full_df, .f = download_monthly_pdf)
}
We could apply that function to all the years of interest with another purrr::walk()
call. This will run download_monthly_stats_pdf()
for 2011, 2012, 2013, and 2014.
walk(c(2011:2014), download_monthly_stats_pdf)
# list all the files we have downloaded so far
all_files <- list.files("pdfs")
length(all_files)
[1] 48
We have 48 PDFs, as expected - 12 months x 4 years = 48!
Now let’s take a peek inside one of the PDFs.