Scraping Complex Tables from PDFs with PDF Tools

The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.

Load Libraries

We’ll use ROpenSci’s pdftools package along with several tidyverse packages: - stringr - text manipulation - dplyr - general data manipulation - tidyr - data cleaning - purrr - repeated application of a function

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(pdftools)

Using poppler version 22.04.0

PDFs

The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.

# General function for download
download_monthly_stats_pdf <- function(year){
  
  message(paste0("Downloading ", year))
  
  # The general format is yearmonth like 201101 for Jan 2011.
  month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
  
  year_vec <- rep(year, 12)
  
  url_build <- function(year_vec, month_in){
      glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
  }
  
  # output to the pdfs folder
  download_monthly_pdf <- function(year, month, url_in){
    download.file(
      url = url_in,
      destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
      )
  }
  
  # build all the input urls and attach to an input dataframe
  full_df <- tibble(year = year_vec, month = month_in) %>% 
    mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build)) 
  
  # The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
  pwalk(full_df, .f = download_monthly_pdf)
  
}

We could apply that function to all the years of interest with another purrr::walk() call. This will run download_monthly_stats_pdf() for 2011, 2012, 2013, and 2014.

walk(c(2011:2014), download_monthly_stats_pdf)

List File Names

# list all the files we have downloaded so far
all_files <- list.files("pdfs")
length(all_files)

[1] 48

We have 48 PDFs, as expected - 12 months x 4 years = 48!

Now let’s take a peek inside one of the PDFs.

Raw PDFs

When we run pdftools::pdf_text() we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between “columns” in the table. Importantly, each line of the PDF is separated by a newline \n. This is key to our strategy for pulling out individual lines.

pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf")

[1] "                                                                                                                          Report Date:\n                                           DEPARTMENT OF THE TREASURY                                                     30-MAR-2011\n                                 ALCOHOL AND TOBACCO TAX AND TRADE BUREAU\n                                                                                                                          Report Symbol:\n                                                  STATISTICAL REPORT - BEER                                               TTB S 5130-01-2011\n\n                                                      Reporting Period: January 2011                                      Page: 1 of 1\n\n\n                                                                                                       Current Year             Prior Year\n                                                                                   Prior Year           Cumulative             Cumulative\nMANUFACTURE OF BEER                                     Current Month            Current Month         Year to Date            Year to Date\n\nProduction                                                    14,981,472              15,012,331          14,981,472               15,012,331\nRemovals\nTaxable ($7.00/$18.00 per barrel)\n  In bottles and cans                                         11,571,819              11,908,922          11,571,819               11,908,922\n  In barrels and kegs                                          1,245,125                   1,245,143       1,245,125                1,245,143\n  Tax Determined, Premises Use                                      5,989                     5,267            5,989                     5,267\n    Sub Total Taxable                                         12,822,933              13,159,332          12,822,933               13,159,332\nTax-free\n  For export                                                     264,669                    224,066          264,669                  224,066\n  For vessels and aircraft                                               0                        0                   0                       0\n  Consumed on brewery premises                                        886                       913              886                       913\n     Sub Total Tax-Free                                          265,555                    224,979          265,555                  224,979\n  Total Removals                                              13,088,488              13,384,311          13,088,488               13,384,311\nStocks On Hand end-of-month:                                   9,896,961                   9,993,268       9,896,961                9,993,268\n\n\nMATERIALS USED AT BREWERIES\n\n  Malt and malt products                                    322,480,722              330,304,432         322,480,722              330,304,432\n  Corn and corn products                                      58,632,672              56,705,162          58,632,672               56,705,162\n  Rice and rice products                                    108,112,318               59,701,345         108,112,318               59,701,345\n  Barley and barley products                                   4,705,175                   3,668,374       4,705,175                3,668,374\n  Wheat and wheat products                                     1,210,137                   1,409,685       1,210,137                1,409,685\n    Total Grain products                                    495,141,024              451,788,998         495,141,024              451,788,998\n\n  Sugar and syrups                                            73,793,509              47,308,358          73,793,509               47,308,358\n  Hops (dry)                                                   6,059,066                   4,765,924       6,059,066                4,765,924\n  Hops (used as extracts)                                        296,605                    271,405          296,605                  271,405\n  Other                                                        7,972,930              10,537,742           7,972,930               10,537,742\n     Total Non-Grain products                                 88,122,110              62,883,429          88,122,110               62,883,429\nTotal Used                                                  583,263,134              514,672,427         583,263,134              514,672,427\n\n\n        296,605 Pounds of hops is equivalent to          212,541     pounds of extract JAN 2011\n        271,405 Pounds of hops is equivalent to          101,087     pounds of extract JAN 2010\n\n\n\n\nNOTE: Changes in figures from prior reports could be due to amended reports being filed.\n     This data is not final and may need to be amended.\n\n\n\nhttp://www.ttb.gov\n"

Split by row

We can use stringr::str_split() to separate the text at each of the \n newlines. This generates a list of character strings, we call unlist() to extract to a vector. We now have a nicely separated vector of character strings, where each row is a new string.

raw_text <- pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf") %>% 
  str_split("\n") %>% 
  unlist()
raw_text

 [1] "                                                                                                                          Report Date:"         
 [2] "                                           DEPARTMENT OF THE TREASURY                                                     30-MAR-2011"          
 [3] "                                 ALCOHOL AND TOBACCO TAX AND TRADE BUREAU"                                                                      
 [4] "                                                                                                                          Report Symbol:"       
 [5] "                                                  STATISTICAL REPORT - BEER                                               TTB S 5130-01-2011"   
 [6] ""                                                                                                                                               
 [7] "                                                      Reporting Period: January 2011                                      Page: 1 of 1"         
 [8] ""                                                                                                                                               
 [9] ""                                                                                                                                               
[10] "                                                                                                       Current Year             Prior Year"     
[11] "                                                                                   Prior Year           Cumulative             Cumulative"      
[12] "MANUFACTURE OF BEER                                     Current Month            Current Month         Year to Date            Year to Date"    
[13] ""                                                                                                                                               
[14] "Production                                                    14,981,472              15,012,331          14,981,472               15,012,331"  
[15] "Removals"                                                                                                                                       
[16] "Taxable ($7.00/$18.00 per barrel)"                                                                                                              
[17] "  In bottles and cans                                         11,571,819              11,908,922          11,571,819               11,908,922"  
[18] "  In barrels and kegs                                          1,245,125                   1,245,143       1,245,125                1,245,143"  
[19] "  Tax Determined, Premises Use                                      5,989                     5,267            5,989                     5,267" 
[20] "    Sub Total Taxable                                         12,822,933              13,159,332          12,822,933               13,159,332"  
[21] "Tax-free"                                                                                                                                       
[22] "  For export                                                     264,669                    224,066          264,669                  224,066"  
[23] "  For vessels and aircraft                                               0                        0                   0                       0"
[24] "  Consumed on brewery premises                                        886                       913              886                       913" 
[25] "     Sub Total Tax-Free                                          265,555                    224,979          265,555                  224,979"  
[26] "  Total Removals                                              13,088,488              13,384,311          13,088,488               13,384,311"  
[27] "Stocks On Hand end-of-month:                                   9,896,961                   9,993,268       9,896,961                9,993,268"  
[28] ""                                                                                                                                               
[29] ""                                                                                                                                               
[30] "MATERIALS USED AT BREWERIES"                                                                                                                    
[31] ""                                                                                                                                               
[32] "  Malt and malt products                                    322,480,722              330,304,432         322,480,722              330,304,432"  
[33] "  Corn and corn products                                      58,632,672              56,705,162          58,632,672               56,705,162"  
[34] "  Rice and rice products                                    108,112,318               59,701,345         108,112,318               59,701,345"  
[35] "  Barley and barley products                                   4,705,175                   3,668,374       4,705,175                3,668,374"  
[36] "  Wheat and wheat products                                     1,210,137                   1,409,685       1,210,137                1,409,685"  
[37] "    Total Grain products                                    495,141,024              451,788,998         495,141,024              451,788,998"  
[38] ""                                                                                                                                               
[39] "  Sugar and syrups                                            73,793,509              47,308,358          73,793,509               47,308,358"  
[40] "  Hops (dry)                                                   6,059,066                   4,765,924       6,059,066                4,765,924"  
[41] "  Hops (used as extracts)                                        296,605                    271,405          296,605                  271,405"  
[42] "  Other                                                        7,972,930              10,537,742           7,972,930               10,537,742"  
[43] "     Total Non-Grain products                                 88,122,110              62,883,429          88,122,110               62,883,429"  
[44] "Total Used                                                  583,263,134              514,672,427         583,263,134              514,672,427"  
[45] ""                                                                                                                                               
[46] ""                                                                                                                                               
[47] "        296,605 Pounds of hops is equivalent to          212,541     pounds of extract JAN 2011"                                                
[48] "        271,405 Pounds of hops is equivalent to          101,087     pounds of extract JAN 2010"                                                
[49] ""                                                                                                                                               
[50] ""                                                                                                                                               
[51] ""                                                                                                                                               
[52] ""                                                                                                                                               
[53] "NOTE: Changes in figures from prior reports could be due to amended reports being filed."                                                       
[54] "     This data is not final and may need to be amended."                                                                                        
[55] ""                                                                                                                                               
[56] ""                                                                                                                                               
[57] ""                                                                                                                                               
[58] "http://www.ttb.gov"                                                                                                                             
[59] ""

Build Table

Now that we have the data split into a vector we can start finding “rows” to drop. We can see that the 9th string is actually the column titles, and the table ends at the 36th string. However, this could change according to which PDF we are looking at, so rather than going by position we can use stringr::str_which() to match a logical with matched text.

# Start of table - column names
raw_text[9]

[1] ""

# End of table - last value
raw_text[36]

[1] "  Wheat and wheat products                                     1,210,137                   1,409,685       1,210,137                1,409,685"

We get the same “rows” with our matching str_which().

# find start of table
stringr::str_which(raw_text, "MANUFACTURE OF BEER")

[1] 12

# find end of table
stringr::str_which(raw_text, "Total Used")

[1] 44

Let’s actually assign this now, rather than just printing. We can also remove leading/trailing whitespace with stringr::str_trim(). When we look at table_trimmed we can “see” a group of text strings that much closer resemble a table!

table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
  
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
  
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>% 
  str_trim()
table_trimmed

 [1] "MANUFACTURE OF BEER                                     Current Month            Current Month         Year to Date            Year to Date"  
 [2] ""                                                                                                                                             
 [3] "Production                                                    14,981,472              15,012,331          14,981,472               15,012,331"
 [4] "Removals"                                                                                                                                     
 [5] "Taxable ($7.00/$18.00 per barrel)"                                                                                                            
 [6] "In bottles and cans                                         11,571,819              11,908,922          11,571,819               11,908,922"  
 [7] "In barrels and kegs                                          1,245,125                   1,245,143       1,245,125                1,245,143"  
 [8] "Tax Determined, Premises Use                                      5,989                     5,267            5,989                     5,267" 
 [9] "Sub Total Taxable                                         12,822,933              13,159,332          12,822,933               13,159,332"    
[10] "Tax-free"                                                                                                                                     
[11] "For export                                                     264,669                    224,066          264,669                  224,066"  
[12] "For vessels and aircraft                                               0                        0                   0                       0"
[13] "Consumed on brewery premises                                        886                       913              886                       913" 
[14] "Sub Total Tax-Free                                          265,555                    224,979          265,555                  224,979"     
[15] "Total Removals                                              13,088,488              13,384,311          13,088,488               13,384,311"  
[16] "Stocks On Hand end-of-month:                                   9,896,961                   9,993,268       9,896,961                9,993,268"
[17] ""                                                                                                                                             
[18] ""                                                                                                                                             
[19] "MATERIALS USED AT BREWERIES"                                                                                                                  
[20] ""                                                                                                                                             
[21] "Malt and malt products                                    322,480,722              330,304,432         322,480,722              330,304,432"  
[22] "Corn and corn products                                      58,632,672              56,705,162          58,632,672               56,705,162"  
[23] "Rice and rice products                                    108,112,318               59,701,345         108,112,318               59,701,345"  
[24] "Barley and barley products                                   4,705,175                   3,668,374       4,705,175                3,668,374"  
[25] "Wheat and wheat products                                     1,210,137                   1,409,685       1,210,137                1,409,685"  
[26] "Total Grain products                                    495,141,024              451,788,998         495,141,024              451,788,998"    
[27] ""                                                                                                                                             
[28] "Sugar and syrups                                            73,793,509              47,308,358          73,793,509               47,308,358"  
[29] "Hops (dry)                                                   6,059,066                   4,765,924       6,059,066                4,765,924"  
[30] "Hops (used as extracts)                                        296,605                    271,405          296,605                  271,405"  
[31] "Other                                                        7,972,930              10,537,742           7,972,930               10,537,742"  
[32] "Total Non-Grain products                                 88,122,110              62,883,429          88,122,110               62,883,429"     
[33] "Total Used                                                  583,263,134              514,672,427         583,263,134              514,672,427"

Remove all the extra whitespace

Next we need to remove all the huge whitespaces from between columns. The regular expression (regex) of "\\s{2,}" matches whitespaces of 2 or more. If we use stringr::str_replace_all() to take all the whitespaces > 2 and replace with a new delimiter such as "|" we can move to our next step. While we’re at it, let’s remove all the commas so that we can go straight to doubles rather than characters for all the beer production variables.

# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>% 
  str_remove_all(",")
squished_table

 [1] "MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to Date"
 [2] ""                                                                         
 [3] "Production|14981472|15012331|14981472|15012331"                           
 [4] "Removals"                                                                 
 [5] "Taxable ($7.00/$18.00 per barrel)"                                        
 [6] "In bottles and cans|11571819|11908922|11571819|11908922"                  
 [7] "In barrels and kegs|1245125|1245143|1245125|1245143"                      
 [8] "Tax Determined Premises Use|5989|5267|5989|5267"                          
 [9] "Sub Total Taxable|12822933|13159332|12822933|13159332"                    
[10] "Tax-free"                                                                 
[11] "For export|264669|224066|264669|224066"                                   
[12] "For vessels and aircraft|0|0|0|0"                                         
[13] "Consumed on brewery premises|886|913|886|913"                             
[14] "Sub Total Tax-Free|265555|224979|265555|224979"                           
[15] "Total Removals|13088488|13384311|13088488|13384311"                       
[16] "Stocks On Hand end-of-month:|9896961|9993268|9896961|9993268"             
[17] ""                                                                         
[18] ""                                                                         
[19] "MATERIALS USED AT BREWERIES"                                              
[20] ""                                                                         
[21] "Malt and malt products|322480722|330304432|322480722|330304432"           
[22] "Corn and corn products|58632672|56705162|58632672|56705162"               
[23] "Rice and rice products|108112318|59701345|108112318|59701345"             
[24] "Barley and barley products|4705175|3668374|4705175|3668374"               
[25] "Wheat and wheat products|1210137|1409685|1210137|1409685"                 
[26] "Total Grain products|495141024|451788998|495141024|451788998"             
[27] ""                                                                         
[28] "Sugar and syrups|73793509|47308358|73793509|47308358"                     
[29] "Hops (dry)|6059066|4765924|6059066|4765924"                               
[30] "Hops (used as extracts)|296605|271405|296605|271405"                      
[31] "Other|7972930|10537742|7972930|10537742"                                  
[32] "Total Non-Grain products|88122110|62883429|88122110|62883429"             
[33] "Total Used|583263134|514672427|583263134|514672427"

Convert to tibble

Now we have a nicely formatted vector of strings! We can use tibble::enframe() to create a dataframe/tibble out of the vector.

# Convert to tibble
raw_df <- enframe(squished_table)
raw_df

# A tibble: 33 × 2
    name value                                                                  
   <int> <chr>                                                                  
 1     1 "MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to …
 2     2 ""                                                                     
 3     3 "Production|14981472|15012331|14981472|15012331"                       
 4     4 "Removals"                                                             
 5     5 "Taxable ($7.00/$18.00 per barrel)"                                    
 6     6 "In bottles and cans|11571819|11908922|11571819|11908922"              
 7     7 "In barrels and kegs|1245125|1245143|1245125|1245143"                  
 8     8 "Tax Determined Premises Use|5989|5267|5989|5267"                      
 9     9 "Sub Total Taxable|12822933|13159332|12822933|13159332"                
10    10 "Tax-free"                                                             
# … with 23 more rows

Next we can separate value into the 5 columns. Notice that there are a few “rows” where the data is NA as there were rows that acted only as indicators of the type of beer production. We’ll use them later.

year <- 2011
month <- "02"
# Convert to tibble
beer_df <- raw_df %>% 
    separate(value, 
             into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"), 
             sep = "\\|") %>% 
  slice(-1) %>% 
  mutate_at(vars(month_current:ytd_prior_year), as.double) %>% 
  mutate(year = as.integer(year), month = as.integer(month)) %>% 
  select(year, month, type, everything())

Warning: Expected 5 pieces. Missing pieces filled with `NA` in 9 rows [2, 4, 5,
10, 17, 18, 19, 20, 27].

beer_df

# A tibble: 32 × 8
    year month type              name month_current month_prior_year ytd_current
   <int> <int> <chr>            <int>         <dbl>            <dbl>       <dbl>
 1  2011     2 ""                   2            NA               NA          NA
 2  2011     2 "Production"         3      14981472         15012331    14981472
 3  2011     2 "Removals"           4            NA               NA          NA
 4  2011     2 "Taxable ($7.00…     5            NA               NA          NA
 5  2011     2 "In bottles and…     6      11571819         11908922    11571819
 6  2011     2 "In barrels and…     7       1245125          1245143     1245125
 7  2011     2 "Tax Determined…     8          5989             5267        5989
 8  2011     2 "Sub Total Taxa…     9      12822933         13159332    12822933
 9  2011     2 "Tax-free"          10            NA               NA          NA
10  2011     2 "For export"        11        264669           224066      264669
# … with 22 more rows, and 1 more variable: ytd_prior_year <dbl>

Technically at this point, we have successfully converted from raw text to a dataframe/table/tibble! HOWEVER, for many many examples in the wild you will need to do additional data cleaning, data manipulation, factor assignment, etc. As such, I’ll continue working on this to get to a final output. I’ll also work on repeating this many times as opposed to one time.

Alternative method via `readr`

Thanks to Grant McDermott for bringing up a good point here - I based this method off of tables where the white-space between columns is varying. If the white space is fixed between columns you could skip some steps as seen in the below example using readr::read_fwf(), courtesy of Grant. I believe for most cases either using readr::read_table() or readr::read_fwf() would be simple, but will keep the additional workflow steps in case they help someone down the road!

Overall, using readr to natively parse the table-format could save the workflow step of trimming, coercing to a tibble, and then separating, and just requires you to indicate the spacing of empty cells either manually with fwf_widths() or guessing/parsing of columns via fwf_empty(). Note that as of readrv2.0, you’ll need to wrap literal in-memory data with I() - full details in readr 2.0 announcement blogpost.

table_start_fwf <- stringr::str_which(raw_text, "Production") ## Changed since we're dropping the first row anyway
table_end_fwf <- stringr::str_which(raw_text, "Total Used")
## Trim the table to the start/end (NB: Don't drop whitespace this time!)
table_trimmed_fwf <- raw_text[table_start_fwf:table_end_fwf]
beer_df_fwf <- read_fwf(I(table_trimmed_fwf), 
                        fwf_empty(I(table_trimmed_fwf), 
                                  col_names = c("type", 
                                                "month_current", 
                                                "month_prior_year", 
                                                "ytd_current", 
                                                "ytd_prior_year")
                                  )
                        )

Rows: 31 Columns: 6
── Column specification ────────────────────────────────────────────────────────

chr (1): type
dbl (1): ytd_prior_year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

beer_df_fwf

# A tibble: 31 × 6
   type        month_current month_prior_year ytd_current ytd_prior_year      X6
   <chr>               <dbl>            <dbl>       <dbl>          <dbl>   <dbl>
 1 Production       14981472         15012331    14981472             NA  1.50e7
 2 Removals               NA               NA          NA             NA NA     
 3 Taxable ($…            NA               NA          NA             NA NA     
 4 In bottles…      11571819         11908922    11571819             NA  1.19e7
 5 In barrels…       1245125          1245143     1245125             NA  1.25e6
 6 Tax Determ…          5989             5267        5989             NA  5.27e3
 7 Sub Total …      12822933         13159332    12822933             NA  1.32e7
 8 Tax-free               NA               NA          NA             NA NA     
 9 For export         264669           224066      264669             NA  2.24e5
10 For vessel…             0                0          NA              0  0     
# … with 21 more rows

One more alternative would be to just use readr::read_table() or readr::read_table2(). Now in practice this should be fairly robust, and works just fine for the examples here, but for messier tables it may fail which leads to the more complex and longer workflow shown below. Specifically, from the readr::read_table() docs:

read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) > columns of space.

read_table2() is like read.table(), it allows any number of whitespace characters between columns, and the lines can be of different lengths.

read_table() is more strict, each line must be the same length, and each field is in the same position in every line. It first finds empty columns and then parses like a fixed width file.

read_table(raw_text[table_start:table_end], skip =1,
           col_names = c("type", "month_current", "month_prior_year", 
                         "ytd_current", "ytd_prior_year"))

Warning: 25 parsing failures.
row col  expected    actual         file
  2  -- 5 columns 1 columns literal data
  3  -- 5 columns 4 columns literal data
  4  -- 5 columns 8 columns literal data
  5  -- 5 columns 8 columns literal data
  6  -- 5 columns 8 columns literal data
... ... ......... ......... ............
See problems(...) for more details.

# A tibble: 27 × 5
   type       month_current month_prior_year ytd_current ytd_prior_year
   <chr>      <chr>         <chr>            <chr>                <dbl>
 1 Production 14,981,472    15,012,331       14,981,472        15012331
 2 Removals   <NA>          <NA>             <NA>                    NA
 3 Taxable    ($7.00/$18.00 per              barrel)                 NA
 4 In         bottles       and              cans              11571819
 5 In         barrels       and              kegs               1245125
 6 Tax        Determined,   Premises         Use                   5989
 7 Sub        Total         Taxable          12,822,933        13159332
 8 Tax-free   <NA>          <NA>             <NA>                    NA
 9 For        export        264,669          224,066             264669
10 For        vessels       and              aircraft                 0
# … with 17 more rows

Proper Cleaning

This is actually two datasets that are combined into one large reporting table. As such we need to identify the specific row/point to split the dataset at. We can filter to just the row that matches either the string “MATERIALS USED” or “IN POUNDS”, as that indicates a label starting the 2nd dataset.

slice_num <- beer_df %>% 
  # find a string that has MATERIALS USED or IN POUNDS
  # | means OR
    filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
    pull(name)
slice_num

[1] 19

Split dataframe

Next we will add a column based on logic for the slice_num, and assign a grouping variable for either Barrels Produced (dataset 1) or Pounds of Materials Used (dataset 2). We can then drop the unneeded rows with a filter(), group_by() the newly produced grouping variable, and use dplyr::group_split() to separate the combined dataset into a list of both datasets.

# split data into materials vs barrels produced
split_df <- beer_df %>% 
  mutate(data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
         type = str_remove(type, ":")) %>% 
  select(data_type, everything(), -name) %>% 
  filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>% 
  group_by(data_type) %>% 
  group_split()
glimpse(split_df)

list<tibble[,8]> [1:2] 
$ : tibble [17 × 8] (S3: tbl_df/tbl/data.frame)
$ : tibble [14 × 8] (S3: tbl_df/tbl/data.frame)
@ ptype: tibble [0 × 8] (S3: tbl_df/tbl/data.frame)

Factor cleaning and final dataframes

We can see that the split_df object is a list of 2 tibbles/dataframes. We can now operate on the individual dataframes and finalize the factor cleaning and assignment to make the data a bit tidier and analysis ready.

manufacture_df <- split_df[[1]] %>% 
  mutate(
    tax_status = case_when(
      type %in% c("In bottles and cans", "In kegs", "In barrels and kegs",
                  "Tax Determined, Premises Use") ~ "Taxable",
      type == "Sub Total Taxable" ~ "Sub Total Taxable",
      type %in% c("For export", "For vessels and aircraft", 
                  "Consumed on brewery premises") ~ "Tax Free",
      type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
      type %in% c("Production", "Total Removals", 
                  "Stocks On Hand end-of-month:") ~ "Totals"
      ),
    tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
    ) %>% 
  filter(!is.na(tax_status)) %>% 
  select(data_type, tax_status, everything())

# clean up the material dataset
material_df <- split_df[[2]] %>% 
  mutate(
    material_type = case_when(
      str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
      str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
      str_detect(type, "Total") ~ type
    )
  ) %>% 
  select(data_type, material_type, everything())

Print the dataframes

The manufacture dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific barrels produced.

manufacture_df

# A tibble: 9 × 10
  data_type        tax_status    year month type  month_current month_prior_year
  <chr>            <chr>        <int> <int> <chr>         <dbl>            <dbl>
1 Barrels Produced Totals        2011     2 Prod…      14981472         15012331
2 Barrels Produced Taxable       2011     2 In b…      11571819         11908922
3 Barrels Produced Taxable       2011     2 In b…       1245125          1245143
4 Barrels Produced Sub Total T…  2011     2 Sub …      12822933         13159332
5 Barrels Produced Tax Free      2011     2 For …        264669           224066
6 Barrels Produced Tax Free      2011     2 For …             0                0
7 Barrels Produced Tax Free      2011     2 Cons…           886              913
8 Barrels Produced Sub Total T…  2011     2 Sub …        265555           224979
9 Barrels Produced Totals        2011     2 Tota…      13088488         13384311
# … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
#   tax_rate <chr>

The material dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific pounds of product used.

material_df

# A tibble: 14 × 9
   data_type      material_type  year month type  month_current month_prior_year
   <chr>          <chr>         <int> <int> <chr>         <dbl>            <dbl>
 1 Pounds of Mat… <NA>           2011     2 ""               NA               NA
 2 Pounds of Mat… Grain Produc…  2011     2 "Mal…     322480722        330304432
 3 Pounds of Mat… Grain Produc…  2011     2 "Cor…      58632672         56705162
 4 Pounds of Mat… Grain Produc…  2011     2 "Ric…     108112318         59701345
 5 Pounds of Mat… Grain Produc…  2011     2 "Bar…       4705175          3668374
 6 Pounds of Mat… Grain Produc…  2011     2 "Whe…       1210137          1409685
 7 Pounds of Mat… Total Grain …  2011     2 "Tot…     495141024        451788998
 8 Pounds of Mat… <NA>           2011     2 ""               NA               NA
 9 Pounds of Mat… Non-Grain Pr…  2011     2 "Sug…      73793509         47308358
10 Pounds of Mat… Non-Grain Pr…  2011     2 "Hop…       6059066          4765924
11 Pounds of Mat… Non-Grain Pr…  2011     2 "Hop…        296605           271405
12 Pounds of Mat… Non-Grain Pr…  2011     2 "Oth…       7972930         10537742
13 Pounds of Mat… Total Non-Gr…  2011     2 "Tot…      88122110         62883429
14 Pounds of Mat… Total Used     2011     2 "Tot…     583263134        514672427
# … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

Finished Cleaning

We have now finished cleaning the manufacting and material dataframes! However, we did this all line-by-line without functions and would need to repeat this for the other 47 PDFs! Let’s convert ALL that code into a function that outputs the final dataframes.

Use a function

# create a function that works for most years
get_beer_tables <- function(year, month) {
  
  # read in the raw PDF
  raw_text <- pdftools::pdf_text(glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")) %>%
    str_split("\n") %>%
    unlist()
  ## Build Table
  # find start of table
  table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
  # End of table (drop all the asterisks and the other descriptors)
  table_end <- stringr::str_which(raw_text, "Total Used")
  # Trim the table to the start/end and drop whitespace at each line
  table_trimmed <- raw_text[table_start:table_end] %>%
    str_trim()
  table_trimmed
  ### Remove all the extra whitespace
  # Replace long spaces with a col break symbol
  squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
    str_remove_all(",")
  ### Convert to tibble
  # Convert to tibble
  raw_df <- enframe(squished_table)
  
  # split the rows into their columns
  beer_df <- suppressWarnings(raw_df %>%
    separate(value,
      into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
      sep = "\\|"
    ) %>%
    slice(-1) %>%
    mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
    mutate(year = as.integer(year), month = as.integer(month)) %>%
    select(year, month, type, everything()))
  ### Proper Cleaning
  # ID the specific row/point to split the dataset at.
  slice_num <- beer_df %>%
    # find a string that has MATERIALS USED or IN POUNDS
    # | means OR
    filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
    pull(name)
  #### Split dataframe
  # split data into materials vs barrels produced
  split_df <- suppressWarnings(beer_df %>%
    mutate(
      data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
      type = str_remove(type, ":")
    ) %>%
    select(data_type, everything(), -name) %>%
    filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
    group_by(data_type) %>%
    group_split())
  #### Factor cleaning and final dataframes
  # clean manufacture df
  manufacture_df <- split_df[[1]] %>%
    mutate(
      tax_status = case_when(
        type %in% c(
          "In bottles and cans", "In kegs", "In barrels and kegs",
          "Tax Determined, Premises Use"
        ) ~ "Taxable",
        type == "Sub Total Taxable" ~ "Sub Total Taxable",
        type %in% c(
          "For export", "For vessels and aircraft",
          "Consumed on brewery premises"
        ) ~ "Tax Free",
        type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
        type %in% c(
          "Production", "Total Removals",
          "Stocks On Hand end-of-month:"
        ) ~ "Totals"
      ),
      tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
    ) %>%
    filter(!is.na(tax_status)) %>%
    select(data_type, tax_status, everything())
  # clean up the material dataset
  material_df <- split_df[[2]] %>%
    mutate(
      material_type = case_when(
        str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
        str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
        str_detect(type, "Total") ~ type
      )
    ) %>%
    select(data_type, material_type, everything())
  # output a list of both dfs
  list(manufacture_df, material_df)
}

Really the only code we have changed is we added a glue call to add the year, month to which PDF to read in, and we have the output as a list of both dataframes. Let’s test our function!

get_beer_tables(2011, "01")

[[1]]
# A tibble: 9 × 10
  data_type        tax_status    year month type  month_current month_prior_year
  <chr>            <chr>        <int> <int> <chr>         <dbl>            <dbl>
1 Barrels Produced Totals        2011     1 Prod…      14981472         15012331
2 Barrels Produced Taxable       2011     1 In b…      11571819         11908922
3 Barrels Produced Taxable       2011     1 In b…       1245125          1245143
4 Barrels Produced Sub Total T…  2011     1 Sub …      12822933         13159332
5 Barrels Produced Tax Free      2011     1 For …        264669           224066
6 Barrels Produced Tax Free      2011     1 For …             0                0
7 Barrels Produced Tax Free      2011     1 Cons…           886              913
8 Barrels Produced Sub Total T…  2011     1 Sub …        265555           224979
9 Barrels Produced Totals        2011     1 Tota…      13088488         13384311
# … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
#   tax_rate <chr>

[[2]]
# A tibble: 14 × 9
   data_type      material_type  year month type  month_current month_prior_year
   <chr>          <chr>         <int> <int> <chr>         <dbl>            <dbl>
 1 Pounds of Mat… <NA>           2011     1 ""               NA               NA
 2 Pounds of Mat… Grain Produc…  2011     1 "Mal…     322480722        330304432
 3 Pounds of Mat… Grain Produc…  2011     1 "Cor…      58632672         56705162
 4 Pounds of Mat… Grain Produc…  2011     1 "Ric…     108112318         59701345
 5 Pounds of Mat… Grain Produc…  2011     1 "Bar…       4705175          3668374
 6 Pounds of Mat… Grain Produc…  2011     1 "Whe…       1210137          1409685
 7 Pounds of Mat… Total Grain …  2011     1 "Tot…     495141024        451788998
 8 Pounds of Mat… <NA>           2011     1 ""               NA               NA
 9 Pounds of Mat… Non-Grain Pr…  2011     1 "Sug…      73793509         47308358
10 Pounds of Mat… Non-Grain Pr…  2011     1 "Hop…       6059066          4765924
11 Pounds of Mat… Non-Grain Pr…  2011     1 "Hop…        296605           271405
12 Pounds of Mat… Non-Grain Pr…  2011     1 "Oth…       7972930         10537742
13 Pounds of Mat… Total Non-Gr…  2011     1 "Tot…      88122110         62883429
14 Pounds of Mat… Total Used     2011     1 "Tot…     583263134        514672427
# … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

Boom! Function is working for our example, let’s try it out with more than 1 input via purrr!

`purrr` - iteration without repetition

We’ll be using pmap() to apply our function multiple times, where pmap can take any number of inputs. For example if we call get_beer_tables() via pmap, we can get our tables for that 1 year/month combo!

# Quick test of purrr
pmap(list(2011, "02"), get_beer_tables)

[[1]]
[[1]][[1]]
# A tibble: 9 × 10
  data_type        tax_status    year month type  month_current month_prior_year
  <chr>            <chr>        <int> <int> <chr>         <dbl>            <dbl>
1 Barrels Produced Totals        2011     2 Prod…      14350832         14297845
2 Barrels Produced Taxable       2011     2 In b…      11509264         11847648
3 Barrels Produced Taxable       2011     2 In b…       1234147          1308677
4 Barrels Produced Sub Total T…  2011     2 Sub …      12749447         13161061
5 Barrels Produced Tax Free      2011     2 For …        279150           263721
6 Barrels Produced Tax Free      2011     2 For …             0                0
7 Barrels Produced Tax Free      2011     2 Cons…           942              719
8 Barrels Produced Sub Total T…  2011     2 Sub …        280092           264440
9 Barrels Produced Totals        2011     2 Tota…      13029539         13425501
# … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
#   tax_rate <chr>

[[1]][[2]]
# A tibble: 14 × 9
   data_type      material_type  year month type  month_current month_prior_year
   <chr>          <chr>         <int> <int> <chr>         <dbl>            <dbl>
 1 Pounds of Mat… <NA>           2011     2 ""               NA               NA
 2 Pounds of Mat… Grain Produc…  2011     2 "Mal…     307076591        305543380
 3 Pounds of Mat… Grain Produc…  2011     2 "Cor…      53981943         54486996
 4 Pounds of Mat… Grain Produc…  2011     2 "Ric…      54287863         55198940
 5 Pounds of Mat… Grain Produc…  2011     2 "Bar…       4322047          3726630
 6 Pounds of Mat… Grain Produc…  2011     2 "Whe…        955671          1683401
 7 Pounds of Mat… Total Grain …  2011     2 "Tot…     420624115        420639347
 8 Pounds of Mat… <NA>           2011     2 ""               NA               NA
 9 Pounds of Mat… Non-Grain Pr…  2011     2 "Sug…      63374850         46718854
10 Pounds of Mat… Non-Grain Pr…  2011     2 "Hop…       7617974          4899463
11 Pounds of Mat… Non-Grain Pr…  2011     2 "Hop…        275963           268589
12 Pounds of Mat… Non-Grain Pr…  2011     2 "Oth…       7997916          9022834
13 Pounds of Mat… Total Non-Gr…  2011     2 "Tot…      79266703         60909740
14 Pounds of Mat… Total Used     2011     2 "Tot…     499890818        481549087
# … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

However our goal is all the inputs at once! We can create a vector of the month inputs as character strings, and then use tidyr::crossing() to output all the possible combinations of year + month as a dataframe. Notice two columns, year and month with a length of 48 - equal to all of our PDFs!

# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data 
# 2010 is missing, but as the data has prior year data we can theoretically
# add it back in after the fact
crossing(
  year = c(2011:2014), 
  month = month_num
  ) %>% glimpse()

Rows: 48
Columns: 2
$ year  <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
$ month <chr> "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11"…

All possible combos

We can use tidyr::crossing() again to generate the possible inputs and create the output dataframes as list column of two dataframes. Running this takes only about 2 seconds across the 48 PDFs! The output is not very exciting as the data is simply the year & month columns, plus a list-column called data. Let’s get the final outputs!

# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data 
df_2011_2014 <- crossing(year = c(2011:2014), 
                         month = month_num) %>% 
  mutate(data = pmap(., get_beer_tables))
df_2011_2014

# A tibble: 48 × 3
    year month data      
   <int> <chr> <list>    
 1  2011 01    <list [2]>
 2  2011 02    <list [2]>
 3  2011 03    <list [2]>
 4  2011 04    <list [2]>
 5  2011 05    <list [2]>
 6  2011 06    <list [2]>
 7  2011 07    <list [2]>
 8  2011 08    <list [2]>
 9  2011 09    <list [2]>
10  2011 10    <list [2]>
# … with 38 more rows

Final output

We can now get just the output data, drop the other columns. We’re still working with list-columns, so let’s get to the manufacture_df and material_df.

final_output <- df_2011_2014 %>%
  # grab the data into respective columns
  mutate(manufacture_data = map(data, 1),
         material_data = map(data, 2)) %>% 
  select(manufacture_data, material_data)
final_output

# A tibble: 48 × 2
   manufacture_data  material_data    
   <list>            <list>           
 1 <tibble [9 × 10]> <tibble [14 × 9]>
 2 <tibble [9 × 10]> <tibble [14 × 9]>
 3 <tibble [9 × 10]> <tibble [14 × 9]>
 4 <tibble [9 × 10]> <tibble [14 × 9]>
 5 <tibble [9 × 10]> <tibble [14 × 9]>
 6 <tibble [9 × 10]> <tibble [14 × 9]>
 7 <tibble [9 × 10]> <tibble [14 × 9]>
 8 <tibble [9 × 10]> <tibble [14 × 9]>
 9 <tibble [9 × 10]> <tibble [14 × 9]>
10 <tibble [9 × 10]> <tibble [14 × 9]>
# … with 38 more rows

The manufacture dataframe can be combined as below.

# Grab just the manufacture data
manufacture_df <- final_output %>% 
  select(manufacture_data) %>% 
  unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>% 
  select(material_data) %>% 
  unnest(material_data)

And now we can look at the outputs!

Manufacture dataset

glimpse(manufacture_df)

Rows: 432
Columns: 10
$ data_type        <chr> "Barrels Produced", "Barrels Produced", "Barrels Prod…
$ tax_status       <chr> "Totals", "Taxable", "Taxable", "Sub Total Taxable", …
$ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ month            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ type             <chr> "Production", "In bottles and cans", "In barrels and …
$ month_current    <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 886…
$ month_prior_year <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 913…
$ ytd_current      <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 886…
$ ytd_prior_year   <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 913…
$ tax_rate         <chr> "$7/$18 per barrel", "$7/$18 per barrel", "$7/$18 per…

Material dataset

glimpse(material_df)

Rows: 637
Columns: 9
$ data_type        <chr> "Pounds of Materials Used", "Pounds of Materials Used…
$ material_type    <chr> NA, "Grain Products", "Grain Products", "Grain Produc…
$ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ month            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,…
$ type             <chr> "", "Malt and malt products", "Corn and corn products…
$ month_current    <dbl> NA, 322480722, 58632672, 108112318, 4705175, 1210137,…
$ month_prior_year <dbl> NA, 330304432, 56705162, 59701345, 3668374, 1409685, …
$ ytd_current      <dbl> NA, 322480722, 58632672, 108112318, 4705175, 1210137,…
$ ytd_prior_year   <dbl> NA, 330304432, 56705162, 59701345, 3668374, 1409685, …

Do it all in 6 Lines of Code!

Now all of that could have been done in about 6 lines of tidyverse code since we created a function.

# Use crossing to generate all 48 combos for the data 
# Use purrr to read in, clean, and output the 96 tables from the 48 PDFs
final_output <- crossing(year = c(2011:2014), month = month_num) %>% 
  mutate(data = pmap(., get_beer_tables)) %>% 
  mutate(manufacture_data = map(data, 1), material_data = map(data, 2)) %>% 
  select(manufacture_data, material_data)
# Grab just the manufacture data
manufacture_df <- final_output %>% select(manufacture_data) %>% unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>% select(material_data) %>% unnest(material_data)

Expand for Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.0 (2022-04-22)
 os       macOS Monterey 12.2.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2022-04-28
 pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
 quarto   0.9.294 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.2.0)
 forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
 ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.2.0)
 pdftools    * 3.1.1   2022-02-25 [1] CRAN (R 4.2.0)
 purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
 readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
 tibble      * 3.1.6   2021-11-07 [1] CRAN (R 4.2.0)
 tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
 tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)

 [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────