Flipping tibbles for many models

espnscrapeR NFL tidyverse tidymodels

Pivoting data from wide to long to run many models at once

Thomas Mock
05-01-2020

Many Models

The tidymodels ecosystem advocates a nest-map-unnest workflow for running many models at once. While this is typically used with some kind of grouping variable, as seen in the tidymodels docs, we can also use it to run many models at once from a wide dataset.

Our goal is to get all of the measures into a long-form tibble so that we can fit all of the models at once and plot them all at once.

Basic Example

This basic example is borrowed directly from the tidymodels docs.

First, nest the data by a grouping variable to get a list-column holding each split tibble.


library(tidyverse) # nest(), map(), unnest()
library(broom)     # tidy() for model output

mtcars <- as_tibble(mtcars)  # to play nicely with list-cols

nest_mtcars <- mtcars %>%
  nest(data = c(-am)) 

nest_mtcars

# A tibble: 2 x 2
     am data              
  <dbl> <list>            
1     1 <tibble [13 × 10]>
2     0 <tibble [19 × 10]>

Now you can apply an lm() call with purrr::map() to each dataset, and then broom::tidy() the model output!


nest_mtcars %>% 
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),  # S3 list-col
    tidied = map(fit, tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(-data, -fit)

# A tibble: 8 x 6
     am term        estimate std.error statistic  p.value
  <dbl> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1     1 (Intercept)   4.28      3.46      1.24   0.247   
2     1 mpg          -0.101     0.0294   -3.43   0.00750 
3     1 qsec          0.0398    0.151     0.264  0.798   
4     1 gear         -0.0229    0.349    -0.0656 0.949   
5     0 (Intercept)   4.92      1.40      3.52   0.00309 
6     0 mpg          -0.192     0.0443   -4.33   0.000591
7     0 qsec          0.0919    0.0983    0.935  0.365   
8     0 gear          0.147     0.368     0.398  0.696   

Now the model coefficients for automatic vs manual transmissions (0 vs 1) are easy to work with! We’ll use a similar nest-map-unnest approach, albeit with a slightly different data structure, for the following analysis.
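For instance (a quick sketch, not from the tidymodels docs), you could pivot the tidied estimates wider to line up the coefficients for automatic vs manual transmissions side by side:


# Compare coefficients across transmission types, one column per am value
nest_mtcars %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  select(am, term, estimate) %>%
  pivot_wider(names_from = am, values_from = estimate, names_prefix = "am_")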

NFL Example

We’ll be performing a similar analysis to the one Josh Hermsmeyer did in his 538 article. The raw code for his analysis is available on his GitHub. Essentially, he evaluated how stable metrics were year-over-year by comparing each team’s value of a metric in one season against its value of the same metric in the following season.

We’re skipping most of the nflscrapR aggregation (see the link for the full script); the portion we can change a bit to make our lives easier is the repeated modeling.

Rather than having to generate a model for every metric one by one, we can generate models for ALL the metrics in the dataset at once, while still fitting each metric only against the following year’s value of that same metric.


### QB Hits model  ------------------------------------------------------------

qb_hits_model <- lm(data = team_defense_joined, qb_hit.y ~ qb_hit.x)

qb_hits_stability <- glance(qb_hits_model) %>%
  mutate(metric = "QB Hits",
         r.squared = round(r.squared, 3)) %>%
  select(metric, r.squared)

Get the data

So let’s load our libraries and get to modeling!


library(tidyverse) # all the things
library(espnscrapeR) # NFL data summaries
library(broom) # tidy modeling
library(glue) # nice string creation
library(ggtext) # for formatted text in ggplot

First we’ll get all the data from the 2000-2019 seasons for NFL offenses via espnscrapeR::scrape_team_stats_nfl().


# Get data from espnscrapeR
all_off <- 2000:2019 %>% 
  map_dfr(scrape_team_stats_nfl)

glimpse(all_off)

Rows: 626
Columns: 19
$ season         <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ role           <chr> "OFFENSE", "OFFENSE", "OFFENSE", "OFFENSE", …
$ stat           <chr> "PASSING", "PASSING", "PASSING", "PASSING", …
$ team           <chr> "Redskins", "Buccaneers", "Seahawks", "49ers…
$ pass_att       <int> 561, 433, 507, 583, 578, 439, 554, 575, 637,…
$ pass_comp      <int> 343, 237, 308, 366, 311, 217, 316, 331, 352,…
$ pass_comp_pct  <dbl> 0.611, 0.547, 0.607, 0.628, 0.538, 0.494, 0.…
$ yds_att        <dbl> 6.9, 6.5, 6.3, 7.5, 6.1, 6.2, 6.3, 5.9, 6.3,…
$ pass_yds       <int> 3892, 2824, 3198, 4400, 3540, 2738, 3478, 33…
$ pass_td        <int> 18, 18, 21, 32, 19, 12, 16, 21, 23, 22, 22, …
$ int            <int> 21, 13, 21, 10, 30, 10, 24, 15, 29, 13, 15, …
$ pass_rating    <dbl> 77.0, 76.2, 75.5, 97.0, 61.8, 68.9, 67.4, 75…
$ first_downs    <int> 185, 144, 168, 211, 156, 128, 156, 182, 192,…
$ pass_first_pct <dbl> 0.330, 0.333, 0.331, 0.362, 0.270, 0.292, 0.…
$ pass_20plus    <int> 43, 38, 26, 61, 41, 37, 41, 46, 50, 46, 42, …
$ pass_40plus    <int> 10, 5, 7, 9, 9, 3, 11, 5, 8, 7, 8, 5, 13, 8,…
$ pass_long      <chr> "77T", "75", "71", "69T", "83T", "77T", "70T…
$ sacks          <int> 32, 38, 46, 25, 53, 43, 35, 45, 20, 28, 39, …
$ sack_yds       <int> 244, 241, 238, 161, 302, 220, 228, 262, 99, …

Pivot the data longer

Because we are looking to run all the models at once, we need to take the data from wide to long so that we can nest() the datasets by metric and run a model on each metric pair. By pivoting to long format we get our team-level data by season and metric, with the corresponding value for each season.


long_off_stats <- all_off %>% 
  select(team, pass_att:sacks, season, -pass_long, -pass_comp) %>%
  mutate(season2 = season + 1) %>% 
  pivot_longer(
    cols = c(-team, -season, -season2), 
    names_to = "metric", 
    values_to = "value")

long_off_stats

# A tibble: 7,512 x 5
   team     season season2 metric            value
   <chr>     <int>   <dbl> <chr>             <dbl>
 1 Redskins   2000    2001 pass_att        561    
 2 Redskins   2000    2001 pass_comp_pct     0.611
 3 Redskins   2000    2001 yds_att           6.9  
 4 Redskins   2000    2001 pass_yds       3892    
 5 Redskins   2000    2001 pass_td          18    
 6 Redskins   2000    2001 int              21    
 7 Redskins   2000    2001 pass_rating      77    
 8 Redskins   2000    2001 first_downs     185    
 9 Redskins   2000    2001 pass_first_pct    0.33 
10 Redskins   2000    2001 pass_20plus      43    
# … with 7,502 more rows

Join the data

Next we need to join the data back onto itself so that each value in a given year is matched with the value for the following year (year + 1). The join renames value on the left-hand side to value.x and on the right-hand side to value.y. Technically we don’t need season or season2 anymore, but I’ve kept them so we can do a quick visual check on the data. The numbers look good and are aligned properly!


join_years <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    select(everything(), -season2.y)

join_years

# A tibble: 7,116 x 6
   team     season season2 metric          value.x  value.y
   <chr>     <int>   <dbl> <chr>             <dbl>    <dbl>
 1 Redskins   2000    2001 pass_att        561      432    
 2 Redskins   2000    2001 pass_comp_pct     0.611    0.544
 3 Redskins   2000    2001 yds_att           6.9      6.3  
 4 Redskins   2000    2001 pass_yds       3892     2716    
 5 Redskins   2000    2001 pass_td          18       13    
 6 Redskins   2000    2001 int              21       13    
 7 Redskins   2000    2001 pass_rating      77       71.1  
 8 Redskins   2000    2001 first_downs     185      122    
 9 Redskins   2000    2001 pass_first_pct    0.33     0.282
10 Redskins   2000    2001 pass_20plus      43       31    
# … with 7,106 more rows
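
Beyond eyeballing the first rows, a quick programmatic check (a sketch, not in the original post) confirms that every row pairs a season with the following season, and a single team/metric can be spot-checked across years:


# Every row should pair season with season + 1
join_years %>%
  count(aligned = season2 == season + 1)

# Spot-check one team and metric across all seasons
join_years %>%
  filter(team == "Redskins", metric == "pass_yds")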

Nest the Data

We now need to nest the data, leaving metric out so that it serves as the group/label column. Now each metric is separated into its own respective nested dataset!


nest_off_data <- join_years %>% 
    nest(data = c(-metric))

nest_off_data 

# A tibble: 12 x 2
   metric         data              
   <chr>          <list>            
 1 pass_att       <tibble [593 × 5]>
 2 pass_comp_pct  <tibble [593 × 5]>
 3 yds_att        <tibble [593 × 5]>
 4 pass_yds       <tibble [593 × 5]>
 5 pass_td        <tibble [593 × 5]>
 6 int            <tibble [593 × 5]>
 7 pass_rating    <tibble [593 × 5]>
 8 first_downs    <tibble [593 × 5]>
 9 pass_first_pct <tibble [593 × 5]>
10 pass_20plus    <tibble [593 × 5]>
11 pass_40plus    <tibble [593 × 5]>
12 sacks          <tibble [593 × 5]>

Fit the models

Now let’s fit the models and summarize each one with broom::glance(). We now have the raw fit and the one-row glance() summary as list-columns! We’re really just interested in r.squared for this example, so in the next step we’ll pull that out of the nested data by metric.


tidy_off_models <- nest_off_data %>% 
    mutate(
      fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
      tidy_output = map(fit, glance)
    )

tidy_off_models

# A tibble: 12 x 4
   metric         data               fit    tidy_output      
   <chr>          <list>             <list> <list>           
 1 pass_att       <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 2 pass_comp_pct  <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 3 yds_att        <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 4 pass_yds       <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 5 pass_td        <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 6 int            <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 7 pass_rating    <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 8 first_downs    <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
 9 pass_first_pct <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
10 pass_20plus    <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
11 pass_40plus    <tibble [593 × 5]> <lm>   <tibble [1 × 12]>
12 sacks          <tibble [593 × 5]> <lm>   <tibble [1 × 12]>

Unnest the model metrics

Now we have a few options: we could use unnest_wider() to get ALL the model metrics, but that’s overkill for our example.


tidy_off_models %>% 
    unnest_wider(tidy_output)

# A tibble: 12 x 15
   metric data  fit   r.squared adj.r.squared   sigma statistic
   <chr>  <lis> <lis>     <dbl>         <dbl>   <dbl>     <dbl>
 1 pass_… <tib… <lm>     0.181         0.180  5.36e+1     131. 
 2 pass_… <tib… <lm>     0.290         0.289  3.59e-2     242. 
 3 yds_a… <tib… <lm>     0.176         0.175  6.65e-1     127. 
 4 pass_… <tib… <lm>     0.306         0.305  4.93e+2     261. 
 5 pass_… <tib… <lm>     0.180         0.178  6.63e+0     130. 
 6 int    <tib… <lm>     0.0623        0.0607 4.81e+0      39.3
 7 pass_… <tib… <lm>     0.233         0.232  1.04e+1     179. 
 8 first… <tib… <lm>     0.308         0.307  2.55e+1     263. 
 9 pass_… <tib… <lm>     0.225         0.224  3.31e-2     172. 
10 pass_… <tib… <lm>     0.133         0.131  9.92e+0      90.3
11 pass_… <tib… <lm>     0.0518        0.0502 3.51e+0      32.3
12 sacks  <tib… <lm>     0.121         0.119  1.01e+1      81.1
# … with 8 more variables: p.value <dbl>, df <dbl>, logLik <dbl>,
#   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

Instead we’ll use tidyr::hoist(), which pulls specific components out of a list-column. In this case, we extract just the r.squared value for each respective metric and then arrange by r.squared. Full details of unnest() vs hoist() can be found on the tidyr site.


off_lm_output <- tidy_off_models %>% 
    hoist(tidy_output, r.squared = "r.squared") %>% 
    arrange(desc(r.squared))

off_lm_output

# A tibble: 12 x 5
   metric         data               fit    r.squared tidy_output     
   <chr>          <list>             <list>     <dbl> <list>          
 1 first_downs    <tibble [593 × 5]> <lm>      0.308  <tibble [1 × 11…
 2 pass_yds       <tibble [593 × 5]> <lm>      0.306  <tibble [1 × 11…
 3 pass_comp_pct  <tibble [593 × 5]> <lm>      0.290  <tibble [1 × 11…
 4 pass_rating    <tibble [593 × 5]> <lm>      0.233  <tibble [1 × 11…
 5 pass_first_pct <tibble [593 × 5]> <lm>      0.225  <tibble [1 × 11…
 6 pass_att       <tibble [593 × 5]> <lm>      0.181  <tibble [1 × 11…
 7 pass_td        <tibble [593 × 5]> <lm>      0.180  <tibble [1 × 11…
 8 yds_att        <tibble [593 × 5]> <lm>      0.176  <tibble [1 × 11…
 9 pass_20plus    <tibble [593 × 5]> <lm>      0.133  <tibble [1 × 11…
10 sacks          <tibble [593 × 5]> <lm>      0.121  <tibble [1 × 11…
11 int            <tibble [593 × 5]> <lm>      0.0623 <tibble [1 × 11…
12 pass_40plus    <tibble [593 × 5]> <lm>      0.0518 <tibble [1 × 11…

Get just the good bits

So we just want the r.squared and metric values, plus a label we can use for ggplot down the road. Boom, we have the final output!


off_stability <- off_lm_output %>% 
  select(metric, r.squared) %>% 
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})"))

off_stability

# A tibble: 12 x 3
   metric         r.squared metric_label                
   <chr>              <dbl> <glue>                      
 1 first_downs       0.308  first_downs (R^2 = 0.308)   
 2 pass_yds          0.306  pass_yds (R^2 = 0.306)      
 3 pass_comp_pct     0.290  pass_comp_pct (R^2 = 0.29)  
 4 pass_rating       0.233  pass_rating (R^2 = 0.233)   
 5 pass_first_pct    0.225  pass_first_pct (R^2 = 0.225)
 6 pass_att          0.181  pass_att (R^2 = 0.181)      
 7 pass_td           0.180  pass_td (R^2 = 0.18)        
 8 yds_att           0.176  yds_att (R^2 = 0.176)       
 9 pass_20plus       0.133  pass_20plus (R^2 = 0.133)   
10 sacks             0.121  sacks (R^2 = 0.121)         
11 int               0.0623 int (R^2 = 0.062)           
12 pass_40plus       0.0518 pass_40plus (R^2 = 0.052)   
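
Since ggtext is loaded and the plot below renders the facet strips with element_textbox(), an optional tweak (a sketch, not part of the original output; the off_stability_html name is just for illustration) is to build the label with HTML so that R^2 shows up as a true superscript:


# Alternative label using HTML that ggtext can render as a superscript
off_stability_html <- off_lm_output %>%
  select(metric, r.squared) %>%
  mutate(metric_label = glue::glue("{metric} (R<sup>2</sup> = {round(r.squared, 3)})"))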

TLDR

Now that may have seemed like a lot of code, mainly because we broke the example down step by step. Let’s look at it all together now. We rearranged the dataset, nested it, ran 12 models, and got our outputs in one pipe with just a few lines of code!


(off_stability <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    select(everything(), -season2.y) %>% 
    nest(data = c(-metric)) %>% 
    mutate(
      fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
      glanced = map(fit, glance)
    ) %>% 
    hoist(glanced, r.squared = "r.squared") %>% 
    arrange(desc(r.squared)) %>% 
    mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})")))

Plot it

Now there’s another advantage to getting data into this longer format!

We can combine the labels we generated with glue and the long-form data to plot ALL of the raw metrics at once with one ggplot. Note that I have added a reference line in light grey (slope = 1, i.e. a metric that carries over perfectly from year to year) and a red line for the linear model fit of each dataset. That’s all for this example, but hopefully it opened your eyes to the nest-map-unnest workflow, even for relatively simple models!

I’d definitely recommend trying out the rest of the tidymodels ecosystem for your more advanced machine learning and statistical analyses. You can learn all about it at the tidymodels.org site.


(off_plot <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    mutate(metric = factor(metric,
                           levels = pull(off_stability, metric),
                           labels = pull(off_stability, metric_label))) %>% 
    ggplot(aes(x = value.x, y = value.y)) +
    geom_point(color = "black", alpha = 0.5) +
    geom_smooth(method = "lm", color = "#ff2b4f", se = FALSE) +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
    geom_abline(intercept = 0, slope = 1, color = "grey", size = 1, alpha = 0.5) +
    facet_wrap(~metric, scales = "free") +
    labs(x = "\nYear Y Metric", 
         y = "Year Y + 1 Metric\n",
         title = "Offense Stats - 2000-2019",
         subtitle = "Linear model fit comparing Year and Year + 1",
         caption = "Plot: @thomas_mock | Data: espnscrapeR")  +
    theme_minimal() +
    theme(strip.background = element_rect(fill = "black"),
          strip.text = element_textbox(face = "bold", color = "white")))

# optional output
# ggsave("off_stability_plot.png", off_plot, height = 8, width = 10, dpi = 150)
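
The tidymodels ecosystem mentioned above can slot into this same workflow. Here’s a minimal sketch (assuming the parsnip package is installed; not part of the original analysis) of the per-metric fits with a parsnip model specification standing in for the bare lm() call:


library(parsnip) # assumed available as part of tidymodels

# A model specification standing in for lm()
lm_spec <- linear_reg() %>%
  set_engine("lm")

parsnip_stability <- join_years %>%
  nest(data = c(-metric)) %>%
  mutate(
    model = map(data, ~ fit(lm_spec, value.y ~ value.x, data = .x)),
    # glance the underlying lm object stored inside each parsnip fit
    glanced = map(model, ~ glance(.x$fit))
  ) %>%
  hoist(glanced, r.squared = "r.squared") %>%
  arrange(desc(r.squared))

parsnip_stability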

Citation

For attribution, please cite this work as

Mock (2020, May 1). The Mockup Blog: Flipping tibbles for many models. Retrieved from https://themockup.blog/posts/2020-05-01-tidy-long-models/

BibTeX citation

@misc{mock2020flipping,
  author = {Mock, Thomas},
  title = {The Mockup Blog: Flipping tibbles for many models},
  url = {https://themockup.blog/posts/2020-05-01-tidy-long-models/},
  year = {2020}
}