Flipping tibbles for many models

Pivoting data from wide to long to run many models at once

Thomas Mock
05-01-2020

Table of Contents


Many Models

The tidymodels package advocates for a nest-map-unnest workflow for running many models at once. While this typically is used for some type of group as seen in the tidymodels docs, we can also do it for running many models at once from a wide dataset.

Our goal is to get all of the measures into a long form tibble so that we can fit all of the models at once, and plot it all at once.

Basic Example

This basic example is borrowed directly from the tidymodels docs.

First you nest the data by a grouping variable to get list-columns of each split data/tibble.


mtcars <- as_tibble(mtcars)  # to play nicely with list-cols

nest_mtcars <- mtcars %>%
  nest(data = c(-am)) 

nest_mtcars

# A tibble: 2 x 2
     am data              
  <dbl> <list>            
1     1 <tibble [13 × 10]>
2     0 <tibble [19 × 10]>

Now you can apply a lm() call w/ purrr::map() to each dataset, and then broom::tidy() the model output!


nest_mtcars %>% 
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),  # S3 list-col
    tidied = map(fit, tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(-data, -fit)

# A tibble: 8 x 6
     am term        estimate std.error statistic  p.value
  <dbl> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1     1 (Intercept)   4.28      3.46      1.24   0.247   
2     1 mpg          -0.101     0.0294   -3.43   0.00750 
3     1 qsec          0.0398    0.151     0.264  0.798   
4     1 gear         -0.0229    0.349    -0.0656 0.949   
5     0 (Intercept)   4.92      1.40      3.52   0.00309 
6     0 mpg          -0.192     0.0443   -4.33   0.000591
7     0 qsec          0.0919    0.0983    0.935  0.365   
8     0 gear          0.147     0.368     0.398  0.696   

Now each of the model metrics for automatic vs manual transmissions (0 vs 1) is easy to work with! We’ll use a similar approach (nest-map-unnest) albeit with a slightly different data structure for our following analysis.

NFL Example

We’ll be performing a similar analysis to what Josh Hermsmeyer’s did in his 538 article. The raw code for his analysis is available on his GitHub. Essentially he evaluated how stable metrics were year-over-year, by comparing:

We’re skipping most of the nflscrapR aggregation (see the link for full script), the portion we can change a bit to make our lives easier is the repeated modeling.

Rather than having to generate a model for every metric one-by-one, we can generate the models for ALL the metrics in the datase at once, while still running the model just for each metric by the following year’s metric.


### QB Hits model  ------------------------------------------------------------

qb_hits_model <- lm(data = team_defense_joined, qb_hit.y ~ qb_hit.x)

qb_hits_stability <- glance(qb_hits_model) %>%
   mutate(metric = "QB Hits",
          r.squared = round(r.squared, 3)) %>%
   select(metric, r.squared)

Get the data

So let’s load our libraries and get to modeling!


library(tidyverse) # all the things
library(espnscrapeR) # NFL data summaries
library(broom) # tidy modeling
library(glue) # nice string creation

First we’ll get all the data from the 2000-2019 seasons for NFL offenses via espnscrapeR::scrape_team_stats_nfl().


# Get data from espnscrapeR
all_off <- 2000:2019 %>% 
  map_dfr(scrape_team_stats_nfl)

glimpse(all_off)

Rows: 638
Columns: 25
$ rank            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ team            <chr> "St. Louis Rams", "Denver Broncos", "Indian…
$ games           <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,…
$ pts_game        <dbl> 33.8, 30.3, 26.8, 24.2, 24.8, 29.9, 22.9, 2…
$ pts_total       <int> 540, 485, 429, 388, 397, 479, 367, 355, 315…
$ plays_scrimmage <dbl> 1014, 1115, 1026, 1024, 958, 1023, 1080, 99…
$ yds_game        <dbl> 442.2, 409.6, 383.8, 377.5, 372.6, 361.0, 3…
$ yds_play        <dbl> 7.0, 5.9, 6.0, 5.9, 6.2, 5.6, 5.3, 5.6, 5.1…
$ first_down_g    <dbl> 23.8, 23.9, 22.3, 20.9, 19.9, 21.1, 19.9, 2…
$ third_conv      <int> 86, 97, 94, 84, 86, 89, 100, 75, 92, 97, 84…
$ third_att       <int> 181, 218, 201, 202, 188, 206, 235, 204, 247…
$ third_pct       <dbl> 0.48, 0.44, 0.47, 0.42, 0.46, 0.43, 0.43, 0…
$ fourth_conv     <int> 8, 9, 9, 10, 8, 3, 5, 4, 9, 10, 4, 11, 7, 4…
$ fourth_att      <int> 13, 17, 10, 20, 13, 8, 14, 13, 12, 17, 10, …
$ fourth_pct      <dbl> 0.62, 0.53, 0.90, 0.50, 0.62, 0.38, 0.36, 0…
$ penalty         <int> 111, 89, 89, 134, 106, 118, 95, 118, 101, 1…
$ penalty_yds     <dbl> 942, 792, 866, 1135, 908, 940, 703, 848, 91…
$ time_of_poss    <dbl> 30.90000, 33.25000, 29.55000, 29.95000, 29.…
$ fumbles_total   <int> 24, 26, 20, 22, 26, 18, 27, 23, 28, 26, 22,…
$ fumbles_lost    <int> 12, 13, 14, 9, 11, 9, 14, 11, 13, 11, 12, 1…
$ turnover_ratio  <int> -10, 19, -7, 2, -11, 17, 1, 3, 6, 9, 0, -5,…
$ stat            <chr> "GAME_STATS", "GAME_STATS", "GAME_STATS", "…
$ season          <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
$ season_type     <chr> "REG", "REG", "REG", "REG", "REG", "REG", "…
$ role            <chr> "offense", "offense", "offense", "offense",…

Pivot the data longer

Because we are looking to run all the models at once, we’ll need to take the data structure from wide to longer, so we can nest() the datasets by metric and run the model on each metric pair. By pivoting to long format we get our team-level data by season and metric with the corresponding value of each season.


long_off_stats <- all_off %>% 
  select(team, pts_game:third_pct, season) %>% 
  mutate(season2 = season + 1) %>% 
  pivot_longer(
    cols = c(-team, -season, -season2), 
    names_to = "metric", 
    values_to = "value")

long_off_stats

# A tibble: 5,742 x 5
   team           season season2 metric            value
   <chr>           <int>   <dbl> <chr>             <dbl>
 1 St. Louis Rams   2000    2001 pts_game          33.8 
 2 St. Louis Rams   2000    2001 pts_total        540   
 3 St. Louis Rams   2000    2001 plays_scrimmage 1014   
 4 St. Louis Rams   2000    2001 yds_game         442.  
 5 St. Louis Rams   2000    2001 yds_play           7   
 6 St. Louis Rams   2000    2001 first_down_g      23.8 
 7 St. Louis Rams   2000    2001 third_conv        86   
 8 St. Louis Rams   2000    2001 third_att        181   
 9 St. Louis Rams   2000    2001 third_pct          0.48
10 Denver Broncos   2000    2001 pts_game          30.3 
# … with 5,732 more rows

Join the data

Next we need to join the data back into itself to get the matched value 1 (year) with value 2 (year + 1). The join renames value on the left-hand side (value.x) and the right-hand side (value.y). Technically we don’t need season or season 2 anymore, but I’ve kept them so we can do a quick visual check on the data. The numbers look good and are aligned properly!


join_years <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    select(everything(), -season2.y)

join_years

# A tibble: 5,436 x 6
   team           season season2 metric          value.x value.y
   <chr>           <int>   <dbl> <chr>             <dbl>   <dbl>
 1 St. Louis Rams   2000    2001 pts_game          33.8     31.4
 2 St. Louis Rams   2000    2001 pts_total        540      503  
 3 St. Louis Rams   2000    2001 plays_scrimmage 1014     1007  
 4 St. Louis Rams   2000    2001 yds_game         442.     418. 
 5 St. Louis Rams   2000    2001 yds_play           7        6.6
 6 St. Louis Rams   2000    2001 first_down_g      23.8     22.3
 7 St. Louis Rams   2000    2001 third_conv        86       96  
 8 St. Louis Rams   2000    2001 third_att        181      192  
 9 St. Louis Rams   2000    2001 third_pct          0.48     0.5
10 Denver Broncos   2000    2001 pts_game          30.3     21.2
# … with 5,426 more rows

Nest the Data

We now need to nest the data, leaving metric out so that it is used as the group/label data. Now each of the metrics are separated into their own respective nested datasets!


nest_off_data <- join_years %>% 
    nest(data = c(-metric))

nest_off_data 

# A tibble: 9 x 2
  metric          data              
  <chr>           <list>            
1 pts_game        <tibble [604 × 5]>
2 pts_total       <tibble [604 × 5]>
3 plays_scrimmage <tibble [604 × 5]>
4 yds_game        <tibble [604 × 5]>
5 yds_play        <tibble [604 × 5]>
6 first_down_g    <tibble [604 × 5]>
7 third_conv      <tibble [604 × 5]>
8 third_att       <tibble [604 × 5]>
9 third_pct       <tibble [604 × 5]>

Fit the models

Now let’s fit the models and tidy the outputs with broom::glance(). We now have the raw fit and the tidy output as list-column tibbles! We’re really just interested in r.squared for this example, so we’ll unnest() the data in the next step to get that out by metric.


tidy_off_models <- nest_off_data %>% 
    mutate(
      fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
      tidy_output = map(fit, glance)
    )

tidy_off_models

# A tibble: 9 x 4
  metric          data               fit    tidy_output      
  <chr>           <list>             <list> <list>           
1 pts_game        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
2 pts_total       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
3 plays_scrimmage <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
4 yds_game        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
5 yds_play        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
6 first_down_g    <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
7 third_conv      <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
8 third_att       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
9 third_pct       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>

Unnest the model metrics

Now we have a few options - we can use unnest_wider() to get ALL the model metrics, but again that’s overkill for our example.


tidy_off_models %>% 
    unnest_wider(tidy_output)

# A tibble: 9 x 14
  metric data  fit   r.squared adj.r.squared   sigma statistic
  <chr>  <lis> <lis>     <dbl>         <dbl>   <dbl>     <dbl>
1 pts_g… <tib… <lm>     0.185         0.184   3.98       137. 
2 pts_t… <tib… <lm>     0.185         0.184  63.6        137. 
3 plays… <tib… <lm>     0.160         0.159  42.8        115. 
4 yds_g… <tib… <lm>     0.253         0.252  33.5        204. 
5 yds_p… <tib… <lm>     0.192         0.191   0.456      143. 
6 first… <tib… <lm>     0.303         0.302   1.93       261. 
7 third… <tib… <lm>     0.0808        0.0792 11.1         52.9
8 third… <tib… <lm>     0.0613        0.0598 13.1         39.3
9 third… <tib… <lm>     0.179         0.177   0.0460     131. 
# … with 7 more variables: p.value <dbl>, df <int>, logLik <dbl>,
#   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>

Instead we’ll use tidyr::hoist() which pulls specific columns from nested data. In this case, we are extracting just the r.squared values for each respective metric and then arranging by r.squared. Full details of unnest vs hoist can be found at tidyr site.


off_lm_output <- tidy_off_models %>% 
    hoist(tidy_output, r.squared = "r.squared") %>% 
    arrange(desc(r.squared))

off_lm_output

# A tibble: 9 x 5
  metric          data               fit    r.squared tidy_output     
  <chr>           <list>             <list>     <dbl> <list>          
1 first_down_g    <tibble [604 × 5]> <lm>      0.303  <tibble [1 × 10…
2 yds_game        <tibble [604 × 5]> <lm>      0.253  <tibble [1 × 10…
3 yds_play        <tibble [604 × 5]> <lm>      0.192  <tibble [1 × 10…
4 pts_game        <tibble [604 × 5]> <lm>      0.185  <tibble [1 × 10…
5 pts_total       <tibble [604 × 5]> <lm>      0.185  <tibble [1 × 10…
6 third_pct       <tibble [604 × 5]> <lm>      0.179  <tibble [1 × 10…
7 plays_scrimmage <tibble [604 × 5]> <lm>      0.160  <tibble [1 × 10…
8 third_conv      <tibble [604 × 5]> <lm>      0.0808 <tibble [1 × 10…
9 third_att       <tibble [604 × 5]> <lm>      0.0613 <tibble [1 × 10…

Get just the good bits

So we just want the r.squared and metric values, plus a label we can use for ggplot down the road. Boom we have the final output!


off_stability <- off_lm_output %>% 
  select(metric, r.squared) %>% 
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})"))

off_stability

# A tibble: 9 x 3
  metric          r.squared metric_label                
  <chr>               <dbl> <glue>                      
1 first_down_g       0.303  first_down_g (R^2 = 0.303)  
2 yds_game           0.253  yds_game (R^2 = 0.253)      
3 yds_play           0.192  yds_play (R^2 = 0.192)      
4 pts_game           0.185  pts_game (R^2 = 0.185)      
5 pts_total          0.185  pts_total (R^2 = 0.185)     
6 third_pct          0.179  third_pct (R^2 = 0.179)     
7 plays_scrimmage    0.160  plays_scrimmage (R^2 = 0.16)
8 third_conv         0.0808 third_conv (R^2 = 0.081)    
9 third_att          0.0613 third_att (R^2 = 0.061)     

TLDR

Now that may have seemed like a lot of code, mainly because we broke down an example. Let’s look at it all together now. We rearranged the dataset, nested, ran 9 models, and got our outputs in one pipe with just a few lines of code!


(off_stability <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    select(everything(), -season2.y) %>% 
    nest(data = c(-metric)) %>% 
    mutate(
      fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
      glanced = map(fit, glance)
    ) %>% 
    hoist(glanced, r.squared = "r.squared") %>% 
    arrange(desc(r.squared)) %>% 
    mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})")))

Plot it

Now there’s another advantage to getting data into this longer format!

We can combine our labels that we generated with glue and the long-form data to plot ALL of the raw metrics at once with one ggplot. Note I have added an example line in light grey (slope = 1 for perfect fit) and a red-line for the lm for each dataset. That’s all for this example, but hopefully that opened your eyes to a nest-map-unnest workflow even for relatively simple models!

I’d definitely recommend trying out the rest of the tidymodels ecosystem for your more advanced machine learning and statistical analyses. You can learn all about it at the tidymodels.org site.


(off_plot <- long_off_stats %>% 
    inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>% 
    mutate(metric = factor(metric,
                           levels = pull(off_stability, metric),
                           labels = pull(off_stability, metric_label))) %>% 
    ggplot(aes(x = value.x, y = value.y)) +
    geom_point(color = "black", alpha = 0.5) +
    geom_smooth(method = "lm", color = "#ff2b4f", se = FALSE) +
   scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
   scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
   geom_abline(intercept = 0, slope = 1, color = "grey", size = 1, alpha = 0.5) +
    facet_wrap(~metric, scales = "free") +
    labs(x = "\nYear Y Metric", 
         y = "Year Y + 1 Metric\n",
         title = "Offense Stats - 2000-2019",
         subtitle = "Linear model fit comparing Year and Year + 1",
         caption = "Plot: @thomas_mock | Data: espnscrapeR")  +
    theme_minimal() +
    theme(strip.background = element_rect(fill = "black"),
          strip.text = element_text(face = "bold", color = "white")))