Pivoting data from wide to long to run many models at once

The `tidymodels` package advocates for a nest-map-unnest workflow for running many models at once. While this is typically used with some type of grouping variable, as seen in the `tidymodels` docs, we can also use it to run many models at once from a wide dataset.

Our goal is to get all of the measures into a `long`-form tibble so that we can fit all of the models at once, and plot them all at once.

This basic example is borrowed directly from the `tidymodels` docs.

First you nest the data by a grouping variable to get list-columns of each split data/tibble.

```
mtcars <- as_tibble(mtcars) # to play nicely with list-cols

nest_mtcars <- mtcars %>%
  nest(data = c(-am))

nest_mtcars
```

```
# A tibble: 2 x 2
     am data              
  <dbl> <list>            
1     1 <tibble [13 × 10]>
2     0 <tibble [19 × 10]>
```

Now you can apply an `lm()` call with `purrr::map()` to each dataset, and then `broom::tidy()` the model output!

```
nest_mtcars %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)), # S3 list-col
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  select(-data, -fit)
```

```
# A tibble: 8 x 6
     am term        estimate std.error statistic  p.value
  <dbl> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1     1 (Intercept)   4.28      3.46      1.24   0.247   
2     1 mpg          -0.101     0.0294   -3.43   0.00750 
3     1 qsec          0.0398    0.151     0.264  0.798   
4     1 gear         -0.0229    0.349    -0.0656 0.949   
5     0 (Intercept)   4.92      1.40      3.52   0.00309 
6     0 mpg          -0.192     0.0443   -4.33   0.000591
7     0 qsec          0.0919    0.0983    0.935  0.365   
8     0 gear          0.147     0.368     0.398  0.696   
```

Now each of the model metrics for automatic vs manual transmissions (0 vs 1) is easy to work with! We’ll use a similar approach (nest-map-unnest) albeit with a slightly different data structure for our following analysis.
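Because the tidied coefficients now live in a regular tibble, comparing a single term across groups is just a `filter()` away. A minimal sketch (same model as above, with the `mpg_coefs` name being my own) pulling out the `mpg` coefficient for each transmission type:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mpg_coefs <- as_tibble(mtcars) %>%
  nest(data = c(-am)) %>%
  mutate(
    fit    = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  # keep just the mpg term for each transmission type
  filter(term == "mpg") %>%
  select(am, estimate, std.error)

mpg_coefs
```

One row per group, ready for plotting or a quick table.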

We’ll be performing a similar analysis to what Josh Hermsmeyer did in his 538 article. The raw code for his analysis is available on his GitHub. Essentially, he evaluated how stable metrics were year-over-year by comparing:

`Metric in Year N` vs `Metric in Year N + 1`
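The year-over-year pairing can be sketched on a toy tibble (the team name and `epa` values here are made up): each season's value gets matched to the same team's value one season later via a self-join.

```r
library(dplyr)

# Hypothetical per-season values for one team
toy <- tibble(
  team   = "A",
  season = 2000:2002,
  epa    = c(0.10, 0.05, 0.12)
)

paired <- toy %>%
  mutate(season2 = season + 1) %>%
  # match season N (epa.x) to season N + 1 (epa.y)
  inner_join(toy, by = c("season2" = "season", "team")) %>%
  select(team, season, epa.x, epa.y)

paired
```

Three seasons yield two year-over-year pairs; the final season has no "next year" to match, so it drops out of the inner join.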

We’re skipping most of the `nflscrapR` aggregation (see the link for the full script); the portion we can change a bit to make our lives easier is the repeated modeling.

Rather than having to generate a model for every metric one-by-one, we can generate the models for ALL the metrics in the dataset at once, while still regressing each metric only on the following year’s value of that same metric.

```
### QB Hits model ------------------------------------------------------------
qb_hits_model <- lm(data = team_defense_joined, qb_hit.y ~ qb_hit.x)

qb_hits_stability <- glance(qb_hits_model) %>%
  mutate(metric = "QB Hits",
         r.squared = round(r.squared, 3)) %>%
  select(metric, r.squared)
```

So let’s load our libraries and get to modeling!

```
library(tidyverse)   # all the things
library(espnscrapeR) # NFL data summaries
library(broom)       # tidy modeling
library(glue)        # nice string creation
```

First we’ll get all the data from the 2000-2019 seasons for NFL offenses via `espnscrapeR::scrape_team_stats_nfl()`.

```
# Get data from espnscrapeR
all_off <- 2000:2019 %>%
  map_dfr(scrape_team_stats_nfl)

glimpse(all_off)
```

```
Rows: 638
Columns: 25
$ rank            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ team            <chr> "St. Louis Rams", "Denver Broncos", "Indian…
$ games           <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,…
$ pts_game        <dbl> 33.8, 30.3, 26.8, 24.2, 24.8, 29.9, 22.9, 2…
$ pts_total       <int> 540, 485, 429, 388, 397, 479, 367, 355, 315…
$ plays_scrimmage <dbl> 1014, 1115, 1026, 1024, 958, 1023, 1080, 99…
$ yds_game        <dbl> 442.2, 409.6, 383.8, 377.5, 372.6, 361.0, 3…
$ yds_play        <dbl> 7.0, 5.9, 6.0, 5.9, 6.2, 5.6, 5.3, 5.6, 5.1…
$ first_down_g    <dbl> 23.8, 23.9, 22.3, 20.9, 19.9, 21.1, 19.9, 2…
$ third_conv      <int> 86, 97, 94, 84, 86, 89, 100, 75, 92, 97, 84…
$ third_att       <int> 181, 218, 201, 202, 188, 206, 235, 204, 247…
$ third_pct       <dbl> 0.48, 0.44, 0.47, 0.42, 0.46, 0.43, 0.43, 0…
$ fourth_conv     <int> 8, 9, 9, 10, 8, 3, 5, 4, 9, 10, 4, 11, 7, 4…
$ fourth_att      <int> 13, 17, 10, 20, 13, 8, 14, 13, 12, 17, 10, …
$ fourth_pct      <dbl> 0.62, 0.53, 0.90, 0.50, 0.62, 0.38, 0.36, 0…
$ penalty         <int> 111, 89, 89, 134, 106, 118, 95, 118, 101, 1…
$ penalty_yds     <dbl> 942, 792, 866, 1135, 908, 940, 703, 848, 91…
$ time_of_poss    <dbl> 30.90000, 33.25000, 29.55000, 29.95000, 29.…
$ fumbles_total   <int> 24, 26, 20, 22, 26, 18, 27, 23, 28, 26, 22,…
$ fumbles_lost    <int> 12, 13, 14, 9, 11, 9, 14, 11, 13, 11, 12, 1…
$ turnover_ratio  <int> -10, 19, -7, 2, -11, 17, 1, 3, 6, 9, 0, -5,…
$ stat            <chr> "GAME_STATS", "GAME_STATS", "GAME_STATS", "…
$ season          <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
$ season_type     <chr> "REG", "REG", "REG", "REG", "REG", "REG", "…
$ role            <chr> "offense", "offense", "offense", "offense",…
```

Because we are looking to run all the models at once, we’ll need to take the data structure from wide to longer, so we can `nest()` the datasets by metric and run the model on each metric pair. By `pivot`ing to long format we get our team-level data by season and metric, with the corresponding value for each season.

```
long_off_stats <- all_off %>%
  select(team, pts_game:third_pct, season) %>%
  mutate(season2 = season + 1) %>%
  pivot_longer(
    cols = c(-team, -season, -season2),
    names_to = "metric",
    values_to = "value"
  )

long_off_stats
```

```
# A tibble: 5,742 x 5
   team           season season2 metric            value
   <chr>           <int>   <dbl> <chr>             <dbl>
 1 St. Louis Rams   2000    2001 pts_game          33.8 
 2 St. Louis Rams   2000    2001 pts_total        540   
 3 St. Louis Rams   2000    2001 plays_scrimmage 1014   
 4 St. Louis Rams   2000    2001 yds_game         442.  
 5 St. Louis Rams   2000    2001 yds_play           7   
 6 St. Louis Rams   2000    2001 first_down_g      23.8 
 7 St. Louis Rams   2000    2001 third_conv        86   
 8 St. Louis Rams   2000    2001 third_att        181   
 9 St. Louis Rams   2000    2001 third_pct          0.48
10 Denver Broncos   2000    2001 pts_game          30.3 
# … with 5,732 more rows
```

Next we need to join the data back onto itself to match value 1 (year) with value 2 (year + 1). The join renames `value` on the left-hand side (`value.x`) and the right-hand side (`value.y`). Technically we don’t need `season` or `season2` anymore, but I’ve kept them so we can do a quick visual check on the data. The numbers look good and are aligned properly!

```
join_years <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  select(-season2.y)

join_years
```

```
# A tibble: 5,436 x 6
   team           season season2 metric          value.x value.y
   <chr>           <int>   <dbl> <chr>             <dbl>   <dbl>
 1 St. Louis Rams   2000    2001 pts_game          33.8    31.4 
 2 St. Louis Rams   2000    2001 pts_total        540     503   
 3 St. Louis Rams   2000    2001 plays_scrimmage 1014    1007   
 4 St. Louis Rams   2000    2001 yds_game         442.    418.  
 5 St. Louis Rams   2000    2001 yds_play           7       6.6 
 6 St. Louis Rams   2000    2001 first_down_g      23.8    22.3 
 7 St. Louis Rams   2000    2001 third_conv        86      96   
 8 St. Louis Rams   2000    2001 third_att        181     192   
 9 St. Louis Rams   2000    2001 third_pct          0.48    0.5 
10 Denver Broncos   2000    2001 pts_game          30.3    21.2 
# … with 5,426 more rows
```

We now need to nest the data, leaving `metric` out so that it is used as the group/label column. Now each of the metrics is separated into its own respective nested dataset!

```
nest_off_data <- join_years %>%
  nest(data = c(-metric))

nest_off_data
```

```
# A tibble: 9 x 2
  metric          data              
  <chr>           <list>            
1 pts_game        <tibble [604 × 5]>
2 pts_total       <tibble [604 × 5]>
3 plays_scrimmage <tibble [604 × 5]>
4 yds_game        <tibble [604 × 5]>
5 yds_play        <tibble [604 × 5]>
6 first_down_g    <tibble [604 × 5]>
7 third_conv      <tibble [604 × 5]>
8 third_att       <tibble [604 × 5]>
9 third_pct       <tibble [604 × 5]>
```

Now let’s `fit` the models and tidy the outputs with `broom::glance()`. We now have the raw fit and the tidy output as list-column tibbles! We’re really just interested in `r.squared` for this example, so we’ll `unnest()` the data in the next step to get that out by metric.

```
tidy_off_models <- nest_off_data %>%
  mutate(
    fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
    tidy_output = map(fit, glance)
  )

tidy_off_models
```

```
# A tibble: 9 x 4
  metric          data               fit    tidy_output      
  <chr>           <list>             <list> <list>           
1 pts_game        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
2 pts_total       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
3 plays_scrimmage <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
4 yds_game        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
5 yds_play        <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
6 first_down_g    <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
7 third_conv      <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
8 third_att       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
9 third_pct       <tibble [604 × 5]> <lm>   <tibble [1 × 11]>
```

Now we have a few options: we could use `unnest_wider()` to get ALL the model metrics, but again that's overkill for our example.

```
tidy_off_models %>%
  unnest_wider(tidy_output)
```

```
# A tibble: 9 x 14
  metric data  fit   r.squared adj.r.squared  sigma statistic
  <chr>  <lis> <lis>     <dbl>         <dbl>  <dbl>     <dbl>
1 pts_g… <tib… <lm>     0.185         0.184   3.98      137. 
2 pts_t… <tib… <lm>     0.185         0.184  63.6       137. 
3 plays… <tib… <lm>     0.160         0.159  42.8       115. 
4 yds_g… <tib… <lm>     0.253         0.252  33.5       204. 
5 yds_p… <tib… <lm>     0.192         0.191   0.456     143. 
6 first… <tib… <lm>     0.303         0.302   1.93      261. 
7 third… <tib… <lm>     0.0808        0.0792 11.1        52.9
8 third… <tib… <lm>     0.0613        0.0598 13.1        39.3
9 third… <tib… <lm>     0.179         0.177   0.0460    131. 
# … with 7 more variables: p.value <dbl>, df <int>, logLik <dbl>,
#   AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
```

Instead we’ll use `tidyr::hoist()`, which pulls specific columns out of nested data. In this case, we are extracting just the `r.squared` value for each respective metric and then arranging by `r.squared`. Full details of `unnest` vs `hoist` can be found on the `tidyr` site.

```
off_lm_output <- tidy_off_models %>%
  hoist(tidy_output, r.squared = "r.squared") %>%
  arrange(desc(r.squared))

off_lm_output
```

```
# A tibble: 9 x 5
  metric          data               fit    r.squared tidy_output     
  <chr>           <list>             <list>     <dbl> <list>          
1 first_down_g    <tibble [604 × 5]> <lm>      0.303  <tibble [1 × 10…
2 yds_game        <tibble [604 × 5]> <lm>      0.253  <tibble [1 × 10…
3 yds_play        <tibble [604 × 5]> <lm>      0.192  <tibble [1 × 10…
4 pts_game        <tibble [604 × 5]> <lm>      0.185  <tibble [1 × 10…
5 pts_total       <tibble [604 × 5]> <lm>      0.185  <tibble [1 × 10…
6 third_pct       <tibble [604 × 5]> <lm>      0.179  <tibble [1 × 10…
7 plays_scrimmage <tibble [604 × 5]> <lm>      0.160  <tibble [1 × 10…
8 third_conv      <tibble [604 × 5]> <lm>      0.0808 <tibble [1 × 10…
9 third_att       <tibble [604 × 5]> <lm>      0.0613 <tibble [1 × 10…
```

So we just want the `r.squared` and metric values, plus a label we can use for `ggplot` down the road. Boom, we have the final output!

```
off_stability <- off_lm_output %>%
  select(metric, r.squared) %>%
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})"))

off_stability
```

```
# A tibble: 9 x 3
  metric          r.squared metric_label                
  <chr>               <dbl> <glue>                      
1 first_down_g       0.303  first_down_g (R^2 = 0.303)  
2 yds_game           0.253  yds_game (R^2 = 0.253)      
3 yds_play           0.192  yds_play (R^2 = 0.192)      
4 pts_game           0.185  pts_game (R^2 = 0.185)      
5 pts_total          0.185  pts_total (R^2 = 0.185)     
6 third_pct          0.179  third_pct (R^2 = 0.179)     
7 plays_scrimmage    0.160  plays_scrimmage (R^2 = 0.16)
8 third_conv         0.0808 third_conv (R^2 = 0.081)    
9 third_att          0.0613 third_att (R^2 = 0.061)     
```

Now that may have seemed like a lot of code, mainly because we broke the example down step-by-step. Let’s look at it all together now. We rearranged the dataset, nested it, ran 9 models, and got our outputs in one pipe with just a few lines of code!

```
(off_stability <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  select(-season2.y) %>%
  nest(data = c(-metric)) %>%
  mutate(
    fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
    glanced = map(fit, glance)
  ) %>%
  hoist(glanced, r.squared = "r.squared") %>%
  arrange(desc(r.squared)) %>%
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})")))
```

Now there’s another advantage to getting the data into this longer format! We can combine the labels we generated with `glue` and the long-form data to plot ALL of the raw metrics at once with one `ggplot`. Note I have added a reference line in light grey (slope = 1, a perfect year-over-year fit) and a red line for the `lm` fit of each dataset. That’s all for this example, but hopefully it opened your eyes to a nest-map-unnest workflow even for relatively simple models!

I’d definitely recommend trying out the rest of the `tidymodels` ecosystem for your more advanced machine learning and statistical analyses. You can learn all about it at the tidymodels.org site.

```
(off_plot <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  mutate(metric = factor(metric,
                         levels = pull(off_stability, metric),
                         labels = pull(off_stability, metric_label))) %>%
  ggplot(aes(x = value.x, y = value.y)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_smooth(method = "lm", color = "#ff2b4f", se = FALSE) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
  geom_abline(intercept = 0, slope = 1, color = "grey", size = 1, alpha = 0.5) +
  facet_wrap(~metric, scales = "free") +
  labs(
    x = "\nYear Y Metric",
    y = "Year Y + 1 Metric\n",
    title = "Offense Stats - 2000-2019",
    subtitle = "Linear model fit comparing Year and Year + 1",
    caption = "Plot: @thomas_mock | Data: espnscrapeR"
  ) +
  theme_minimal() +
  theme(
    strip.background = element_rect(fill = "black"),
    strip.text = element_text(face = "bold", color = "white")
  ))
```