Heatmaps in ggplot2

data visualization ggplot2 tidyverse NFL

It’s more than just a passing fad.

Thomas Mock
08-28-2020

Many thanks to Ethan Douglas for sharing his heatmap Python code on Open Source Football! 1 This is a similar walkthrough to Ethan's post, but in R + ggplot2. Additionally, credit for both collecting the data and the original plot goes to Ethan.

Please also note that the original code adapted from Ethan came from Sarah Mallepalle et al., 2019. 2 Many thanks to Sarah, Ron, Konstantinos, and Sam for sharing this data collection tool!

Load Everything


library(tidyverse) # all the things
library(ggExtra)   # marginal plots
library(ggtext)    # color your text
library(patchwork) # combine multiple plots
library(paletteer) # get all the color palettes
library(scales)    # helper functions from ggplot2

source_url <- "https://raw.githubusercontent.com/ArrowheadAnalytics/next-gen-scrapy-2.0/master/pass_and_game_data.csv"

pass_map_df <- read_csv(source_url) %>%
  na.omit() %>%
  select(-X1)

glimpse(pass_map_df)

Rows: 49,758
Columns: 23
$ game_id     <dbl> 2017091004, 2017091004, 2017091004, 2017091004,…
$ team        <chr> "ARI", "ARI", "ARI", "ARI", "ARI", "ARI", "ARI"…
$ week.x      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ name        <chr> "Carson Palmer", "Carson Palmer", "Carson Palme…
$ pass_type   <chr> "COMPLETE", "COMPLETE", "COMPLETE", "COMPLETE",…
$ x_coord     <dbl> 13.5, -13.7, 2.2, 23.9, -23.5, -4.4, 8.0, -19.0…
$ y_coord     <dbl> -2.8, -4.4, 7.1, 10.0, 14.6, 5.3, 5.2, 18.0, 8.…
$ game_id.y   <chr> "2017_01_ARI_DET", "2017_01_ARI_DET", "2017_01_…
$ season      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
$ game_type   <chr> "REG", "REG", "REG", "REG", "REG", "REG", "REG"…
$ week.y      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ gameday     <date> 2017-09-10, 2017-09-10, 2017-09-10, 2017-09-10…
$ weekday     <chr> "Sunday", "Sunday", "Sunday", "Sunday", "Sunday…
$ gametime    <time> 13:00:00, 13:00:00, 13:00:00, 13:00:00, 13:00:…
$ away_team   <chr> "ARI", "ARI", "ARI", "ARI", "ARI", "ARI", "ARI"…
$ home_team   <chr> "DET", "DET", "DET", "DET", "DET", "DET", "DET"…
$ away_score  <dbl> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,…
$ home_score  <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35,…
$ home_result <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
$ stadium     <chr> "Ford Field", "Ford Field", "Ford Field", "Ford…
$ location    <chr> "Home", "Home", "Home", "Home", "Home", "Home",…
$ roof        <chr> "dome", "dome", "dome", "dome", "dome", "dome",…
$ surface     <chr> "fieldturf", "fieldturf", "fieldturf", "fieldtu…

Why not just plot the points?

Why even mess around with heatmaps or 2D density plots? Because if we graph just the raw points without thinking about, or adjusting, any of the aesthetics, we run the risk of overplotting.

Hadley Wickham in ggplot2: Elegant Graphics for Data Analysis:

When the data is large, points will be often plotted on top of each other, obscuring the true relationship. In extreme cases, you will only be able to see the extent of the data, and any conclusions drawn from the graphic will be suspect. This problem is called overplotting.

Hadley uses an example with just 2,000 data points, which already has a lot of overplotting because the points are packed into a small x-y space. We have nearly 50,000 passes in this dataset, spread out over a relatively large space, but we still need to be careful of overplotting!

Claus Wilke | Fundamentals of Data Visualization 3

While it's not really focused on the problem at hand, Claus Wilke has a wonderful book, Fundamentals of Data Visualization. Several of his chapters are applicable and very helpful for best practices.

The main idea here is partially overlapping density plots, to see where differences arise. Code adapted from Claus Wilke's source code. 4


pass_map_df %>% 
  filter(str_detect(name, "Derek Carr|Mahomes")) %>% 
  mutate(name = factor(name)) %>% 
  ggplot() +
  geom_density(aes(x = y_coord, y = ..count.., fill = name, color = name),
               alpha = 0.5) +
  geom_vline(xintercept = 5, color = "red", size = 1, alpha = 0.5) +
  coord_cartesian(xlim = c(-10, 40)) +
  scale_x_continuous(breaks = seq(-10, 40, 5), name = "Yardline") +
  scale_y_continuous(breaks = seq(0, 100, 10), name = "scaled density") +
  scale_fill_manual(values = c("#0072B2", "#D55E00"), name = "QB") +
  scale_color_manual(values = scales::muted(c("#0072B2", "#D55E00")), name = "QB") +
  guides(fill = guide_legend(override.aes = list(linetype = 0))) +
  theme_minimal() +
  theme(
    axis.line.x = element_blank(),
    legend.position = c(.9, .87),
    legend.justification = c("right", "top"),
    legend.box.background = element_rect(fill = "white", color = "white"),
    plot.margin = margin(3, 7, 3, 1.5)
  )

The main idea here is overlapping many distributions at once, to see general trends or differences.

ggridges has a vignette on how to use it, but here's a minimal example with the data at hand. Note that we are doing a 1-dimensional density, so we could use this with nflfastR data as well!


library(ggridges)

pass_map_df %>% 
  filter(str_detect(name, "Wilson|Mahomes|Prescott|Watson|Rodgers")) %>% 
  mutate(name = word(name, 2)) %>% 
  ggplot(aes(x = y_coord, group = name, y = name)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2, scale = 2) +
  theme_ridges() +
  scale_x_continuous(breaks = seq(-10, 60, 10), name = "Yardline") +
  labs(y = "")

We can create some bins and then plot the data for Mahomes. Here the data is relatively small, but the points still stack on top of each other.


bin_data <- pass_map_df %>% 
  filter(str_detect(name, "Mahomes"), between(y_coord, -5, 20)) %>% 
  # create 5 yard bins on y axis
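  # e.g. a y_coord of 7.1 falls in the 5-yard bin: 7.1 - (7.1 %% 5) = 5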
  mutate(y_bins = y_coord - y_coord %% 5) 

bin_data %>% 
  ggplot(aes(x = x_coord, y = y_bins)) +
  geom_point() +
  scale_y_continuous(breaks = seq(-5, 20, 5))

We can add transparency with alpha along with geom_jitter() to add some random noise in the y-direction via height. Note that this is kind of a silly example since we already have the exact x and y coordinates, but you can imagine a situation where you only have binned data, or, like nflfastR, only large bins (left, middle, right and short, long).


bin_data %>% 
  ggplot(aes(x = x_coord, y = y_bins)) +
  geom_jitter(height = 0.3, alpha = 0.5) +
  scale_y_continuous(breaks = seq(-5, 20, 5))

Claus then goes on to cover details of geom_bin2d(), geom_hex(), and geom_contour(). Those chapters are still worth reading, but they cover material we address in the bulk of this post. Again, Claus' book is amazing, and a great resource regardless of what software you are plotting in. He used R, but intentionally left out code and focused on concepts.

Fear the beard

We can display our points with a quick geom_point() call, but this results in a lot of overplotting, albeit in a happy beard shape!


pass_map_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_point()

We can add 90% transparency with alpha = 0.1, or set alpha as a ratio, and change the color to red, but this still leaves us with difficulty parsing the high vs medium pass locations, although the 1/20 ratio gets us a bit farther. The ratio denominator can be thought of as the “number of points that must be overplotted to give a solid color.” 5

Passes around the 5 yard mark appear to be as common as at the 10 and even 20 yard mark in some cases! We know this isn’t true from nflfastR data. Maybe we can try to get closer views of the actual distributions with ggMarginal from the ggExtra R package, which provides marginal histograms, box plots, or density plots.


red_beard <- pass_map_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_point(alpha = 0.1, color = "red") +
  labs(title = "Alpha = 0.1 or 90% transparency")

red_beard_ratio <- pass_map_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_point(alpha = 1/20, color = "red") +
  labs(title = "Alpha = 1/20, so 20 points must be stacked to give solid color")


red_beard + red_beard_ratio

We can add some marginal plots to help out a bit, but that still doesn't solve the problem of homing in on specific areas of interest all that well.


# add density or histogram in the margins
ggMarginal(red_beard, type = "density", fill = "red")


# add density or histogram in the margins
ggMarginal(red_beard, type = "histogram", fill = "red")

2D Density Alternatives

Since we're actually interested in the density/counts of observations at each x-y coordinate instead of plotting each individual point, we also have a few other tools in our toolbox! We can use geom_hex() or geom_bin2d(). Note that there are a few other methods, such as geom_raster(), which require you to provide your own “z” metric (there's a quick sketch of that a bit further down), and there are stat_*() versions of each of these geoms for more custom calculations.


hex_plot <- pass_map_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_hex(
    binwidth = c(1, 1)
  ) +
  scale_fill_gradient(low = "red", high = "yellow") +
  geom_hline(yintercept = c(2, 7), color = "grey") +
  scale_y_continuous(breaks = seq(-10, 60, 5))

hex_plot


bin2d <- pass_map_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_bin2d(
    binwidth = c(1, 1)
  ) +
  scale_fill_gradient(low = "red", high = "yellow") +
  geom_hline(yintercept = c(2, 7), color = "grey") +
  scale_y_continuous(breaks = seq(-10, 60, 5))

bin2d

We can now see that while there ARE passes all over the field, the majority are around the 2-7 yard mark, with small but relatively dense pockets elsewhere. Passes are most frequently thrown at 4, 5, or 6 yards, and about two-thirds of passes are thrown 10 yards or shorter.


# histogram of just Y coord
pass_map_df %>%
  ggplot(aes(x = y_coord)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = c(5)) +
  scale_x_continuous(breaks = seq(-10, 60, 5))


# calc some percentages
pass_by_y <- pass_map_df %>%
  mutate(y_rnd = round(y_coord, digits = 0)) %>%
  count(y_rnd) %>%
  mutate(
    total = sum(n),
    pct_total = n / total,
    roll_total = cumsum(pct_total)
  )

# ~35% of passes between 2-7 yards
pass_by_y %>%
  filter(between(y_rnd, 2, 7)) %>%
  summarize(pct = sum(pct_total))

# A tibble: 1 x 1
    pct
  <dbl>
1 0.351

# passes are most commonly thrown at 4-6 yards
# ~67% of passes are thrown 10 yards or shorter
pass_by_y %>%
  arrange(desc(pct_total))

# A tibble: 68 x 5
   y_rnd     n total pct_total roll_total
   <dbl> <int> <int>     <dbl>      <dbl>
 1     4  3705 49758    0.0745      0.408
 2     5  3420 49758    0.0687      0.477
 3     6  3248 49758    0.0653      0.542
 4     2  2774 49758    0.0557      0.282
 5     3  2550 49758    0.0512      0.333
 6     1  1894 49758    0.0381      0.226
 7     0  1775 49758    0.0357      0.188
 8     7  1771 49758    0.0356      0.578
 9     8  1736 49758    0.0349      0.612
10    10  1642 49758    0.0330      0.672
# … with 58 more rows
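
Earlier I mentioned that geom_raster() and the stat_*() versions let you bring your own “z” metric. As a quick, hedged aside before moving on, here's a minimal sketch using stat_summary_2d() to bin a summary statistic (a rough completion rate) instead of raw counts. The pass_type == "COMPLETE" check is my assumption about how completions are coded in this dataset (touchdowns or other outcomes may be coded separately), so treat the result as illustrative.


pass_map_df %>%
  # assumption: completions are coded as pass_type == "COMPLETE"
  mutate(complete = as.integer(pass_type == "COMPLETE")) %>%
  ggplot(aes(x = x_coord, y = y_coord, z = complete)) +
  # average the "z" metric within each 2x2 yard bin
  stat_summary_2d(fun = mean, binwidth = c(2, 2)) +
  scale_fill_gradient(low = "red", high = "yellow", name = "Comp. rate") +
  scale_y_continuous(breaks = seq(-10, 60, 5))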

Don’t bury the lede?

Another major advantage of geom_hex() or geom_bin2d() is that they're remarkably faster than geom_point() for big data, while also reducing the likelihood of overplotting! For a toy example of about 1.7 million points, geom_hex() executes in about 2 sec vs 20 sec for geom_point(), plus a subsequent 30-60 sec to "draw" the geom_point() output in the R/RStudio viewer.
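
If you want a feel for that difference on your own machine, here's a minimal sketch with simulated data (this is not the actual benchmark, and timings will vary widely by hardware):


# simulated data, roughly the same size as the toy example above
fake_big <- tibble(
  x = rnorm(1.7e6),
  y = rnorm(1.7e6)
)

# wrap in print() so the plot actually renders inside system.time()
system.time(print(ggplot(fake_big, aes(x, y)) + geom_point(alpha = 0.1)))
system.time(print(ggplot(fake_big, aes(x, y)) + geom_hex()))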

2D Density of “smaller” data

So we've covered hex and rectangular 2D bins. To me, these are not as attractive for “small data” like we may see in individual QB plots. Take Patrick Mahomes, for example: he only has ~1,000 passes in this dataset. We can plot a geom_hex() for each 1x1 yard chunk like we did before, but I honestly have trouble determining where he likes to throw from that graphic.


pass_map_df %>%
  filter(str_detect(name, c("Mahomes"))) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_hex(binwidth = c(1, 1)) +
  scale_y_continuous(breaks = seq(-10, 60, 5))

We could always use larger bins, but if we’re just trying to see large trends we have another strategy in geom_density_2d().

Perform a 2D kernel density estimation using MASS::kde2d() and display the results with contours. This can be useful for dealing with overplotting. This is a 2D version of geom_density(). geom_density_2d() draws contour lines, and geom_density_2d_filled() draws filled contour bands.

This essentially fits a polygon around the most frequent points by x/y coordinates and then colors them according to density. In this case, x/y is basically the position on the field and “z” is the density of how many passes were thrown to that area.


pass_map_df %>%
  filter(str_detect(name, c("Mahomes"))) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_density_2d_filled() +
  scale_y_continuous(breaks = seq(-10, 60, 5))

Now if we want to get even more clever, we can use this to compare passing heatmaps of specific QBs. We can normalize across the facets and drop the least frequent passes with specific breaks. I've binned into 10 specific breakpoints, and by setting breaks between 0.1 and 1.0, we also drop the very least frequent passes, which otherwise show up as a light fill over the entire remaining plot. Additionally, by adding a horizontal reference line, we can pretty clearly see that Carr's most common passes are behind the 5 yard line, while Mahomes passes beyond the 5 yard line much more frequently.


pass_map_df %>%
  filter(str_detect(name, c("Mahomes|Derek Carr"))) %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_density_2d_filled(
    aes(fill = ..level..),
    contour_var = "ndensity", # normalize to each QBs total passes
    breaks = seq(0.1, 1.0, length.out = 10) # drop the lowest passes
  ) +
  scale_y_continuous(breaks = seq(-10, 60, 5)) +
  facet_wrap(~name) +
  geom_hline(yintercept = 5)

This is all well and good, but it’s not the prettiest piece of dataviz.

Thanks again to Ethan Douglas and his post on Open Source Football 6, which gives us a good framework for building an NFL field as a graph.

Build the field


not_div_5 <- function(x) {
  # select only elements of the vector not divisible by 5
  x[x %% 5 != 0]
}

center_df <- tibble(
  x_coord = c(rep(-3.1, 60), rep(3.1, 60)),
  y_coord = seq(-14, 59, 1) %>% rep(2) %>% not_div_5(),
  text = "--"
)

# line labels
annotate_df <- tibble(
  x_coord = c(12.88, -12.88) %>% rep(each = 5),
  y_coord = seq(10, 50, 10) %>% rep(2),
  text = seq(10, 50, 10) %>% rep(2) %>% str_replace("(.)(.)", "\\1 \\2"),
  rotation = c(90, 270) %>% rep(each = 5)
)

# yardlines
yardline_df <- tibble(
  y = seq(-15, 60, 5),
  yend = seq(-15, 60, 5),
  x = rep(-56 / 2, 16),
  xend = rep(56 / 2, 16)
)

# sidelines
sideline_df <- tibble(
  y = c(-15.15, -15.15),
  yend = c(60.15, 60.15),
  x = c(-56 / 2, 56 / 2),
  xend = c(-56 / 2, 56 / 2)
)

After building some datasets, we can now plot just the field without any other data. This looks nice enough, but I don't want to copy-paste it for all the remaining examples, so let's put it into a function that we can call with + add_field()!


ggplot(data = NULL, aes(x = x_coord, y = y_coord)) +
  coord_cartesian(
    xlim = c(-53.333 / 2, 53.333 / 2),
    ylim = c(-15, 60)
  ) +
  geom_text(
    data = annotate_df, aes(label = text, angle = rotation),
    color = "black", size = 8
  ) +
  geom_segment(
    data = yardline_df, color = "black", size = 1,
    aes(x = x, y = y, xend = xend, yend = yend)
  ) +
  geom_segment(
    x = -56 / 2, y = 0, xend = 56 / 2, yend = 0,
    color = "blue", size = 1, alpha = 0.5
  ) +
  geom_segment(
    data = sideline_df, color = "black", size = 2,
    aes(x = x, y = y, xend = xend, yend = yend)
  ) +
  geom_text(
    data = center_df,
    aes(label = text), color = "black", vjust = 0.32
  ) +
  theme_void()

Wrap the field in a function

We can turn this into a function so that we don’t have to copy-paste it all over the place. Note that I’m wrapping it in a list so I can use it with + in a sequence of ggplot2 calls.

I’m also using front_col and back_col to let us switch from white on black to black on white if needed.


add_field <- function() {
  list(
    coord_cartesian(
      xlim = c(-53.333 / 2, 53.333 / 2),
      ylim = c(-15, 60)
    ),
    geom_text(
      data = annotate_df, aes(label = text, angle = rotation),
      color = front_col, size = 8
    ),
    geom_segment(
      data = yardline_df, color = front_col, size = 1,
      aes(x = x, y = y, xend = xend, yend = yend)
    ),
    geom_segment(
      x = -56 / 2, y = 0, xend = 56 / 2, yend = 0,
      color = "blue", size = 1, alpha = 0.5
    ),
    geom_segment(
      data = sideline_df, color = front_col, size = 2,
      aes(x = x, y = y, xend = xend, yend = yend)
    ),
    geom_text(
      data = center_df,
      aes(label = text), color = front_col, vjust = 0.32
    ),
    theme_void(),
    theme(
      strip.text = element_text(size = 20, color = front_col),
      plot.background = element_rect(fill = back_col, color = NA),
      legend.position = "none",
      plot.margin = unit(c(2, 1, 0.5, 1), unit = "cm"),
      plot.caption = element_text(color = front_col),
      plot.title = element_text(color = front_col),
      plot.subtitle = element_text(color = front_col),
      panel.background = element_rect(fill = back_col, color = NA),
      panel.border = element_blank()
    )
  )
}

QB Comparison

First let’s filter down to just our two QBs to compare, Patrick Mahomes and Russell Wilson.


passer_df <- pass_map_df %>%
  filter(str_detect(name, c("Mahomes|Russell"))) %>%
  mutate(name = factor(name, levels = c("Patrick Mahomes", "Russell Wilson"))) %>%
  select(name, x_coord, y_coord)

passer_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_density2d_filled() +
  theme(legend.position = "none")

We'll specify both fill and color to scale with the level/density, normalize the density across our facets, and finally set our breaks to drop the lowest bin of passes.


pass_map <- passer_df %>%
  ggplot(aes(x = x_coord, y = y_coord)) +
  geom_density_2d_filled(
    aes(fill = ..level.., color = ..level..),
    contour_var = "ndensity", # normalize across facets
    breaks = seq(0.1, 1.0, length.out = 10)
  ) +
  facet_wrap(~name)

pass_map

We can quickly add the field background to this with our function add_field()!


back_col <- "white"
front_col <- "black"

pass_map +
  add_field()

Specify Color schemes

While that's essentially our final graph, perhaps you don't want to use viridis, which is the default color scheme. We can generate custom color palettes or use a pre-built palette via the paletteer R package. Note that the three color palettes I create below all end up in essentially the same place, just built in different ways:


heat_colors <- grDevices::colorRampPalette(c("#800026FF", "#FC4E2AFF", "#FEB24CFF", "#FFFFCCFF"))(10)

heat_palette <- paletteer::paletteer_d("RColorBrewer::YlOrRd", n = 9, direction = -1)

heat_colors_interpolated <- colorRampPalette(paletteer::paletteer_d("RColorBrewer::YlOrRd", n = 9, direction = -1))(10)

heat_colors %>% scales::show_col()


heat_palette %>% scales::show_col()


heat_colors_interpolated %>% scales::show_col()


pass_map +
  add_field() +
  scale_fill_manual(values = c(heat_colors_interpolated), aesthetics = c("fill", "color"))

And also in black!


back_col <- "black"
front_col <- "white"

pass_map +
  add_field() +
  scale_fill_manual(values = c(heat_colors_interpolated), aesthetics = c("fill", "color"))

PFF_Moo style Field

PFF_Moo takes a different approach to field lines, which can be recreated below. 7


not_div_5 <- function(x) {
  # select only elements of the vector not divisible by 5
  x[x %% 5 != 0]
}

center_df <- tibble(
  x_coord = c(rep(-3.1, 60), rep(3.1, 60)),
  y_coord = seq(-14, 59, 1) %>% rep(2) %>% not_div_5(),
  text = "--"
)

# line labels
horiz_yd_df <- tibble(
  x_coord = c(12.88, -12.88) %>% rep(each = 14),
  y_coord = seq(-10, 55, 5) %>% rep(2),
  text = seq(-10, 55, 5) %>% rep(2)
)

# yardlines
yardline_df <- tibble(
  y = seq(-15, 60, 5),
  yend = seq(-15, 60, 5),
  x = rep(-56 / 2, 16),
  xend = rep(56 / 2, 16)
)

# sidelines
sideline_df <- tibble(
  y = c(-15.15, -15.15),
  yend = c(60.15, 60.15),
  x = c(-56.5 / 2, 56.5 / 2),
  xend = c(-56.5 / 2, 56.5 / 2)
)

add_moo_field <- function() {
  list(
    coord_cartesian(
      xlim = c(-53.333/2, 53.333/2),
      ylim = c(-15, 60)
    ),
    geom_segment(
      data = yardline_df, color = front_col, size = 0.5,
      linetype = "dashed", alpha = 0.5,
      aes(x = x, y = y, xend = xend, yend = yend)
    ),
    geom_segment(
      aes(x = -56 / 2, y = 0, xend = 56 / 2, yend = 0),
      color = "blue", size = 1
    ),
    geom_segment(
      data = sideline_df, color = front_col, size = 2,
      aes(x = x, y = y, xend = xend, yend = yend)
    ),
    geom_text(
      data = center_df,
      aes(label = text), color = front_col, vjust = 0.32
    ),
    geom_text(
      data = horiz_yd_df, aes(label = text),
      color = front_col, size = 4, fontface = "bold"
    ),
    theme_void(),
    theme(
      strip.text = element_text(size = 20, color = front_col),
      plot.background = element_rect(fill = back_col, color = NA),
      legend.position = "none",
      plot.margin = unit(c(2, 1, 0.5, 1), unit = "cm"),
      plot.caption = element_text(color = front_col),
      plot.title = element_text(color = front_col),
      plot.subtitle = element_text(color = front_col),
      panel.background = element_rect(fill = back_col, color = NA),
      panel.border = element_blank()
    )
  )
}

back_col <- "white"
front_col <- "black"


ggplot(pass_map_df, aes(x = x_coord, y = y_coord)) +
  geom_density_2d_filled(
    aes(fill = ..level..),
    contour_var = "ndensity", # scale the density to a 0-1 range
    breaks = seq(0.1, 1.0, length.out = 10) # drop the lowest passes
  ) +
  add_moo_field()

Get the raw density estimates

Lastly, we can also get the raw density estimates to use with the points themselves, or for other, more computational purposes beyond just graphics. 8 Note that this is for ALL the QBs at once (see the per-QB sketch at the end of this section).


get_density <- function(x, y, ...) {
  density_out <- MASS::kde2d(x, y, ...)
  int_x <- findInterval(x, density_out$x)
  int_y <- findInterval(y, density_out$y)
  comb_int <- cbind(int_x, int_y)
  return(density_out$z[comb_int])
}

density_map <- pass_map_df %>% 
  select(x_coord, y_coord, name) %>% 
  # function works in dplyr!
  mutate(density = get_density(x_coord, y_coord, n = 100))

density_map %>% 
  ggplot(aes(x = x_coord, y = y_coord, color = density)) +
  geom_point(alpha = 0.2) +
  scale_color_gradient(low = "red", high = "yellow")
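
If you instead want each QB's density relative to their own passes, a grouped mutate() should work. Here's a hedged sketch; the 50-pass cutoff is an arbitrary threshold I'm assuming, just to keep the kernel density estimate stable for low-volume passers.


# hypothetical per-QB variant: density within each QB's own passes
density_by_qb <- pass_map_df %>%
  select(x_coord, y_coord, name) %>%
  group_by(name) %>%
  # drop low-volume passers (arbitrary 50-pass cutoff)
  filter(n() >= 50) %>%
  mutate(density = get_density(x_coord, y_coord, n = 100)) %>%
  ungroup()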

Compare 2D Density

Now for this, we’re actually creating two separate 2 dimensional density estimates, and then subtracting the density of QB2’s passes from QB1’s.

This requires us to play around a bit more with matrices, which is a good exercise! I'm going to keep working to see if this could fit into a tibble-oriented workflow, but so far it's a matrix up until the end. Credit to eipi10 for the code I adapted from Stack Overflow. 9


qb_density_compare <- function(pass_df, qb1_name, qb2_name, n = 200){
  
  # filter to qb1
  qb1 <- pass_df %>% 
    select(x_coord, y_coord, name) %>% 
    filter(str_detect(name, qb1_name))
  
  #filter to qb2
  qb2 <- pass_df %>% 
    select(x_coord, y_coord, name) %>% 
    filter(str_detect(name, qb2_name))
  
  # get x/y coords as vectors
  qb1_x <- pull(qb1, x_coord)
  qb1_y <- pull(qb1, y_coord)
  
  # get x/y coords as vectors
  qb2_x <- pull(qb2, x_coord)
  qb2_y <- pull(qb2, y_coord)

  # get x and y range to compute comparisons across
  x_rng = range(c(qb1_x, qb2_x))
  y_rng = range(c(qb1_y, qb2_y))
  
  # Explicitly calculate bandwidth for future use
  bandwidth_x <- MASS::bandwidth.nrd(c(qb1_x, qb2_x))
  bandwidth_y <- MASS::bandwidth.nrd(c(qb1_y, qb2_y))
  
  bandwidth_calc <- c(bandwidth_x, bandwidth_y)
  
  # Calculate the 2d density estimate over the common range
  d2_qb1 = MASS::kde2d(qb1_x, qb1_y, h = bandwidth_calc, n=n, lims=c(x_rng, y_rng))
  d2_qb2 = MASS::kde2d(qb2_x, qb2_y, h = bandwidth_calc, n=n, lims=c(x_rng, y_rng))
  
  # create diff df
  qb_diff <- d2_qb1
  
  # matrix subtraction density from qb2 from qb1
  qb_diff$z <- d2_qb1$z - d2_qb2$z
  
  # add matrix col names
  colnames(qb_diff$z) = qb_diff$y
  
  #### return tidy tibble ####
  qb_diff$z %>% 
    # each col_name is actually the y_coord from the matrix
    as_tibble() %>% 
    # add back the x_coord
    mutate(x_coord= qb_diff$x) %>% 
    pivot_longer(-x_coord, names_to = "y_coord", values_to = "z") %>% 
    mutate(y_coord = as.double(y_coord),
           bandwidth = list(bandwidth_calc),
           comparison = glue::glue("{qb1_name} (QB1) vs {qb2_name} (QB2)"))

}

Call the function

Now we can just call the function to generate a beautiful dataset to plot or work with!

Let’s compare… Mahomes vs Rodgers.


compared_z <- qb_density_compare(pass_map_df, "Mahomes", "Rodgers", n = 200) 

(compared_plot <- compared_z %>% 
  ggplot(aes(x_coord, y_coord)) +
  
  # add core heatmap - note that geom_raster or geom_tile both work
  geom_raster(aes(x_coord, y_coord, fill=z))  +
  
  # add contour polygon lines around the most dense points
  stat_contour(aes(color=..level.., z = z)) +
  
  # add a fill gradient from low (blue) to high (red) 
  # with white as the zero midpoint
  scale_fill_gradient2(low="blue",mid="white", high="red", midpoint=0) +
  scale_color_gradient2(low="blue", mid="white", high="red", midpoint=0) +
  # drop the legends
  guides(color=FALSE, fill = FALSE) +
  add_moo_field() +
  labs(title = unique(compared_z$comparison),
       subtitle = "Color is more passes by <span style='color:red'>**QB1**</span> or by <span style='color:blue'>**QB2**</span>",
       caption = "Plot: @thomas_mock | Data: @ChiefsAnalytics")) +
  # add some customizations to the plot
  theme(legend.position = "top", legend.key.width = unit(2, "cm"),
        plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
        plot.subtitle = element_markdown(size = 12, hjust = 0.5),
        plot.caption = element_text(face = "bold")) 

Try it out for yourself!
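
For example, here's a quick sketch swapping in two other QBs that appear earlier in the post; any pair of name patterns that match the name column should work.


# compare two other QBs with the same recipe as above
carr_watson <- qb_density_compare(pass_map_df, "Derek Carr", "Watson", n = 200)

carr_watson %>%
  ggplot(aes(x_coord, y_coord)) +
  geom_raster(aes(fill = z)) +
  stat_contour(aes(color = ..level.., z = z)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  guides(color = FALSE, fill = FALSE) +
  add_moo_field() +
  labs(title = unique(carr_watson$comparison))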


  1. Douglas (2020, Aug. 21). Open Source Football: NFL Pass Location Visualization. Retrieved from https://mrcaseb.github.io/open-source-football/posts/2020-08-22-nfl-pass-location-visualization/

  2. Mallepalle et al. (2019). https://arxiv.org/abs/1906.03339

  3. Wilke, Fundamentals of Data Visualization. https://clauswilke.com/dataviz/

  4. https://github.com/clauswilke/dataviz/blob/master/visualizing_distributions_I.Rmd#L303-L323

  5. https://ggplot2-book.org/statistical-summaries.html#overplotting

  6. Douglas (2020, Aug. 21). Open Source Football: NFL Pass Location Visualization. Retrieved from https://mrcaseb.github.io/open-source-football/posts/2020-08-22-nfl-pass-location-visualization/

  7. https://twitter.com/PFF_Moo/status/1235681100837486592?s=20

  8. Function adapted from Kamil Slowikowski

  9. https://stackoverflow.com/questions/28521145/r-calculate-and-plot-difference-between-two-density-countours

Citation

For attribution, please cite this work as

Mock (2020, Aug. 28). The Mockup Blog: Heatmaps in ggplot2. Retrieved from https://themockup.blog/posts/2020-08-28-heatmaps-in-ggplot2/

BibTeX citation

@misc{mock2020heatmaps,
  author = {Mock, Thomas},
  title = {The Mockup Blog: Heatmaps in ggplot2},
  url = {https://themockup.blog/posts/2020-08-28-heatmaps-in-ggplot2/},
  year = {2020}
}