Joins vs case whens - speed and memory tradeoffs

tidyverse data cleaning joins

Spoiler - joins are fastest, and can be easy to prep!

Thomas Mock https://twitter.com/thomas_mock
02-13-2021

Matching names

There are many times where you may have a reference dataset that has some type of short-code, abbreviations, nicknames, or even misspellings. In this situation we want to create a new column or even overwrite the existing column with the correct spelling, the full name, etc.

While there isn’t a wrong way to do this, I’ve made mistakes in the past of making things more complicated for myself during data cleaning/prep steps. Most importantly to me in my data cleaning steps are “speed of thought” - aka how long it takes me to type out or reason through, but there are also tradeoffs in memory usage and in execution speed.

I’m going to step through some strategies, honed by years of TidyTuesday data cleaning and prep!


In general I want to preface all these examples with the fact that this exercise is easily translatable to various join options. Joins are great, but there are situations where the logicals aren’t simple matches/index (like x == "val") but rather more complex queries (like x > 5 & var %in% c("val", "val2")), but I’m going to live within the constraints of my example use case. Also note that joins could join on more than one key or add as many variables as needed rather than just one, so there’s additional complexity possible here.


To avoid burying the lede, joins are by far the fastest and most memory-efficient method if you can fit your problem to that solution, data.table::fcase() provides a easy to use and remarkably fast/memory-efficient intermediate method that works in base R, data.tables or dplyr, and dplyr::case_when() is the slowest/most-memory hungry for in-memory cases, BUT is probably fast enough for relatively small data (< 1,000,000 rows) and is the option that allows pushing compute into a remote backend via dbplyr/sparklyr.


Example Data

We’re going to simulate a dataset with NFL team abbreviations and a “stat” that is relatively in scope with something like EPA, but again this is all simulated data.

library(tidyverse)

# relatively nice sized dataset (100,000 rows)
sample_size <- 100000

# Get example team names from espn
all_teams <- espnscrapeR::get_nfl_teams()

# set a reproducible seed (h/t to Anthony's wife's new $4 luggage...)
set.seed(829)

ex_df <- tibble(
  team = sample(all_teams$team_short_name, size = sample_size, replace = TRUE),
  stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)

ex_df
# A tibble: 100,000 x 2
   team     stat
   <chr>   <dbl>
 1 HOU    0.0593
 2 JAX   -0.0534
 3 DEN   -0.0488
 4 NO    -0.0527
 5 CAR    0.111 
 6 CHI    0.132 
 7 CAR    0.0481
 8 IND    0.412 
 9 HOU    0.0425
10 ARI    0.200 
# … with 99,990 more rows

dplyr::case_when() solution

There’s a fantastic function in the dplyr package called case_when().

case_when:

This function allows you to vectorise multiple if_else()

In essence, this provides the ability to provide many logical statements to generate a specific outcome. You can read more about the syntax via: ?dplyr::case_when() in the R console or the docs. This functions works on vectors as well as in mutations inside data.frames/tibbles/data.tables. Lastly, it can also be translated into SQL automatically via dbplyr in situations where you don’t want to bring all the data into memory.

case_when is very readable to me, and feels better than a whole bundle of nested ifelse/if_else statements.


A quick reprex of case_when() usage.

ex_car <- mpg %>% 
  select(model, cyl, cty, hwy) %>% 
  slice_sample(n = 100)

ex_car %>% 
  mutate(
    cyl_character = case_when(
      # if logical statement is TRUE then assign new value
      cyl == 4 ~ "Four",
      cyl == 6 ~ "Six",
      cyl == 8 ~ "Eight",
      # This last statement means, if none of the above conditions are met
      # then return a NA value, note that case_when is type-safe
      # so you'll want to use the various NA_??? for the type of output
      TRUE ~ NA_character_
    )
  )
# A tibble: 100 x 5
   model                 cyl   cty   hwy cyl_character
   <chr>               <int> <int> <int> <chr>        
 1 a4                      4    20    31 Four         
 2 malibu                  4    22    30 Four         
 3 ram 1500 pickup 4wd     8     9    12 Eight        
 4 jetta                   5    21    29 <NA>         
 5 passat                  4    18    29 Four         
 6 camry solara            6    18    26 Six          
 7 f150 pickup 4wd         8    13    16 Eight        
 8 a4 quattro              6    17    25 Six          
 9 grand cherokee 4wd      8    14    19 Eight        
10 jetta                   4    22    29 Four         
# … with 90 more rows

Now that we understand the syntax a bit better, let’s try it on our dataset of interest! Remember, we want to take our team abbreviations and add the team names into the dataset. So we’ll need to provide a logical match of each team to the correct abbreviation and output the corresponding full name.

This works out very well and we got all the names in the right spot, and with the basic timing via tictoc we can see it was still fast. However, there’s a LOT of typing, and more importantly a lot of repetitive typing to get this done.

library(tictoc)

tic()
ex_df %>% 
    mutate(
      team_name = case_when(
        team == "ARI" ~ "Cardinals",
        team == "ATL" ~ "Falcons",
        team == "BAL" ~ "Ravens",
        team == "BUF" ~ "Bills",
        team == "CAR" ~ "Panthers",
        team == "CHI" ~ "Bears",
        team == "CIN" ~ "Bengals",
        team == "CLE" ~ "Browns",
        team == "DAL" ~ "Cowboys",
        team == "DEN" ~ "Broncos",
        team == "DET" ~ "Lions",
        team == "GB" ~ "Packers",
        team == "HOU" ~ "Texans",
        team == "IND" ~ "Colts",
        team == "JAX" ~ "Jaguars",
        team == "KC" ~ "Chiefs",
        team == "LV" ~ "Raiders",
        team == "LAC" ~ "Chargers",
        team == "LAR" ~ "Rams",
        team == "MIA" ~ "Dolphins",
        team == "MIN" ~ "Vikings",
        team == "NE" ~ "Patriots",
        team == "NO" ~ "Saints",
        team == "NYG" ~ "Giants",
        team == "NYJ" ~ "Jets",
        team == "PHI" ~ "Eagles",
        team == "PIT" ~ "Steelers",
        team == "SF" ~ "49ers",
        team == "SEA" ~ "Seahawks",
        team == "TB" ~ "Buccaneers",
        team == "TEN" ~ "Titans",
        team == "WSH" ~ NA_character_,
        TRUE ~ NA_character_
      )
    )
# A tibble: 100,000 x 3
   team     stat team_name
   <chr>   <dbl> <chr>    
 1 HOU    0.0593 Texans   
 2 JAX   -0.0534 Jaguars  
 3 DEN   -0.0488 Broncos  
 4 NO    -0.0527 Saints   
 5 CAR    0.111  Panthers 
 6 CHI    0.132  Bears    
 7 CAR    0.0481 Panthers 
 8 IND    0.412  Colts    
 9 HOU    0.0425 Texans   
10 ARI    0.200  Cardinals
# … with 99,990 more rows
toc()
0.243 sec elapsed

We can use a few tricks in RStudio to make this easier!

We’ll use “multiple cursors” in RStudio and combine that with the amazing {datapasta} package from Miles McBain.

datapasta is about reducing resistance associated with copying and pasting data to and from R. It is a response to the realisation that I often found myself using intermediate programs like Sublime to munge text into suitable formats. Addins and functions in datapasta support a wide variety of input and output situations, so it (probably) “just works”. Hopefully tools in this package will remove such intermediate steps and associated frustrations from our data slinging workflows.

Step 1: datapasta

We’ll first create a character vector via datapasta, using the below code:

ex_df %>% 
  distinct(team) %>% 
  arrange(team) %>% 
  pull() %>% 
  datapasta::vector_paste_vertical()

I’ve shown a video of me stepping through this code since there are a lot of different options for datapasta and its outputs.

This gives us a nice starting point of all the vector items and we are ready to work with them in our script.

Step 2: Multiple line selection in RStudio

I’ve done this as a video as it’s fairly complex to “write about”, but very easy to show! The video describes the process of enabling multiple selections in RStudio.

Note that the enabling multi-line selection is done in RStudio via:

Once we have this “skeleton” quickly created we can then add the individual “match outputs”, like “Cardinals” for “ARI”, “Falcons” for “ATL”, etc, but we’ve saved ourself the need to type out a lot of the repeated “skeleton” of the case_when().

A join solution

While the above method is pretty quick to create and understand, we still had to type quite a bit (which is decreased if you use multiple cursors), and in many situations a true join is more efficient in terms of typing, “speed of thought”, and execution time.

team_join <- all_teams %>% 
  select(team_name, team = team_short_name)

tic()
left_join(ex_df, team_join, by = "team")
# A tibble: 100,000 x 3
   team     stat team_name
   <chr>   <dbl> <chr>    
 1 HOU    0.0593 Texans   
 2 JAX   -0.0534 Jaguars  
 3 DEN   -0.0488 Broncos  
 4 NO    -0.0527 Saints   
 5 CAR    0.111  Panthers 
 6 CHI    0.132  Bears    
 7 CAR    0.0481 Panthers 
 8 IND    0.412  Colts    
 9 HOU    0.0425 Texans   
10 ARI    0.200  Cardinals
# … with 99,990 more rows
toc()
0.026 sec elapsed

Now you may say, well of course I could just use a join if I have a perfect dataset, what do I do with data cleaning?! I would say that a while ago this was a revelation as I typically only would move to key-value pairs + joins when it was obvious, and sometimes I’d end up using a case when solution when a join was easily possible.

So, keep in mind that we can often “create” a joining dataset and use it pretty easily!

tribble

We can use tribble to use almost the exact same setup as the case_when(). We can even use the same multiple cursor workflow to take alter our existing case_when() code and turn it into the below tribble setup.

tribble(
  ~team, ~team_name,
  "ARI", "Cardinals",
  "ATL", "Falcons",
  "BAL", "Ravens",
  "BUF", "Bills",
  "CAR", "Panthers",
  "CHI", "Bears",
  "CIN", "Bengals",
  "CLE", "Browns",
  "DAL", "Cowboys",
  "DEN", "Broncos",
  "DET", "Lions",
  "GB", "Packers",
  "HOU", "Texans",
  "IND", "Colts",
  "JAX", "Jaguars",
  "KC", "Chiefs",
  "LV", "Raiders",
  "LAC", "Chargers",
  "LAR", "Rams",
  "MIA", "Dolphins",
  "MIN", "Vikings",
  "NE", "Patriots",
  "NO", "Saints",
  "NYG", "Giants",
  "NYJ", "Jets",
  "PHI", "Eagles",
  "PIT", "Steelers",
  "SF", "49ers",
  "SEA", "Seahawks",
  "TB", "Buccaneers",
  "TEN", "Titans",
  "WSH", NA_character_
  )
# A tibble: 32 x 2
   team  team_name
   <chr> <chr>    
 1 ARI   Cardinals
 2 ATL   Falcons  
 3 BAL   Ravens   
 4 BUF   Bills    
 5 CAR   Panthers 
 6 CHI   Bears    
 7 CIN   Bengals  
 8 CLE   Browns   
 9 DAL   Cowboys  
10 DEN   Broncos  
# … with 22 more rows

You could also just create a more traditional data.frame/tibble, and do the same thing without having to “align” all the rows visually. I do find it a bit easier to look at it row-by-row in the tribble so that I can sanity check myself as I go, but up to the reader to decide!

ex_df %>% 
  distinct(team) %>% 
  arrange(team) %>% 
  mutate(
    team_name = 
      c(
        "Cardinals", "Falcons", "Ravens", "Bills", "Panthers", "Bears", 
        "Bengals", "Browns", "Cowboys", "Broncos", "Lions", "Packers", "Texans", 
        "Colts", "Jaguars", "Chiefs", "Raiders", "Chargers", "Rams", "Dolphins", 
        "Vikings", "Patriots", "Saints", "Giants", "Jets", "Eagles", "Steelers", 
        "49ers", "Seahawks", "Buccaneers", "Titans", NA_character_
    )
  )
# A tibble: 32 x 2
   team  team_name
   <chr> <chr>    
 1 ARI   Cardinals
 2 ATL   Falcons  
 3 BAL   Ravens   
 4 BUF   Bills    
 5 CAR   Panthers 
 6 CHI   Bears    
 7 CIN   Bengals  
 8 CLE   Browns   
 9 DAL   Cowboys  
10 DEN   Broncos  
# … with 22 more rows

data.table::fcase

Now there is also an equivalent function to case_when() in data.table called fcase (short for fast case when) as of late 2019. Since it operates on any vector, you can use it as a drop in replacement for most in-memory cases of case_when() in dplyr, or use it in native data.table objects. The expectation is that it will operate similarly in output but be more efficient with regards to memory/time.

Note, you do not have the ability to push the translation of fcase() into SQL though, as dbplyr only has SQL-translation for case_when().

The syntax of fcase() and case_when() are essentially the same, but rather than a formula-syntax for assignment via ~, it relies on another comma to assign by position.

Compare the syntax below:

library(data.table)

mtcars %>% 
  tibble() %>%
  select(mpg, cyl) %>% 
  mutate(
    cyl_case_when = case_when(
      # relies on `~` to indicate assign
      # read as if cyl is equal to 4 assign "Four"
      cyl == 4 ~ "Four",
      cyl == 6 ~ "Six",
      cyl == 8 ~ "Eight",
      TRUE ~ NA_character_
    ),
    cyl_fcase = fcase(
      # relies on another `,` to assign by position
      # reas as if cyl is equal to 4, "Four"
      cyl == 4, "Four",
      cyl == 6, "Six",
      cyl == 8, "Eight",
      TRUE, NA_character_
    ),
    cyl_match = cyl_case_when == cyl_fcase
  )
# A tibble: 32 x 5
     mpg   cyl cyl_case_when cyl_fcase cyl_match
   <dbl> <dbl> <chr>         <chr>     <lgl>    
 1  21       6 Six           Six       TRUE     
 2  21       6 Six           Six       TRUE     
 3  22.8     4 Four          Four      TRUE     
 4  21.4     6 Six           Six       TRUE     
 5  18.7     8 Eight         Eight     TRUE     
 6  18.1     6 Six           Six       TRUE     
 7  14.3     8 Eight         Eight     TRUE     
 8  24.4     4 Four          Four      TRUE     
 9  22.8     4 Four          Four      TRUE     
10  19.2     6 Six           Six       TRUE     
# … with 22 more rows

Now, I have converted our example with the case_when() statement into fcase() for both dplyr and data.table, but have hidden the code in the detail box below since it’s almost identical and would be quite repetitive. I have printed a test to see if the outcomes are equivalent though between native data.table and dplyr, and they are! 🥳

Really happy to have the option of using some various methods in-memory via fcase() and/or case_when() and optionally if working with REALLY large data push some of that compute into the database via dbplyr + case_when().

Show code
tic()
dplyr_fcase <- ex_df %>% 
    mutate(
      team_name = fcase(
        team == "ARI", "Cardinals",
        team == "ATL", "Falcons",
        team == "BAL", "Ravens",
        team == "BUF", "Bills",
        team == "CAR", "Panthers",
        team == "CHI", "Bears",
        team == "CIN", "Bengals",
        team == "CLE", "Browns",
        team == "DAL", "Cowboys",
        team == "DEN", "Broncos",
        team == "DET", "Lions",
        team == "GB" ,"Packers",
        team == "HOU", "Texans",
        team == "IND", "Colts",
        team == "JAX", "Jaguars",
        team == "KC" ,"Chiefs",
        team == "LV" ,"Raiders",
        team == "LAC", "Chargers",
        team == "LAR", "Rams",
        team == "MIA", "Dolphins",
        team == "MIN", "Vikings",
        team == "NE" ,"Patriots",
        team == "NO" ,"Saints",
        team == "NYG", "Giants",
        team == "NYJ", "Jets",
        team == "PHI", "Eagles",
        team == "PIT", "Steelers",
        team == "SF" ,"49ers",
        team == "SEA", "Seahawks",
        team == "TB" ,"Buccaneers",
        team == "TEN", "Titans",
        team == "WSH", NA_character_,
        TRUE, NA_character_
      )
    )
toc()
0.05 sec elapsed
Show code
# data.table native
dt_fcase <- data.table(ex_df)[, team_name := fcase(
    team == "ARI", "Cardinals",
    team == "ATL", "Falcons",
    team == "BAL", "Ravens",
    team == "BUF", "Bills",
    team == "CAR", "Panthers",
    team == "CHI", "Bears",
    team == "CIN", "Bengals",
    team == "CLE", "Browns",
    team == "DAL", "Cowboys",
    team == "DEN", "Broncos",
    team == "DET", "Lions",
    team == "GB" ,"Packers",
    team == "HOU", "Texans",
    team == "IND", "Colts",
    team == "JAX", "Jaguars",
    team == "KC" ,"Chiefs",
    team == "LV" ,"Raiders",
    team == "LAC", "Chargers",
    team == "LAR", "Rams",
    team == "MIA", "Dolphins",
    team == "MIN", "Vikings",
    team == "NE" ,"Patriots",
    team == "NO" ,"Saints",
    team == "NYG", "Giants",
    team == "NYJ", "Jets",
    team == "PHI", "Eagles",
    team == "PIT", "Steelers",
    team == "SF" ,"49ers",
    team == "SEA", "Seahawks",
    team == "TB" ,"Buccaneers",
    team == "TEN", "Titans",
    team == "WSH", NA_character_,
    TRUE, NA_character_
  )] %>% 
    as_tibble()
all_equal(dplyr_fcase, dt_fcase)
[1] TRUE

Speed and Memory

I’m going to preface this by also saying that this is a particular straight forward problem to translate into a join, and it has 32 total comparisons. Most uses of case_when() or fcase() will be more traditional and/or complex logic problems and be much fewer than 30!!!

Now with that being said, you may still say, well I’m not convinced that join() method is any easier or faster to code out for humans, and we’re all free to make our own decisions! I do want to note that case_when() gets memory-inefficient much faster than ???_join().

We can really highlight the differences in the execution time and the memory allocated via the {bench} package. I’m using bench to execute the two different methods 3x time, compare the timing and the memory used, along with some other stats.

In the expandable section below we have a repeat of our above left_join() and case_when() calls.

Show code
join_expr <- function(){
  left_join(ex_df, team_join, by = "team")
}

case_when_expr <- function(){
  ex_df %>% 
    mutate(
      team_name = case_when(
        team == "ARI" ~ "Cardinals",
        team == "ATL" ~ "Falcons",
        team == "BAL" ~ "Ravens",
        team == "BUF" ~ "Bills",
        team == "CAR" ~ "Panthers",
        team == "CHI" ~ "Bears",
        team == "CIN" ~ "Bengals",
        team == "CLE" ~ "Browns",
        team == "DAL" ~ "Cowboys",
        team == "DEN" ~ "Broncos",
        team == "DET" ~ "Lions",
        team == "GB" ~ "Packers",
        team == "HOU" ~ "Texans",
        team == "IND" ~ "Colts",
        team == "JAX" ~ "Jaguars",
        team == "KC" ~ "Chiefs",
        team == "LV" ~ "Raiders",
        team == "LAC" ~ "Chargers",
        team == "LAR" ~ "Rams",
        team == "MIA" ~ "Dolphins",
        team == "MIN" ~ "Vikings",
        team == "NE" ~ "Patriots",
        team == "NO" ~ "Saints",
        team == "NYG" ~ "Giants",
        team == "NYJ" ~ "Jets",
        team == "PHI" ~ "Eagles",
        team == "PIT" ~ "Steelers",
        team == "SF" ~ "49ers",
        team == "SEA" ~ "Seahawks",
        team == "TB" ~ "Buccaneers",
        team == "TEN" ~ "Titans",
        team == "WSH" ~ NA_character_,
        TRUE ~ NA_character_
      )
    )
} 

We can then compare their execution multiple times with the bench package. This will vary by the execution, but with 3 iterations and 100,000 rows, I have seen about a 10x speed improvement in left_join vs case_when. Note that in most cases this is still pretty much instantaneous in “human time”.

However, do note that we use about 28-30 times more memory for the case_when() statement.

bench::mark(
  min_time = 0.1,
  max_iterations = 3,
  min_iterations = 3,
  case_when_expr(),
  join_expr()
) %>% 
  select(expression, min, median, mem_alloc, n_itr)
# A tibble: 2 x 4
  expression            min   median mem_alloc
  <bch:expr>       <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr() 154.93ms  217.7ms  150.81MB
2 join_expr()        9.79ms   13.7ms    5.34MB

The 150 Mb of memory used, may not seem like a lot with smaller datasets (100,000), but if we were to bump this up to 1,000,000 rows we see a similar 10x increase of memory. Specifically we are now up to ~1.5 Gb of memory used for case_when(). So if you’re going beyond the millions or 10s of millions of rows AND are trying to do a LOT of comparisons, probably a good idea to start refactoring into a join if possible!

Show code
# 1,000,000 rows as 1e6
sample_size <- 1e6

set.seed(829)

ex_df <- tibble(
  team = sample(all_teams$team_short_name, size = sample_size, replace = TRUE),
  stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)

bench::mark(
  min_time = 0.1,
  max_iterations = 3,
  min_iterations = 3,
  case_when_expr(),
  join_expr()
) %>% 
  select(expression, min, median, mem_alloc, n_itr)
# A tibble: 2 x 4
  expression            min   median mem_alloc
  <bch:expr>       <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr()     1.3s    1.36s    1.47GB
2 join_expr()        71.3ms  74.53ms   53.41MB

There are situations where a join doesn’t solve the problem, so we can go one step further and add in our data.table::fcase() or dtplyr-translation of dplyr into data.table. All the example reprex code is in an expandable chunk below.

Show code
library(dtplyr)

case_when_dplyr <- function(){
  ex_df %>% 
    mutate(
      team_name = case_when(
        team == "ARI" ~ "Cardinals",
        team == "ATL" ~ "Falcons",
        team == "BAL" ~ "Ravens",
        team == "BUF" ~ "Bills",
        team == "CAR" ~ "Panthers",
        team == "CHI" ~ "Bears",
        team == "CIN" ~ "Bengals",
        team == "CLE" ~ "Browns",
        team == "DAL" ~ "Cowboys",
        team == "DEN" ~ "Broncos",
        team == "DET" ~ "Lions",
        team == "GB" ~ "Packers",
        team == "HOU" ~ "Texans",
        team == "IND" ~ "Colts",
        team == "JAX" ~ "Jaguars",
        team == "KC" ~ "Chiefs",
        team == "LV" ~ "Raiders",
        team == "LAC" ~ "Chargers",
        team == "LAR" ~ "Rams",
        team == "MIA" ~ "Dolphins",
        team == "MIN" ~ "Vikings",
        team == "NE" ~ "Patriots",
        team == "NO" ~ "Saints",
        team == "NYG" ~ "Giants",
        team == "NYJ" ~ "Jets",
        team == "PHI" ~ "Eagles",
        team == "PIT" ~ "Steelers",
        team == "SF" ~ "49ers",
        team == "SEA" ~ "Seahawks",
        team == "TB" ~ "Buccaneers",
        team == "TEN" ~ "Titans",
        team == "WSH" ~ NA_character_,
        TRUE ~ NA_character_
      )
    )
} 


join_dplyr <- function(){
  left_join(ex_df, team_join, by = "team")
}

join_dtplyr <- function(){
  dt_ex_df <- lazy_dt(ex_df)
  team_join <- lazy_dt(team_join)
    
  dt_ex_df %>% 
    left_join(team_join, by = "team") %>% 
    select(team, stat, team_name) %>% 
    as_tibble() 
}

case_when_dtplyr <- function(){
  lazy_dt(ex_df) %>% 
    mutate(
      team_name = case_when(
        team == "ARI" ~ "Cardinals",
        team == "ATL" ~ "Falcons",
        team == "BAL" ~ "Ravens",
        team == "BUF" ~ "Bills",
        team == "CAR" ~ "Panthers",
        team == "CHI" ~ "Bears",
        team == "CIN" ~ "Bengals",
        team == "CLE" ~ "Browns",
        team == "DAL" ~ "Cowboys",
        team == "DEN" ~ "Broncos",
        team == "DET" ~ "Lions",
        team == "GB" ~ "Packers",
        team == "HOU" ~ "Texans",
        team == "IND" ~ "Colts",
        team == "JAX" ~ "Jaguars",
        team == "KC" ~ "Chiefs",
        team == "LV" ~ "Raiders",
        team == "LAC" ~ "Chargers",
        team == "LAR" ~ "Rams",
        team == "MIA" ~ "Dolphins",
        team == "MIN" ~ "Vikings",
        team == "NE" ~ "Patriots",
        team == "NO" ~ "Saints",
        team == "NYG" ~ "Giants",
        team == "NYJ" ~ "Jets",
        team == "PHI" ~ "Eagles",
        team == "PIT" ~ "Steelers",
        team == "SF" ~ "49ers",
        team == "SEA" ~ "Seahawks",
        team == "TB" ~ "Buccaneers",
        team == "TEN" ~ "Titans",
        team == "WSH" ~ NA_character_,
        TRUE ~ NA_character_
      )
    ) %>% 
    as_tibble()
  
}

fcase_dplyr <- function(){
  ex_df %>% 
    mutate(
      team_name = fcase(
        team == "ARI", "Cardinals",
        team == "ATL", "Falcons",
        team == "BAL", "Ravens",
        team == "BUF", "Bills",
        team == "CAR", "Panthers",
        team == "CHI", "Bears",
        team == "CIN", "Bengals",
        team == "CLE", "Browns",
        team == "DAL", "Cowboys",
        team == "DEN", "Broncos",
        team == "DET", "Lions",
        team == "GB" ,"Packers",
        team == "HOU", "Texans",
        team == "IND", "Colts",
        team == "JAX", "Jaguars",
        team == "KC" ,"Chiefs",
        team == "LV" ,"Raiders",
        team == "LAC", "Chargers",
        team == "LAR", "Rams",
        team == "MIA", "Dolphins",
        team == "MIN", "Vikings",
        team == "NE" ,"Patriots",
        team == "NO" ,"Saints",
        team == "NYG", "Giants",
        team == "NYJ", "Jets",
        team == "PHI", "Eagles",
        team == "PIT", "Steelers",
        team == "SF" ,"49ers",
        team == "SEA", "Seahawks",
        team == "TB" ,"Buccaneers",
        team == "TEN", "Titans",
        team == "WSH", NA_character_
      )
    )
}
  

fcase_dt_native <- function(){
  data.table(ex_df)[, team_name := fcase(
    team == "ARI", "Cardinals",
    team == "ATL", "Falcons",
    team == "BAL", "Ravens",
    team == "BUF", "Bills",
    team == "CAR", "Panthers",
    team == "CHI", "Bears",
    team == "CIN", "Bengals",
    team == "CLE", "Browns",
    team == "DAL", "Cowboys",
    team == "DEN", "Broncos",
    team == "DET", "Lions",
    team == "GB" ,"Packers",
    team == "HOU", "Texans",
    team == "IND", "Colts",
    team == "JAX", "Jaguars",
    team == "KC" ,"Chiefs",
    team == "LV" ,"Raiders",
    team == "LAC", "Chargers",
    team == "LAR", "Rams",
    team == "MIA", "Dolphins",
    team == "MIN", "Vikings",
    team == "NE" ,"Patriots",
    team == "NO" ,"Saints",
    team == "NYG", "Giants",
    team == "NYJ", "Jets",
    team == "PHI", "Eagles",
    team == "PIT", "Steelers",
    team == "SF" ,"49ers",
    team == "SEA", "Seahawks",
    team == "TB" ,"Buccaneers",
    team == "TEN", "Titans",
    team == "WSH", NA_character_
  )] %>% 
    as_tibble()
}

join_dt_native <- function(){
  data.table(ex_df)[data.table(team_join), on = .(team), team_name := team_name] %>% 
    as_tibble()
}

Finally we can check the timing/memory usage for all of the combos. In short, join()-methods are fastest and use the least memory, fcase() whether in native data.table or dplyr is a bit slower/more memory than join but still ~5x faster/more memory efficient than case_when(), and case_when() is the slowest/most memory hungry (but translates into SQL if needed).

Overall, I’m really happy that we have the ability to mix and match functions from various awesome packages depending on the problems we’re trying to solve!

Show code
bench::mark(
  min_time = 0.1,
  max_iterations = 3,
  min_iterations = 3,
  case_when_dplyr(),
  case_when_dtplyr(),
  fcase_dplyr(),
  fcase_dt_native(),
  join_dplyr(),
  join_dt_native(),
  check = FALSE
) %>% 
  select(expression, min, median, mem_alloc, n_itr)
# A tibble: 6 x 4
  expression              min   median mem_alloc
  <bch:expr>         <bch:tm> <bch:tm> <bch:byt>
1 case_when_dplyr()     1.46s    1.49s    1.47GB
2 case_when_dtplyr()    1.49s    1.54s    1.51GB
3 fcase_dplyr()      320.37ms 324.42ms  133.75MB
4 fcase_dt_native()  310.85ms 313.07ms  164.42MB
5 join_dplyr()        76.48ms  80.98ms   53.41MB
6 join_dt_native()    76.21ms  87.23ms   61.63MB

Citation

For attribution, please cite this work as

Mock (2021, Feb. 13). The Mockup Blog: Joins vs case whens - speed and memory tradeoffs. Retrieved from https://themockup.blog/posts/2021-02-13-joins-vs-casewhen-speed-and-memory-tradeoffs/

BibTeX citation

@misc{mock2021joins,
  author = {Mock, Thomas},
  title = {The Mockup Blog: Joins vs case whens - speed and memory tradeoffs},
  url = {https://themockup.blog/posts/2021-02-13-joins-vs-casewhen-speed-and-memory-tradeoffs/},
  year = {2021}
}