Spoiler - joins are fastest, and can be easy to prep!
There are many times when you may have a reference dataset that has some type of short code, abbreviation, nickname, or even misspelling. In this situation we want to create a new column, or even overwrite the existing column, with the correct spelling, the full name, etc.
While there isn't a wrong way to do this, I've made mistakes in the past that made things more complicated for myself during data cleaning/prep. What matters most to me in my data cleaning steps is "speed of thought", aka how long it takes me to type out or reason through a solution, but there are also tradeoffs in memory usage and execution speed.
I’m going to step through some strategies, honed by years of TidyTuesday data cleaning and prep!
In general, I want to preface all of these examples with the fact that this exercise is easily translatable to the various join options. Joins are great, but there are situations where the logic isn't a simple match/index (like x == "val") but rather a more complex query (like x > 5 & var %in% c("val", "val2")), so I'm going to live within the constraints of my example use case. Also note that joins can match on more than one key, or add as many variables as needed rather than just one, so there's additional complexity possible there.
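As a quick illustration of that kind of compound logic, here's a minimal sketch using the built-in mtcars data and made-up cutoffs (separate from the NFL example below). Conditions like these don't reduce to matching on a single key column, so a plain equality join can't express them directly, while case_when() handles them fine.
library(dplyr)
mtcars %>%
  mutate(
    engine_class = case_when(
      # compound conditions: there is no single-column equality to join on
      hp > 200 & cyl %in% c(6, 8) ~ "high-power",
      hp <= 200 & cyl == 4 ~ "economy",
      TRUE ~ "other"
    )
  ) %>%
  count(engine_class)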
To avoid burying the lede: joins are by far the fastest and most memory-efficient method if you can fit your problem to that solution; data.table::fcase() provides an easy-to-use and remarkably fast/memory-efficient intermediate method that works in base R, data.tables, or dplyr; and dplyr::case_when() is the slowest/most memory-hungry for in-memory cases, BUT is probably fast enough for relatively small data (< 1,000,000 rows) and is the option that allows pushing compute into a remote backend via dbplyr/sparklyr.
We're going to simulate a dataset with NFL team abbreviations and a "stat" that is roughly on the scale of something like EPA (expected points added), but again, this is all simulated data.
library(tidyverse)
# relatively nice sized dataset (100,000 rows)
sample_size <- 100000
# Get example team names from espn
all_teams <- espnscrapeR::get_nfl_teams()
# set a reproducible seed (h/t to Anthony's wife's new $4 luggage...)
set.seed(829)
ex_df <- tibble(
team = sample(all_teams$team_short_name, size = sample_size, replace = TRUE),
stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)
ex_df
# A tibble: 100,000 x 2
team stat
<chr> <dbl>
1 HOU 0.0593
2 JAX -0.0534
3 DEN -0.0488
4 NO -0.0527
5 CAR 0.111
6 CHI 0.132
7 CAR 0.0481
8 IND 0.412
9 HOU 0.0425
10 ARI 0.200
# … with 99,990 more rows
dplyr::case_when() solution
There's a fantastic function in the dplyr package called case_when(). This function allows you to vectorise multiple if_else() statements. In essence, it provides the ability to supply many logical statements that each generate a specific outcome. You can read more about the syntax via ?dplyr::case_when() in the R console or in the docs. The function works on vectors as well as in mutations inside data.frames/tibbles/data.tables. Lastly, it can also be translated into SQL automatically via dbplyr in situations where you don't want to bring all the data into memory.
case_when() is very readable to me, and feels better than a whole bundle of nested ifelse()/if_else() statements.
A quick reprex of case_when() usage:
ex_car <- mpg %>%
select(model, cyl, cty, hwy) %>%
slice_sample(n = 100)
ex_car %>%
mutate(
cyl_character = case_when(
# if logical statement is TRUE then assign new value
cyl == 4 ~ "Four",
cyl == 6 ~ "Six",
cyl == 8 ~ "Eight",
# This last statement means, if none of the above conditions are met
# then return a NA value, note that case_when is type-safe
# so you'll want to use the various NA_??? for the type of output
TRUE ~ NA_character_
)
)
# A tibble: 100 x 5
model cyl cty hwy cyl_character
<chr> <int> <int> <int> <chr>
1 a4 4 20 31 Four
2 malibu 4 22 30 Four
3 ram 1500 pickup 4wd 8 9 12 Eight
4 jetta 5 21 29 <NA>
5 passat 4 18 29 Four
6 camry solara 6 18 26 Six
7 f150 pickup 4wd 8 13 16 Eight
8 a4 quattro 6 17 25 Six
9 grand cherokee 4wd 8 14 19 Eight
10 jetta 4 22 29 Four
# … with 90 more rows
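As a quick aside on the dbplyr translation mentioned above, here's a minimal sketch (assuming the dbplyr package is installed, and using its simulated-connection helpers) showing that the same case_when() logic is rendered as SQL instead of being computed in memory:
library(dbplyr)
# a lazy table against a simulated connection, so the dplyr verbs get
# translated to SQL (printed by show_query()) rather than executed in memory
tbl_lazy(ex_car, con = simulate_dbi()) %>%
  mutate(
    cyl_character = case_when(
      cyl == 4 ~ "Four",
      cyl == 6 ~ "Six",
      cyl == 8 ~ "Eight",
      TRUE ~ NA_character_
    )
  ) %>%
  show_query()
The printed query uses a standard CASE WHEN ... END expression, which is why case_when() is the option that still works when the data lives in a remote backend.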
Now that we understand the syntax a bit better, let's try it on our dataset of interest! Remember, we want to take our team abbreviations and add the team names into the dataset, so we'll need to provide a logical match of each team to the correct abbreviation and output the corresponding full name.
This works out very well and we got all the names in the right spot, and with the basic timing via tictoc we can see it was still fast. However, there's a LOT of typing, and more importantly a lot of repetitive typing, to get this done.
library(tictoc)
tic()
ex_df %>%
mutate(
team_name = case_when(
team == "ARI" ~ "Cardinals",
team == "ATL" ~ "Falcons",
team == "BAL" ~ "Ravens",
team == "BUF" ~ "Bills",
team == "CAR" ~ "Panthers",
team == "CHI" ~ "Bears",
team == "CIN" ~ "Bengals",
team == "CLE" ~ "Browns",
team == "DAL" ~ "Cowboys",
team == "DEN" ~ "Broncos",
team == "DET" ~ "Lions",
team == "GB" ~ "Packers",
team == "HOU" ~ "Texans",
team == "IND" ~ "Colts",
team == "JAX" ~ "Jaguars",
team == "KC" ~ "Chiefs",
team == "LV" ~ "Raiders",
team == "LAC" ~ "Chargers",
team == "LAR" ~ "Rams",
team == "MIA" ~ "Dolphins",
team == "MIN" ~ "Vikings",
team == "NE" ~ "Patriots",
team == "NO" ~ "Saints",
team == "NYG" ~ "Giants",
team == "NYJ" ~ "Jets",
team == "PHI" ~ "Eagles",
team == "PIT" ~ "Steelers",
team == "SF" ~ "49ers",
team == "SEA" ~ "Seahawks",
team == "TB" ~ "Buccaneers",
team == "TEN" ~ "Titans",
team == "WSH" ~ NA_character_,
TRUE ~ NA_character_
)
)
# A tibble: 100,000 x 3
team stat team_name
<chr> <dbl> <chr>
1 HOU 0.0593 Texans
2 JAX -0.0534 Jaguars
3 DEN -0.0488 Broncos
4 NO -0.0527 Saints
5 CAR 0.111 Panthers
6 CHI 0.132 Bears
7 CAR 0.0481 Panthers
8 IND 0.412 Colts
9 HOU 0.0425 Texans
10 ARI 0.200 Cardinals
# … with 99,990 more rows
toc()
0.243 sec elapsed
We can use a few tricks in RStudio to make this easier!
We'll use "multiple cursors" in RStudio and combine that with the amazing {datapasta} package from Miles McBain. As the package description puts it:
"datapasta is about reducing resistance associated with copying and pasting data to and from R. It is a response to the realisation that I often found myself using intermediate programs like Sublime to munge text into suitable formats. Addins and functions in datapasta support a wide variety of input and output situations, so it (probably) 'just works'. Hopefully tools in this package will remove such intermediate steps and associated frustrations from our data slinging workflows."
datapasta
We'll first create a character vector via datapasta, using the code below:
ex_df %>%
distinct(team) %>%
arrange(team) %>%
pull() %>%
datapasta::vector_paste_vertical()
I've shown a video of me stepping through this code, since there are a lot of different options for datapasta and its outputs.
This gives us a nice starting point of all the vector items, and we are ready to work with them in our script.
I've done this as a video as it's fairly complex to "write about" but very easy to show! The video describes the process of enabling multiple selections in RStudio. Note that enabling multi-line selection in RStudio is done via:
Once we have this "skeleton" quickly created, we can then add the individual "match outputs", like "Cardinals" for "ARI", "Falcons" for "ATL", etc., but we've saved ourselves the need to type out a lot of the repeated "skeleton" of the case_when().
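As a rough sketch, the partially completed skeleton right after the datapasta + multiple-cursor step might look something like this, with the right-hand values still to be filled in by hand:
ex_df %>%
  mutate(
    team_name = case_when(
      # datapasta pasted the abbreviations vertically; multiple cursors added
      # the repeated `team == ` and `~ ""` boilerplate around each one
      team == "ARI" ~ "",
      team == "ATL" ~ "",
      team == "BAL" ~ "",
      # ...and so on for the remaining abbreviations...
      TRUE ~ NA_character_
    )
  )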
While the above method is pretty quick to create and understand, we still had to type quite a bit (which is decreased if you use multiple cursors), and in many situations a true join is more efficient in terms of typing, “speed of thought”, and execution time.
team_join <- all_teams %>%
select(team_name, team = team_short_name)
tic()
left_join(ex_df, team_join, by = "team")
# A tibble: 100,000 x 3
team stat team_name
<chr> <dbl> <chr>
1 HOU 0.0593 Texans
2 JAX -0.0534 Jaguars
3 DEN -0.0488 Broncos
4 NO -0.0527 Saints
5 CAR 0.111 Panthers
6 CHI 0.132 Bears
7 CAR 0.0481 Panthers
8 IND 0.412 Colts
9 HOU 0.0425 Texans
10 ARI 0.200 Cardinals
# … with 99,990 more rows
toc()
0.026 sec elapsed
Now you may say: well, of course I could just use a join if I have a perfect reference dataset, but what do I do during data cleaning?! A while ago this was a revelation for me, as I typically only moved to key-value pairs + joins when it was obvious, and I'd sometimes end up using a case_when() solution when a join was easily possible.
So, keep in mind that we can often “create” a joining dataset and use it pretty easily!
tribble
We can use tribble() to set up almost exactly the same structure as the case_when(). We can even use the same multiple-cursor workflow to take our existing case_when() code and turn it into the tribble() setup below.
tribble(
~team, ~team_name,
"ARI", "Cardinals",
"ATL", "Falcons",
"BAL", "Ravens",
"BUF", "Bills",
"CAR", "Panthers",
"CHI", "Bears",
"CIN", "Bengals",
"CLE", "Browns",
"DAL", "Cowboys",
"DEN", "Broncos",
"DET", "Lions",
"GB", "Packers",
"HOU", "Texans",
"IND", "Colts",
"JAX", "Jaguars",
"KC", "Chiefs",
"LV", "Raiders",
"LAC", "Chargers",
"LAR", "Rams",
"MIA", "Dolphins",
"MIN", "Vikings",
"NE", "Patriots",
"NO", "Saints",
"NYG", "Giants",
"NYJ", "Jets",
"PHI", "Eagles",
"PIT", "Steelers",
"SF", "49ers",
"SEA", "Seahawks",
"TB", "Buccaneers",
"TEN", "Titans",
"WSH", NA_character_
)
# A tibble: 32 x 2
team team_name
<chr> <chr>
1 ARI Cardinals
2 ATL Falcons
3 BAL Ravens
4 BUF Bills
5 CAR Panthers
6 CHI Bears
7 CIN Bengals
8 CLE Browns
9 DAL Cowboys
10 DEN Broncos
# … with 22 more rows
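To actually use this lookup, you'd assign it and join it exactly like team_join earlier. A minimal sketch (team_lookup is just a name I'm making up here, with the middle rows elided):
team_lookup <- tribble(
  ~team, ~team_name,
  "ARI", "Cardinals",
  "ATL", "Falcons",
  # ...the remaining rows from the tribble above...
  "WSH", NA_character_
)
left_join(ex_df, team_lookup, by = "team")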
You could also just create a more traditional data.frame/tibble and do the same thing, without having to "align" all the rows visually. I do find it a bit easier to look at it row by row in the tribble so that I can sanity-check myself as I go, but it's up to the reader to decide!
ex_df %>%
distinct(team) %>%
arrange(team) %>%
mutate(
team_name =
c(
"Cardinals", "Falcons", "Ravens", "Bills", "Panthers", "Bears",
"Bengals", "Browns", "Cowboys", "Broncos", "Lions", "Packers", "Texans",
"Colts", "Jaguars", "Chiefs", "Raiders", "Chargers", "Rams", "Dolphins",
"Vikings", "Patriots", "Saints", "Giants", "Jets", "Eagles", "Steelers",
"49ers", "Seahawks", "Buccaneers", "Titans", NA_character_
)
)
# A tibble: 32 x 2
team team_name
<chr> <chr>
1 ARI Cardinals
2 ATL Falcons
3 BAL Ravens
4 BUF Bills
5 CAR Panthers
6 CHI Bears
7 CIN Bengals
8 CLE Browns
9 DAL Cowboys
10 DEN Broncos
# … with 22 more rows
data.table::fcase
Now there is also an equivalent function to case_when() in data.table called fcase() (short for "fast case when"), available as of late 2019. Since it operates on any vector, you can use it as a drop-in replacement for most in-memory uses of case_when() in dplyr, or use it on native data.table objects. The expectation is that it produces similar output but is more efficient with regard to memory/time.
Note that you do not have the ability to push fcase() into SQL though, as dbplyr only has SQL translation for case_when().
The syntax of fcase() and case_when() is essentially the same, but rather than a formula syntax for assignment via ~, fcase() relies on another comma to assign by position.
Compare the syntax below:
library(data.table)
mtcars %>%
as_tibble() %>%
select(mpg, cyl) %>%
mutate(
cyl_case_when = case_when(
# relies on `~` to indicate assign
# read as if cyl is equal to 4 assign "Four"
cyl == 4 ~ "Four",
cyl == 6 ~ "Six",
cyl == 8 ~ "Eight",
TRUE ~ NA_character_
),
cyl_fcase = fcase(
# relies on another `,` to assign by position
# read as: if cyl is equal to 4, then "Four"
cyl == 4, "Four",
cyl == 6, "Six",
cyl == 8, "Eight",
TRUE, NA_character_
),
cyl_match = cyl_case_when == cyl_fcase
)
# A tibble: 32 x 5
mpg cyl cyl_case_when cyl_fcase cyl_match
<dbl> <dbl> <chr> <chr> <lgl>
1 21 6 Six Six TRUE
2 21 6 Six Six TRUE
3 22.8 4 Four Four TRUE
4 21.4 6 Six Six TRUE
5 18.7 8 Eight Eight TRUE
6 18.1 6 Six Six TRUE
7 14.3 8 Eight Eight TRUE
8 24.4 4 Four Four TRUE
9 22.8 4 Four Four TRUE
10 19.2 6 Six Six TRUE
# … with 22 more rows
Now, I have converted our example with the case_when() statement into fcase() for both dplyr and data.table, but have hidden the code in the detail box below since it's almost identical and would be quite repetitive. I have printed a test to see if the outcomes are equivalent between native data.table and dplyr, and they are! 🥳
I'm really happy to have the option of using various methods in memory via fcase() and/or case_when(), and optionally, if working with REALLY large data, pushing some of that compute into the database via dbplyr + case_when().
tic()
dplyr_fcase <- ex_df %>%
mutate(
team_name = fcase(
team == "ARI", "Cardinals",
team == "ATL", "Falcons",
team == "BAL", "Ravens",
team == "BUF", "Bills",
team == "CAR", "Panthers",
team == "CHI", "Bears",
team == "CIN", "Bengals",
team == "CLE", "Browns",
team == "DAL", "Cowboys",
team == "DEN", "Broncos",
team == "DET", "Lions",
team == "GB" ,"Packers",
team == "HOU", "Texans",
team == "IND", "Colts",
team == "JAX", "Jaguars",
team == "KC" ,"Chiefs",
team == "LV" ,"Raiders",
team == "LAC", "Chargers",
team == "LAR", "Rams",
team == "MIA", "Dolphins",
team == "MIN", "Vikings",
team == "NE" ,"Patriots",
team == "NO" ,"Saints",
team == "NYG", "Giants",
team == "NYJ", "Jets",
team == "PHI", "Eagles",
team == "PIT", "Steelers",
team == "SF" ,"49ers",
team == "SEA", "Seahawks",
team == "TB" ,"Buccaneers",
team == "TEN", "Titans",
team == "WSH", NA_character_,
TRUE, NA_character_
)
)
toc()
0.05 sec elapsed
# data.table native
dt_fcase <- data.table(ex_df)[, team_name := fcase(
team == "ARI", "Cardinals",
team == "ATL", "Falcons",
team == "BAL", "Ravens",
team == "BUF", "Bills",
team == "CAR", "Panthers",
team == "CHI", "Bears",
team == "CIN", "Bengals",
team == "CLE", "Browns",
team == "DAL", "Cowboys",
team == "DEN", "Broncos",
team == "DET", "Lions",
team == "GB" ,"Packers",
team == "HOU", "Texans",
team == "IND", "Colts",
team == "JAX", "Jaguars",
team == "KC" ,"Chiefs",
team == "LV" ,"Raiders",
team == "LAC", "Chargers",
team == "LAR", "Rams",
team == "MIA", "Dolphins",
team == "MIN", "Vikings",
team == "NE" ,"Patriots",
team == "NO" ,"Saints",
team == "NYG", "Giants",
team == "NYJ", "Jets",
team == "PHI", "Eagles",
team == "PIT", "Steelers",
team == "SF" ,"49ers",
team == "SEA", "Seahawks",
team == "TB" ,"Buccaneers",
team == "TEN", "Titans",
team == "WSH", NA_character_,
TRUE, NA_character_
)] %>%
as_tibble()
all_equal(dplyr_fcase, dt_fcase)
[1] TRUE
I'm going to preface this by also saying that this is a particularly straightforward problem to translate into a join, and it has 32 total comparisons. Most uses of case_when() or fcase() will involve more traditional and/or complex logic, and far fewer than 30 comparisons!
Now with that being said, you may still say: well, I'm not convinced the join method is any easier or faster for humans to code out, and we're all free to make our own decisions! I do want to note that case_when() gets memory-inefficient much faster than ???_join() does.
We can really highlight the differences in execution time and memory allocated via the {bench} package. I'm using bench to execute the two different methods three times each, comparing the timing and memory used, along with some other stats.
In the expandable section below we have a repeat of our above left_join() and case_when() calls.
join_expr <- function(){
left_join(ex_df, team_join, by = "team")
}
case_when_expr <- function(){
ex_df %>%
mutate(
team_name = case_when(
team == "ARI" ~ "Cardinals",
team == "ATL" ~ "Falcons",
team == "BAL" ~ "Ravens",
team == "BUF" ~ "Bills",
team == "CAR" ~ "Panthers",
team == "CHI" ~ "Bears",
team == "CIN" ~ "Bengals",
team == "CLE" ~ "Browns",
team == "DAL" ~ "Cowboys",
team == "DEN" ~ "Broncos",
team == "DET" ~ "Lions",
team == "GB" ~ "Packers",
team == "HOU" ~ "Texans",
team == "IND" ~ "Colts",
team == "JAX" ~ "Jaguars",
team == "KC" ~ "Chiefs",
team == "LV" ~ "Raiders",
team == "LAC" ~ "Chargers",
team == "LAR" ~ "Rams",
team == "MIA" ~ "Dolphins",
team == "MIN" ~ "Vikings",
team == "NE" ~ "Patriots",
team == "NO" ~ "Saints",
team == "NYG" ~ "Giants",
team == "NYJ" ~ "Jets",
team == "PHI" ~ "Eagles",
team == "PIT" ~ "Steelers",
team == "SF" ~ "49ers",
team == "SEA" ~ "Seahawks",
team == "TB" ~ "Buccaneers",
team == "TEN" ~ "Titans",
team == "WSH" ~ NA_character_,
TRUE ~ NA_character_
)
)
}
We can then compare their execution multiple times with the bench package. This will vary by execution, but with 3 iterations and 100,000 rows, I have seen about a 10x speed improvement in left_join vs case_when. Note that in most cases this is still pretty much instantaneous in "human time".
However, do note that we use about 28-30 times more memory for the case_when() statement.
bench::mark(
min_time = 0.1,
max_iterations = 3,
min_iterations = 3,
case_when_expr(),
join_expr()
) %>%
select(expression, min, median, mem_alloc, n_itr)
# A tibble: 2 x 4
expression min median mem_alloc
<bch:expr> <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr() 154.93ms 217.7ms 150.81MB
2 join_expr() 9.79ms 13.7ms 5.34MB
The ~150 MB of memory used may not seem like a lot with a smaller dataset (100,000 rows), but if we bump this up to 1,000,000 rows we see a similar ~10x increase in memory. Specifically, we are now up to ~1.5 GB of memory used for case_when(). So if you're going beyond millions or tens of millions of rows AND are trying to do a LOT of comparisons, it's probably a good idea to start refactoring into a join if possible!
# 1,000,000 rows as 1e6
sample_size <- 1e6
set.seed(829)
ex_df <- tibble(
team = sample(all_teams$team_short_name, size = sample_size, replace = TRUE),
stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)
bench::mark(
min_time = 0.1,
max_iterations = 3,
min_iterations = 3,
case_when_expr(),
join_expr()
) %>%
select(expression, min, median, mem_alloc, n_itr)
# A tibble: 2 x 4
expression min median mem_alloc
<bch:expr> <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr() 1.3s 1.36s 1.47GB
2 join_expr() 71.3ms 74.53ms 53.41MB
There are situations where a join doesn't solve the problem, so we can go one step further and add in our data.table::fcase() or the dtplyr translation of dplyr code into data.table. All the example reprex code is in an expandable chunk below.
library(dtplyr)
case_when_dplyr <- function(){
ex_df %>%
mutate(
team_name = case_when(
team == "ARI" ~ "Cardinals",
team == "ATL" ~ "Falcons",
team == "BAL" ~ "Ravens",
team == "BUF" ~ "Bills",
team == "CAR" ~ "Panthers",
team == "CHI" ~ "Bears",
team == "CIN" ~ "Bengals",
team == "CLE" ~ "Browns",
team == "DAL" ~ "Cowboys",
team == "DEN" ~ "Broncos",
team == "DET" ~ "Lions",
team == "GB" ~ "Packers",
team == "HOU" ~ "Texans",
team == "IND" ~ "Colts",
team == "JAX" ~ "Jaguars",
team == "KC" ~ "Chiefs",
team == "LV" ~ "Raiders",
team == "LAC" ~ "Chargers",
team == "LAR" ~ "Rams",
team == "MIA" ~ "Dolphins",
team == "MIN" ~ "Vikings",
team == "NE" ~ "Patriots",
team == "NO" ~ "Saints",
team == "NYG" ~ "Giants",
team == "NYJ" ~ "Jets",
team == "PHI" ~ "Eagles",
team == "PIT" ~ "Steelers",
team == "SF" ~ "49ers",
team == "SEA" ~ "Seahawks",
team == "TB" ~ "Buccaneers",
team == "TEN" ~ "Titans",
team == "WSH" ~ NA_character_,
TRUE ~ NA_character_
)
)
}
join_dplyr <- function(){
left_join(ex_df, team_join, by = "team")
}
join_dtplyr <- function(){
dt_ex_df <- lazy_dt(ex_df)
team_join <- lazy_dt(team_join)
dt_ex_df %>%
left_join(team_join, by = "team") %>%
select(team, stat, team_name) %>%
as_tibble()
}
case_when_dtplyr <- function(){
lazy_dt(ex_df) %>%
mutate(
team_name = case_when(
team == "ARI" ~ "Cardinals",
team == "ATL" ~ "Falcons",
team == "BAL" ~ "Ravens",
team == "BUF" ~ "Bills",
team == "CAR" ~ "Panthers",
team == "CHI" ~ "Bears",
team == "CIN" ~ "Bengals",
team == "CLE" ~ "Browns",
team == "DAL" ~ "Cowboys",
team == "DEN" ~ "Broncos",
team == "DET" ~ "Lions",
team == "GB" ~ "Packers",
team == "HOU" ~ "Texans",
team == "IND" ~ "Colts",
team == "JAX" ~ "Jaguars",
team == "KC" ~ "Chiefs",
team == "LV" ~ "Raiders",
team == "LAC" ~ "Chargers",
team == "LAR" ~ "Rams",
team == "MIA" ~ "Dolphins",
team == "MIN" ~ "Vikings",
team == "NE" ~ "Patriots",
team == "NO" ~ "Saints",
team == "NYG" ~ "Giants",
team == "NYJ" ~ "Jets",
team == "PHI" ~ "Eagles",
team == "PIT" ~ "Steelers",
team == "SF" ~ "49ers",
team == "SEA" ~ "Seahawks",
team == "TB" ~ "Buccaneers",
team == "TEN" ~ "Titans",
team == "WSH" ~ NA_character_,
TRUE ~ NA_character_
)
) %>%
as_tibble()
}
fcase_dplyr <- function(){
ex_df %>%
mutate(
team_name = fcase(
team == "ARI", "Cardinals",
team == "ATL", "Falcons",
team == "BAL", "Ravens",
team == "BUF", "Bills",
team == "CAR", "Panthers",
team == "CHI", "Bears",
team == "CIN", "Bengals",
team == "CLE", "Browns",
team == "DAL", "Cowboys",
team == "DEN", "Broncos",
team == "DET", "Lions",
team == "GB" ,"Packers",
team == "HOU", "Texans",
team == "IND", "Colts",
team == "JAX", "Jaguars",
team == "KC" ,"Chiefs",
team == "LV" ,"Raiders",
team == "LAC", "Chargers",
team == "LAR", "Rams",
team == "MIA", "Dolphins",
team == "MIN", "Vikings",
team == "NE" ,"Patriots",
team == "NO" ,"Saints",
team == "NYG", "Giants",
team == "NYJ", "Jets",
team == "PHI", "Eagles",
team == "PIT", "Steelers",
team == "SF" ,"49ers",
team == "SEA", "Seahawks",
team == "TB" ,"Buccaneers",
team == "TEN", "Titans",
team == "WSH", NA_character_
)
)
}
fcase_dt_native <- function(){
data.table(ex_df)[, team_name := fcase(
team == "ARI", "Cardinals",
team == "ATL", "Falcons",
team == "BAL", "Ravens",
team == "BUF", "Bills",
team == "CAR", "Panthers",
team == "CHI", "Bears",
team == "CIN", "Bengals",
team == "CLE", "Browns",
team == "DAL", "Cowboys",
team == "DEN", "Broncos",
team == "DET", "Lions",
team == "GB" ,"Packers",
team == "HOU", "Texans",
team == "IND", "Colts",
team == "JAX", "Jaguars",
team == "KC" ,"Chiefs",
team == "LV" ,"Raiders",
team == "LAC", "Chargers",
team == "LAR", "Rams",
team == "MIA", "Dolphins",
team == "MIN", "Vikings",
team == "NE" ,"Patriots",
team == "NO" ,"Saints",
team == "NYG", "Giants",
team == "NYJ", "Jets",
team == "PHI", "Eagles",
team == "PIT", "Steelers",
team == "SF" ,"49ers",
team == "SEA", "Seahawks",
team == "TB" ,"Buccaneers",
team == "TEN", "Titans",
team == "WSH", NA_character_
)] %>%
as_tibble()
}
join_dt_native <- function(){
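# data.table "update join": for each row of ex_df, look up the matching
# team in team_join and add its team_name column by reference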
data.table(ex_df)[data.table(team_join), on = .(team), team_name := team_name] %>%
as_tibble()
}
Finally, we can check the timing/memory usage for all of the combos. In short: the join methods are fastest and use the least memory; fcase(), whether in native data.table or in dplyr, is a bit slower and uses a bit more memory than a join but is still ~5x faster/more memory-efficient than case_when(); and case_when() is the slowest/most memory-hungry (but translates into SQL if needed).
Overall, I’m really happy that we have the ability to mix and match functions from various awesome packages depending on the problems we’re trying to solve!
bench::mark(
min_time = 0.1,
max_iterations = 3,
min_iterations = 3,
case_when_dplyr(),
case_when_dtplyr(),
fcase_dplyr(),
fcase_dt_native(),
join_dplyr(),
join_dt_native(),
check = FALSE
) %>%
select(expression, min, median, mem_alloc, n_itr)
# A tibble: 6 x 4
expression min median mem_alloc
<bch:expr> <bch:tm> <bch:tm> <bch:byt>
1 case_when_dplyr() 1.46s 1.49s 1.47GB
2 case_when_dtplyr() 1.49s 1.54s 1.51GB
3 fcase_dplyr() 320.37ms 324.42ms 133.75MB
4 fcase_dt_native() 310.85ms 313.07ms 164.42MB
5 join_dplyr() 76.48ms 80.98ms 53.41MB
6 join_dt_native() 76.21ms 87.23ms 61.63MB