Spoiler - joins are fastest, and can be easy to prep!
tidyverse
data cleaning
joins
Author
Thomas Mock
Published
February 13, 2021
Matching names
There are many times when you have a reference dataset that includes some type of short code, abbreviation, nickname, or even misspelling. In these situations we want to create a new column, or even overwrite the existing column, with the correct spelling, the full name, etc.
While there isn’t a wrong way to do this, I’ve made the mistake in the past of making things more complicated for myself than necessary during data cleaning/prep. What matters most to me in my data cleaning steps is “speed of thought” - aka how long it takes me to type out or reason through a solution - but there are also tradeoffs in memory usage and in execution speed.
I’m going to step through some strategies, honed by years of TidyTuesday data cleaning and prep!
In general I want to preface all these examples with the fact that this exercise is easily translatable to various join options. Joins are great, but there are situations where the logical conditions aren’t simple matches (like x == "val") but rather more complex queries (like x > 5 & var %in% c("val", "val2")), but I’m going to live within the constraints of my example use case. Also note that joins can match on more than one key and add as many variables as needed rather than just one, so there’s additional complexity possible here.
To avoid burying the lede: joins are by far the fastest and most memory-efficient method if you can fit your problem to that solution; data.table::fcase() provides an easy-to-use and remarkably fast/memory-efficient intermediate method that works in base R, data.table, or dplyr; and dplyr::case_when() is the slowest and most memory-hungry for in-memory cases, BUT is probably fast enough for relatively small data (< 1,000,000 rows) and is the option that allows pushing compute into a remote backend via dbplyr/sparklyr.
Example Data
We’re going to simulate a dataset with NFL team abbreviations and a “stat” that is roughly on the scale of something like EPA, but again this is all simulated data.
library(tidyverse)

# relatively nice sized dataset (100,000 rows)
sample_size <- 100000

# Get example team names from espn
all_teams <- espnscrapeR::get_nfl_teams()
Getting NFL teams!
# set a reproducible seed (h/t to Anthony's wife's new $4 luggage...)
set.seed(829)

ex_df <- tibble(
  team = sample(all_teams$team_abb, size = sample_size, replace = TRUE),
  stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)

ex_df
# A tibble: 100,000 × 2
team stat
<chr> <dbl>
1 HOU 0.0593
2 JAX -0.0534
3 DEN -0.0488
4 NO -0.0527
5 CAR 0.111
6 CHI 0.132
7 CAR 0.0481
8 IND 0.412
9 HOU 0.0425
10 ARI 0.200
# … with 99,990 more rows
dplyr::case_when() solution
There’s a fantastic function in the dplyr package called case_when().
This function allows you to vectorise multiple if_else() statements.
In essence, it provides the ability to supply many logical statements that each generate a specific outcome. You can read more about the syntax via ?dplyr::case_when() in the R console or the docs. This function works on vectors as well as in mutations inside data.frames/tibbles/data.tables. Lastly, it can also be translated into SQL automatically via dbplyr in situations where you don’t want to bring all the data into memory.
case_when is very readable to me, and feels better than a whole bundle of nested ifelse/if_else statements.
ex_car <- mpg %>%
  select(model, cyl, cty, hwy) %>%
  slice_sample(n = 100)

ex_car %>%
  mutate(
    cyl_character = case_when(
      # if the logical statement is TRUE then assign the new value
      cyl == 4 ~ "Four",
      cyl == 6 ~ "Six",
      cyl == 8 ~ "Eight",
      # This last statement means: if none of the above conditions are met,
      # then return an NA value. Note that case_when is type-safe,
      # so you'll want to use the right NA_??? for the type of output
      TRUE ~ NA_character_
    )
  )
# A tibble: 100 × 5
model cyl cty hwy cyl_character
<chr> <int> <int> <int> <chr>
1 a4 4 20 31 Four
2 malibu 4 22 30 Four
3 ram 1500 pickup 4wd 8 9 12 Eight
4 jetta 5 21 29 <NA>
5 passat 4 18 29 Four
6 camry solara 6 18 26 Six
7 f150 pickup 4wd 8 13 16 Eight
8 a4 quattro 6 17 25 Six
9 grand cherokee 4wd 8 14 19 Eight
10 jetta 4 22 29 Four
# … with 90 more rows
Now that we understand the syntax a bit better, let’s try it on our dataset of interest! Remember, we want to take our team abbreviations and add the team names into the dataset. So we’ll need to provide a logical match of each team to the correct abbreviation and output the corresponding full name.
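The full 32-branch case_when() call isn’t reproduced here; a shortened sketch of the pattern (only the first few teams shown - the elided branches follow the same shape - timed with tictoc) looks like:

```r
library(tictoc)

tic()
ex_df %>%
  mutate(
    team_name = case_when(
      # one branch per team abbreviation
      team == "ARI" ~ "Cardinals",
      team == "ATL" ~ "Falcons",
      team == "BAL" ~ "Ravens",
      # ...one branch per remaining abbreviation...
      # fall back to NA for anything unmatched
      TRUE ~ NA_character_
    )
  )
toc()
```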
This works out very well and we got all the names in the right spot, and with the basic timing via tictoc we can see it was still fast. However, there’s a LOT of typing, and more importantly a lot of repetitive typing to get this done.
# A tibble: 100,000 × 3
team stat team_name
<chr> <dbl> <chr>
1 HOU 0.0593 Texans
2 JAX -0.0534 Jaguars
3 DEN -0.0488 Broncos
4 NO -0.0527 Saints
5 CAR 0.111 Panthers
6 CHI 0.132 Bears
7 CAR 0.0481 Panthers
8 IND 0.412 Colts
9 HOU 0.0425 Texans
10 ARI 0.200 Cardinals
# … with 99,990 more rows
We can use a few tricks in RStudio to make this easier!
We’ll use “multiple cursors” in RStudio and combine that with the amazing {datapasta} package from Miles McBain.
datapasta is about reducing resistance associated with copying and pasting data to and from R. It is a response to the realisation that I often found myself using intermediate programs like Sublime to munge text into suitable formats. Addins and functions in datapasta support a wide variety of input and output situations, so it (probably) “just works”. Hopefully tools in this package will remove such intermediate steps and associated frustrations from our data slinging workflows.
datapasta is an absolute life saver for converting objects like vectors/dataframes/clipboard contents and turning them into editable formats in a script.
Step 1: datapasta
We’ll first create a character vector via datapasta, using the below code:
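The original chunk isn’t reproduced here, but one plausible version - assuming the abbreviations live in all_teams$team_abb - uses datapasta’s vector_paste_vertical(), which writes an editable, vertically formatted character vector definition straight into the open script:

```r
# paste an editable character vector of team abbreviations into the script,
# one element per line, ready to be turned into case_when() branches
datapasta::vector_paste_vertical(sort(unique(all_teams$team_abb)))
```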
I’ve shown a video of me stepping through this code since there are a lot of different options for datapasta and its outputs.
This gives us a nice starting point of all the vector items and we are ready to work with them in our script.
Step 2: Multiple line selection in RStudio
I’ve done this as a video as it’s fairly complex to “write about”, but very easy to show! The video describes the process of enabling multiple selections in RStudio.
Note that the enabling multi-line selection is done in RStudio via:
Opt key + Left Mouse Button drag on Mac
Alt key + Left Mouse Button drag on Windows
Once we have this “skeleton” quickly created we can then add the individual “match outputs”, like “Cardinals” for “ARI”, “Falcons” for “ATL”, etc., but we’ve saved ourselves the need to type out a lot of the repeated “skeleton” of the case_when().
A join solution
While the above method is pretty quick to create and understand, we still had to type quite a bit (which is decreased if you use multiple cursors), and in many situations a true join is more efficient in terms of typing, “speed of thought”, and execution time.
team_join <- all_teams %>%
  select(team_name, team = team_abb)

tic()
left_join(ex_df, team_join, by = "team")
# A tibble: 100,000 × 3
team stat team_name
<chr> <dbl> <chr>
1 HOU 0.0593 Texans
2 JAX -0.0534 Jaguars
3 DEN -0.0488 Broncos
4 NO -0.0527 Saints
5 CAR 0.111 Panthers
6 CHI 0.132 Bears
7 CAR 0.0481 Panthers
8 IND 0.412 Colts
9 HOU 0.0425 Texans
10 ARI 0.200 Cardinals
# … with 99,990 more rows
Now you may say, well of course I could just use a join if I have a perfect dataset, but what do I do during data cleaning?! I would say that a while ago this was a revelation for me, as I typically only moved to key-value pairs + joins when it was obvious, and sometimes I’d end up using a case_when() solution when a join was easily possible.
So, keep in mind that we can often “create” a joining dataset and use it pretty easily!
tribble
We can use tribble() with almost the exact same setup as the case_when(). We can even use the same multiple-cursor workflow to alter our existing case_when() code and turn it into the below tribble() setup.
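As a sketch (abbreviated to the first few rows; the remaining 29 teams follow the same two-column pattern):

```r
# row-wise tibble: each line reads as one abbreviation/name pair,
# which makes it easy to sanity check as you go
team_join <- tribble(
  ~team, ~team_name,
  "ARI", "Cardinals",
  "ATL", "Falcons",
  "BAL", "Ravens",
  # ...one row per remaining team...
)

team_join
```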
# A tibble: 32 × 2
team team_name
<chr> <chr>
1 ARI Cardinals
2 ATL Falcons
3 BAL Ravens
4 BUF Bills
5 CAR Panthers
6 CHI Bears
7 CIN Bengals
8 CLE Browns
9 DAL Cowboys
10 DEN Broncos
# … with 22 more rows
You could also just create a more traditional data.frame/tibble, and do the same thing without having to “align” all the rows visually. I do find it a bit easier to look at it row-by-row in the tribble so that I can sanity check myself as I go, but up to the reader to decide!
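An equivalent tibble() version, again sketched with just the first few teams, pairs two parallel vectors instead of aligning rows:

```r
# column-wise construction: the nth abbreviation must line up
# with the nth team name
team_join <- tibble(
  team      = c("ARI", "ATL", "BAL"),              # ...remaining abbreviations
  team_name = c("Cardinals", "Falcons", "Ravens")  # ...remaining names
)
```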
# A tibble: 32 × 2
team team_name
<chr> <chr>
1 ARI Cardinals
2 ATL Falcons
3 BAL Ravens
4 BUF Bills
5 CAR Panthers
6 CHI Bears
7 CIN Bengals
8 CLE Browns
9 DAL Cowboys
10 DEN Broncos
# … with 22 more rows
data.table::fcase
Now there is also an equivalent function to case_when() in data.table called fcase (short for fast case when), available as of late 2019. Since it operates on any vector, you can use it as a drop-in replacement for most in-memory uses of case_when() in dplyr, or use it on native data.table objects. The expectation is that it will produce the same output but be more efficient with regards to memory/time.
data.table actually provides all sorts of cool optimized functions like this that work anywhere, and you can generally assume an “f” in front of an existing base R function name indicates the “faster” version of it. See fifelse, rleid, fsetdiff for example.
Note, you do not have the ability to push the translation of fcase() into SQL though, as dbplyr only has SQL-translation for case_when().
The syntax of fcase() is essentially the same as case_when(), but rather than a formula syntax for assignment via ~, it relies on another comma to assign by position.
Compare the syntax below:
library(data.table)

mtcars %>%
  tibble() %>%
  select(mpg, cyl) %>%
  mutate(
    cyl_case_when = case_when(
      # relies on `~` to indicate assignment
      # read as: if cyl is equal to 4, assign "Four"
      cyl == 4 ~ "Four",
      cyl == 6 ~ "Six",
      cyl == 8 ~ "Eight",
      TRUE ~ NA_character_
    ),
    cyl_fcase = fcase(
      # relies on another `,` to assign by position
      # read as: if cyl is equal to 4, "Four"
      cyl == 4, "Four",
      cyl == 6, "Six",
      cyl == 8, "Eight",
      TRUE, NA_character_
    ),
    cyl_match = cyl_case_when == cyl_fcase
  )
# A tibble: 32 × 5
mpg cyl cyl_case_when cyl_fcase cyl_match
<dbl> <dbl> <chr> <chr> <lgl>
1 21 6 Six Six TRUE
2 21 6 Six Six TRUE
3 22.8 4 Four Four TRUE
4 21.4 6 Six Six TRUE
5 18.7 8 Eight Eight TRUE
6 18.1 6 Six Six TRUE
7 14.3 8 Eight Eight TRUE
8 24.4 4 Four Four TRUE
9 22.8 4 Four Four TRUE
10 19.2 6 Six Six TRUE
# … with 22 more rows
Now, I have converted our example with the case_when() statement into fcase() for both dplyr and data.table, but have hidden the code in the detail box below since it’s almost identical and would be quite repetitive. I have printed a test to see if the outcomes are equivalent though between native data.table and dplyr, and they are! 🥳
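That hidden chunk isn’t reproduced here, but a shortened sketch of the native data.table version (abbreviated to the first few teams; fcase()’s default argument covers unmatched rows) looks like:

```r
library(data.table)

ex_dt <- as.data.table(ex_df)

# assign team_name by reference with condition/value pairs
ex_dt[, team_name := fcase(
  team == "ARI", "Cardinals",
  team == "ATL", "Falcons",
  # ...one condition/value pair per remaining team...
  default = NA_character_
)]
```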
Really happy to have the option of using some various methods in-memory via fcase() and/or case_when() and optionally if working with REALLY large data push some of that compute into the database via dbplyr + case_when().
I’m going to preface this by also saying that this is a particularly straightforward problem to translate into a join, and it has 32 total comparisons. Most uses of case_when() or fcase() will involve more traditional and/or complex logic with far fewer than 30 comparisons!
Now with that being said, you may still say, well, I’m not convinced the join method is any easier or faster to code out for humans, and we’re all free to make our own decisions! I do want to note that case_when() gets memory-inefficient much faster than ???_join().
We can really highlight the differences in execution time and memory allocated via the bench package. I’m using bench to execute the two different methods 3 times each, comparing the timing and the memory used, along with some other stats.
In the expandable section below we have a repeat of our above left_join() and case_when() calls.
We can then compare their execution multiple times with the bench package. This will vary by the execution, but with 3 iterations and 100,000 rows, I have seen about a 10x speed improvement in left_join vs case_when. Note that in most cases this is still pretty much instantaneous in “human time”.
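The hidden chunk isn’t reproduced here; a sketch of what it plausibly wraps (the branch list abbreviated, team_join as defined earlier) and the bench comparison:

```r
# wrap each approach in a zero-argument function so bench can call it repeatedly
case_when_expr <- function() {
  ex_df %>%
    mutate(
      team_name = case_when(
        team == "ARI" ~ "Cardinals",
        team == "ATL" ~ "Falcons",
        # ...one branch per remaining team...
        TRUE ~ NA_character_
      )
    )
}

join_expr <- function() {
  left_join(ex_df, team_join, by = "team")
}

bench::mark(
  min_iterations = 3,
  max_iterations = 3,
  case_when_expr(),
  join_expr()
) %>%
  select(expression, min, median, mem_alloc)
```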
Visual reaction time in young adults is in the range of about 250 ms per Jain et al.
However, do note that we use about 28-30 times more memory for the case_when() statement.
# A tibble: 2 × 4
expression min median mem_alloc
<bch:expr> <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr() 66.02ms 99.43ms 150.82MB
2 join_expr() 5.07ms 5.15ms 5.37MB
The ~150 MB of memory used may not seem like a lot with smaller datasets (100,000 rows), but if we bump this up to 1,000,000 rows we see a similar ~10x increase in memory. Specifically, we are now up to ~1.5 GB of memory used for case_when(). So if you’re going beyond millions or tens of millions of rows AND are trying to do a LOT of comparisons, it’s probably a good idea to start refactoring into a join if possible!
# 1,000,000 rows as 1e6
sample_size <- 1e6
set.seed(829)

ex_df <- tibble(
  team = sample(all_teams$team_abb, size = sample_size, replace = TRUE),
  stat = rnorm(sample_size, mean = 0.1, sd = 0.1)
)

bench::mark(
  min_time = 0.1,
  max_iterations = 3,
  min_iterations = 3,
  case_when_expr(),
  join_expr()
) %>%
  select(expression, min, median, mem_alloc, n_itr)
# A tibble: 2 × 4
expression min median mem_alloc
<bch:expr> <bch:tm> <bch:tm> <bch:byt>
1 case_when_expr() 595.8ms 636.3ms 1.47GB
2 join_expr() 42.7ms 45.9ms 53.43MB
There are situations where a join doesn’t solve the problem, so we can go one step further and add in our data.table::fcase() or dtplyr-translation of dplyr into data.table. All the example reprex code is in an expandable chunk below.
Finally we can check the timing/memory usage for all of the combos. In short, the join methods are fastest and use the least memory; fcase(), whether in native data.table or dplyr, is a bit slower and uses a bit more memory than a join but is still ~5x faster and more memory-efficient than case_when(); and case_when() is the slowest/most memory-hungry (but translates into SQL if needed).
Overall, I’m really happy that we have the ability to mix and match functions from various awesome packages depending on the problems we’re trying to solve!