data.table::rleid() is pretty cool!

Author

Tom Mock

Published

November 27, 2022

Longer example on QB Starts

Create a dataframe

library(tidyverse)
library(magrittr)
library(data.table)

# create a df of "streaks" or repeats
ex_df <- tibble(
  x = c("a", "a", rep("b", 3), rep("a", 5)),
  num = 1:10
) 

# print the data
ex_df
# A tibble: 10 × 2
   x       num
   <chr> <int>
 1 a         1
 2 a         2
 3 b         3
 4 b         4
 5 b         5
 6 a         6
 7 a         7
 8 a         8
 9 a         9
10 a        10

Example of rle or run-length encoding

# rle, or run-length encoding,
# summarizes a vector into the length of each run
# and the value that is repeated in that run
# technically this is a form of lossless data compression,
# i.e. you store fewer elements, but they fully describe the longer
# vector, which can be recreated from them

# this can be read as the letters a, b, a
# where the first a is repeated 2x
# the b is repeated 3x
# the next a is repeated 5x
rle(ex_df$x)
Run Length Encoding
  lengths: int [1:3] 2 3 5
  values : chr [1:3] "a" "b" "a"
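
To make the "can be recreated" claim concrete, base R ships `inverse.rle()`, which expands an `rle` object back into the full vector. A minimal sketch, assuming a standalone vector `x` that stands in for `ex_df$x` so the snippet runs on its own:

```r
# stand-in for ex_df$x (same values as the example data)
x <- c("a", "a", rep("b", 3), rep("a", 5))

# encode: 3 runs describe all 10 elements
encoded <- rle(x)

# decode: inverse.rle() repeats each value by its run length
decoded <- inverse.rle(encoded)

# the round trip returns the original vector
identical(decoded, x)
```

The encoded form only pays off when the vector has long runs; a vector with no repeats gains nothing from it.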

Example of rleid

# rleid() generates an id for each run of repeated values
# and returns a vector of the same length as the original vector

ex_df$x
 [1] "a" "a" "b" "b" "b" "a" "a" "a" "a" "a"
data.table::rleid(ex_df$x)
 [1] 1 1 2 2 2 3 3 3 3 3
# it can be used on a vector, in a data.frame, in a data.table or a tibble
# note that it can be used within mutate() since it returns
# a vector of equal length, i.e. the number of rows is not changed
ex_df %>% 
  mutate(rleid = data.table::rleid(x))
# A tibble: 10 × 3
   x       num rleid
   <chr> <int> <int>
 1 a         1     1
 2 a         2     1
 3 b         3     2
 4 b         4     2
 5 b         5     2
 6 a         6     3
 7 a         7     3
 8 a         8     3
 9 a         9     3
10 a        10     3
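
The link between `rle()` and `rleid()` can be sketched in base R: number the runs, then repeat each number by its run length. The helper name `rleid_base()` below is hypothetical and only for illustration; it is not part of data.table. Note that `rle()` treats every `NA` as its own run, so results on vectors with missing values may not match `data.table::rleid()`.

```r
# hypothetical helper: run ids built from base R's rle()
rleid_base <- function(v) {
  runs <- rle(v)
  # repeat each run's index once per element in that run
  rep(seq_along(runs$lengths), times = runs$lengths)
}

# stand-in for ex_df$x (same values as the example data)
x <- c("a", "a", rep("b", 3), rep("a", 5))
ids <- rleid_base(x)
ids
```

For the example vector this gives the ids 1 1 2 2 2 3 3 3 3 3, matching the `rleid()` column shown above.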

rle is a summary function

# note that rle() is a _summary_ function, and generates fewer rows
ex_df %>% 
  summarize(lengths = rle(x)$lengths,
            values = rle(x)$values)
# A tibble: 3 × 2
  lengths values
    <int> <chr> 
1       2 a     
2       3 b     
3       5 a     
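
In released dplyr 1.1.0 and later, returning more than one row per group from `summarize()` is deprecated in favor of `reframe()`. A sketch of the same summary under that assumption (it will not run on older dplyr versions, which lack `reframe()`):

```r
library(dplyr)   # assumes dplyr >= 1.1.0, where reframe() exists
library(tibble)

# same example data as above
ex_df <- tibble(
  x = c("a", "a", rep("b", 3), rep("a", 5)),
  num = 1:10
)

# reframe() may return any number of rows, here one row per run
run_summary <- ex_df %>%
  reframe(
    lengths = rle(x)$lengths,
    values = rle(x)$values
  )

run_summary
```

The result should have the same three rows as the `summarize()` call above; only the verb differs.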

Recover the original data

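# we can create a summary
# and then recover the original data
# %T>% is magrittr's tee pipe: it prints the intermediate summary
# and passes that summary along unchanged to the next step
final_df <- ex_df %>%
  summarize(
    lengths = rle(x)$lengths,
    values = rle(x)$values
  ) %T>% print() %>%
  summarize(
    x = rep(values, times = lengths),
    num = seq_len(sum(lengths))
  )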
# A tibble: 3 × 2
  lengths values
    <int> <chr> 
1       2 a     
2       3 b     
3       5 a     
final_df
# A tibble: 10 × 2
   x       num
   <chr> <int>
 1 a         1
 2 a         2
 3 b         3
 4 b         4
 5 b         5
 6 a         6
 7 a         7
 8 a         8
 9 a         9
10 a        10
# original and recreation are identical
all.equal(final_df, ex_df)
[1] TRUE
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.0 (2022-04-22)
 os       macOS Monterey 12.6
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2022-11-27
 pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
 quarto   1.2.269 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version     date (UTC) lib source
 data.table  * 1.14.3      2022-05-09 [1] Github (Rdatatable/data.table@e9a323d)
 dplyr       * 1.0.99.9000 2022-11-18 [1] Github (tidyverse/dplyr@0a55cf5)
 forcats     * 0.5.1       2021-01-27 [1] CRAN (R 4.2.0)
 ggplot2     * 3.4.0       2022-11-04 [1] CRAN (R 4.2.0)
 lubridate   * 1.8.0       2021-10-07 [1] CRAN (R 4.2.0)
 magrittr    * 2.0.3       2022-03-30 [1] CRAN (R 4.2.0)
 purrr       * 0.3.5       2022-10-06 [1] CRAN (R 4.2.0)
 readr       * 2.1.3       2022-10-01 [1] CRAN (R 4.2.0)
 sessioninfo * 1.2.2       2021-12-06 [1] CRAN (R 4.2.0)
 stringr     * 1.4.1       2022-08-20 [1] CRAN (R 4.2.0)
 tibble      * 3.1.8       2022-07-22 [1] CRAN (R 4.2.0)
 tidyr       * 1.2.1       2022-09-08 [1] CRAN (R 4.2.0)
 tidyverse   * 1.3.2.9000  2022-08-16 [1] Github (tidyverse/tidyverse@3be8283)

 [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────