Superseding dplyr's suffixed variants

Mark Andrews. October 27, 2020

The dplyr package describes itself as a “grammar of data manipulation”, and its main features are its “verbs”, such as select, filter, mutate, and so on. Prior to version 1.0 of dplyr, the verbs had variants with the suffixes _if, _at, and _all, which extended the basic functionality of each verb. As of 1.0, however, these suffixed variants have been superseded by another set of functions, often used in combination, including across, where, rename_with, if_any, if_all, and so on (if_any and if_all arrived slightly later, in dplyr 1.0.4). Why the developers of dplyr chose to make these changes is described here. In practice, this means that if you are already using dplyr, you can continue using the suffixed variants, but across, where, etc. are now the recommended way to accomplish what was previously accomplished with them. If, on the other hand, you are new to dplyr, it is not worthwhile learning _if, _at, and _all, other than to know that they were the previous way of doing things that are now done with the new functions.

In this post, I explain how we can use across, where, rename_with, etc., to perform actions that were previously accomplished with the suffixed variants of dplyr verbs. We will also cover a related function, c_across, which is used for rowwise operations on data frames that were previously very awkward to accomplish in other ways. While I briefly summarize the main dplyr verbs in this post, I will assume that readers are already familiar, to at least a minimal extent, with these verbs.

Throughout, we will use the data frame blp_df (the csv file of which is here):

library(tidyverse)
blp_df <- read_csv("data/blp-trials-short.txt")
blp_df

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

Replacing `select_if` with `select` and `where`

The select verb allows us to select columns in a variety of ways, such as by name, by position, or by the characters their names begin or end with. For example, here we select the columns whose names start with the string rt:

select(blp_df, starts_with('rt'))

## # A tibble: 1,000 x 2
##       rt rt.raw
##    <dbl>  <dbl>
##  1   977    977
##  2   565    565
##  3   562    562
##  4   572    572
##  5   659    659
##  6   538    538
##  7   626    626
##  8   566    566
##  9   922    922
## 10   555    555
## # … with 990 more rows

We can also select columns according to the properties of the values of the variable. For example, we can select those variables whose values are character strings, or numeric, or logical, etc. Previously, this was accomplished using select_if, whereby we selected a column if its values meet certain conditions. For example, to select the columns that are character vectors, we previously did the following:

select_if(blp_df, is.character)

## # A tibble: 1,000 x 3
##    lex   spell    resp 
##    <chr> <chr>    <chr>
##  1 N     staud    N    
##  2 N     dinbuss  N    
##  3 N     snilling N    
##  4 N     gancens  N    
##  5 W     filled   W    
##  6 W     journals W    
##  7 W     apache   W    
##  8 W     flake    W    
##  9 W     reliefs  W    
## 10 N     sarves   N    
## # … with 990 more rows

Now, we can accomplish this with where() as follows:

select(blp_df, where(is.character))

## # A tibble: 1,000 x 3
##    lex   spell    resp 
##    <chr> <chr>    <chr>
##  1 N     staud    N    
##  2 N     dinbuss  N    
##  3 N     snilling N    
##  4 N     gancens  N    
##  5 W     filled   W    
##  6 W     journals W    
##  7 W     apache   W    
##  8 W     flake    W    
##  9 W     reliefs  W    
## 10 N     sarves   N    
## # … with 990 more rows

When using where like this, we can also use custom functions like the following:

# return TRUE if `x` is numeric and its mean > 700.
has_high_mean <- function(x){
  is.numeric(x) && (mean(x, na.rm = TRUE) > 700)
}

select(blp_df, where(has_high_mean))

## # A tibble: 1,000 x 1
##    rt.raw
##     <dbl>
##  1    977
##  2    565
##  3    562
##  4    572
##  5    659
##  6    538
##  7    626
##  8    566
##  9    922
## 10    555
## # … with 990 more rows

As we can see, we select those columns for which the custom function returns TRUE.
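The order of the two conditions in has_high_mean matters. As a minimal sketch in base R, using made-up vectors rather than the columns of blp_df, the following illustrates that && short-circuits: when is.numeric(x) is FALSE, mean is never called, so character columns are rejected without a warning.

```r
# Same function as above: TRUE if `x` is numeric and its mean > 700.
has_high_mean <- function(x) {
  is.numeric(x) && (mean(x, na.rm = TRUE) > 700)
}

# Hypothetical example vectors (assumptions, not taken from blp_df).
words <- c("staud", "dinbuss")
slow  <- c(800, 900, NA)
fast  <- c(100, 200)

has_high_mean(words)  # FALSE: is.numeric() fails, so mean() is skipped
has_high_mean(slow)   # TRUE: the mean of 800 and 900 is 850
has_high_mean(fast)   # FALSE: the mean is 150
```

Had the two conditions been written in the opposite order, mean would be called on character vectors first and would issue a warning for each of them.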

Note that we can also use anonymous functions, such as purrr lambdas as described here, inside where. For example, we can accomplish the previous command with an anonymous function as follows:

select(blp_df, where(~is.numeric(.) && (mean(., na.rm = TRUE) > 700)))

## # A tibble: 1,000 x 1
##    rt.raw
##     <dbl>
##  1    977
##  2    565
##  3    562
##  4    572
##  5    659
##  6    538
##  7    626
##  8    566
##  9    922
## 10    555
## # … with 990 more rows
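Because where is an ordinary selection helper, it can be combined with the other helpers using the tidyselect operators & and !. The following is a small sketch of this on a made-up tibble, toy_df, which only mimics a few columns of blp_df and is an assumption introduced for illustration.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of blp_df.
toy_df <- tibble(
  participant = c(1, 2, 3),
  lex         = c("N", "W", "N"),
  rt          = c(977, 565, 562),
  rt.raw      = c(977, 565, 562)
)

# Columns that are numeric AND whose names start with "rt".
numeric_rt <- select(toy_df, where(is.numeric) & starts_with("rt"))
names(numeric_rt)

# Every column that is NOT a character vector.
non_character <- select(toy_df, !where(is.character))
names(non_character)
```

The first selection should keep rt and rt.raw only, and the second should drop lex.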

Replacing `rename_all`, `rename_at`, `rename_if` with `rename_with`

The rename verb allows us to rename columns. For example, if we want to rename participant to subject, we do the following:

rename(blp_df, subject = participant)

## # A tibble: 1,000 x 7
##    subject lex   spell    resp     rt prev.rt rt.raw
##      <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1      20 N     staud    N       977     511    977
##  2       9 N     dinbuss  N       565     765    565
##  3      47 N     snilling N       562     496    562
##  4     103 N     gancens  N       572     656    572
##  5      45 W     filled   W       659     981    659
##  6      73 W     journals W       538    1505    538
##  7      24 W     apache   W       626     546    626
##  8      11 W     flake    W       566     717    566
##  9      32 W     reliefs  W       922    1471    922
## 10      96 N     sarves   N       555     806    555
## # … with 990 more rows

We can also use a renaming function that can be applied to all names. For example, we could use a function to convert all names to uppercase, or a function that replaces any occurrence of a dot with an underscore. Previously, this was accomplished with rename_all.

rename_all(blp_df, toupper) # convert all names to uppercase

## # A tibble: 1,000 x 7
##    PARTICIPANT LEX   SPELL    RESP     RT PREV.RT RT.RAW
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

rename_all(blp_df, ~str_replace_all(., '\\.', '_')) # replace dots with underscores

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev_rt rt_raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

We can now accomplish this by using rename_with instead of rename_all.

rename_with(blp_df, toupper)

## # A tibble: 1,000 x 7
##    PARTICIPANT LEX   SPELL    RESP     RT PREV.RT RT.RAW
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

rename_with(blp_df, ~str_replace_all(., '\\.', '_'))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev_rt rt_raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

Admittedly, this looks as if we have just renamed the function rename_all to rename_with. However, rename_with is more versatile than rename_all, and replaces other suffixed variants of rename such as rename_at and rename_if. As an example, let us say we want to replace the string rt (reaction time) with time in the column names so that, for example, prev.rt is replaced with prev.time, etc. We could try rename_with as follows:

rename_with(blp_df,
            ~str_replace_all(., 'rt', 'time'))

## # A tibble: 1,000 x 7
##    patimeicipant lex   spell    resp   time prev.time time.raw
##            <dbl> <chr> <chr>    <chr> <dbl>     <dbl>    <dbl>
##  1            20 N     staud    N       977       511      977
##  2             9 N     dinbuss  N       565       765      565
##  3            47 N     snilling N       562       496      562
##  4           103 N     gancens  N       572       656      572
##  5            45 W     filled   W       659       981      659
##  6            73 W     journals W       538      1505      538
##  7            24 W     apache   W       626       546      626
##  8            11 W     flake    W       566       717      566
##  9            32 W     reliefs  W       922      1471      922
## 10            96 N     sarves   N       555       806      555
## # … with 990 more rows

Clearly, this is not ideal as it replaced the rt in participant too, leading to patimeicipant. In order to apply a renaming function to selected columns only, previously we would have used rename_at, as in the following example:

rename_at(blp_df,
          vars(matches('^rt|rt$')),
          ~str_replace_all(., 'rt', 'time'))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp   time prev.time time.raw
##          <dbl> <chr> <chr>    <chr> <dbl>     <dbl>    <dbl>
##  1          20 N     staud    N       977       511      977
##  2           9 N     dinbuss  N       565       765      565
##  3          47 N     snilling N       562       496      562
##  4         103 N     gancens  N       572       656      572
##  5          45 W     filled   W       659       981      659
##  6          73 W     journals W       538      1505      538
##  7          24 W     apache   W       626       546      626
##  8          11 W     flake    W       566       717      566
##  9          32 W     reliefs  W       922      1471      922
## 10          96 N     sarves   N       555       806      555
## # … with 990 more rows

Here, the second line, vars(matches('^rt|rt$')), selects those variables whose names match the regular expression ^rt|rt$, which matches only names that begin or end with rt. The rename_with function can accomplish this too, as follows:

rename_with(blp_df,
            ~str_replace_all(., 'rt', 'time'),
            matches('^rt|rt$'))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp   time prev.time time.raw
##          <dbl> <chr> <chr>    <chr> <dbl>     <dbl>    <dbl>
##  1          20 N     staud    N       977       511      977
##  2           9 N     dinbuss  N       565       765      565
##  3          47 N     snilling N       562       496      562
##  4         103 N     gancens  N       572       656      572
##  5          45 W     filled   W       659       981      659
##  6          73 W     journals W       538      1505      538
##  7          24 W     apache   W       626       546      626
##  8          11 W     flake    W       566       717      566
##  9          32 W     reliefs  W       922      1471      922
## 10          96 N     sarves   N       555       806      555
## # … with 990 more rows

As another example, consider renaming those columns whose values have certain characteristics. For example, we could rename the columns that are character vectors by converting their names to uppercase. Previously, we would have accomplished this with rename_if as follows:

rename_if(blp_df, is.character, toupper)

## # A tibble: 1,000 x 7
##    participant LEX   SPELL    RESP     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

Now, we can accomplish this with rename_with and where as follows:

rename_with(blp_df, toupper, where(is.character))

## # A tibble: 1,000 x 7
##    participant LEX   SPELL    RESP     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows
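The correspondence between the old and new renaming functions can be checked directly. The sketch below uses a made-up tibble, toy_df, and a helper named rt_to_time, both of which are assumptions introduced for illustration. It names the .cols argument of rename_with explicitly and compares the resulting column names with those from rename_at and rename_if.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of blp_df.
toy_df <- tibble(
  participant = c(20, 9),
  lex         = c("N", "W"),
  rt          = c(977, 565),
  prev.rt     = c(511, 765),
  rt.raw      = c(977, 565)
)

# Illustrative helper: replace the literal string "rt" with "time".
rt_to_time <- function(nm) gsub("rt", "time", nm, fixed = TRUE)

# Superseded and current ways of renaming selected columns.
old_at <- rename_at(toy_df, vars(matches("^rt|rt$")), rt_to_time)
new_at <- rename_with(toy_df, rt_to_time, .cols = matches("^rt|rt$"))

# Superseded and current ways of renaming columns by type.
old_if <- rename_if(toy_df, is.character, toupper)
new_if <- rename_with(toy_df, toupper, .cols = where(is.character))

identical(names(old_at), names(new_at))
identical(names(old_if), names(new_if))
names(new_at)
```

Note that the argument order differs: rename_at and rename_if take the column selection before the function, while rename_with takes the function first and the selection second.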

Replacing `filter_(all|if|at)` with `filter` and `if_any` or `if_all`

The filter verb will select rows of a data frame according to certain criteria. For example, the following gives us those observations where lex has the value of N, resp has the value of W, and rt.raw is less than or equal to 500:

filter(blp_df, lex == 'N', resp=='W', rt.raw <= 500)

## # A tibble: 5 x 7
##   participant lex   spell    resp     rt prev.rt rt.raw
##         <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
## 1          28 N     cown     W        NA     680    498
## 2          17 N     beeched  W        NA     450    469
## 3          29 N     conforn  W        NA     495    497
## 4          35 N     blear    W        NA     592    461
## 5          89 N     stumming W        NA     571    442

The filter_all variant of filter was previously used to filter those rows where a function returns true for any or all of the columns. For example, let’s say we want to obtain those rows where any of the variables have missing values. This was previously accomplished with filter_all with the help of any_vars as follows:

filter_all(blp_df, any_vars(is.na(.)))

## # A tibble: 179 x 7
##    participant lex   spell      resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>      <chr> <dbl>   <dbl>  <dbl>
##  1          37 W     nothings   N        NA     552    712
##  2          28 W     stelae     N        NA     678    497
##  3          85 W     forewarned N        NA     525    350
##  4         105 N     pamps      N        NA     884   1494
##  5          27 W     outgrowth  N        NA     633   1014
##  6          89 W     chards     N        NA     545    754
##  7          63 N     shrudule   N        NA       0   2553
##  8          73 W     chiggers   N        NA     726    654
##  9          73 N     bunding    W        NA     978   1279
## 10          22 W     aitches    N        NA     521    665
## # … with 169 more rows

If, on the other hand, we wanted to obtain those rows where all values are missing values, we could use all_vars as in the following:

filter_all(blp_df, all_vars(is.na(.)))

## # A tibble: 0 x 7
## # … with 7 variables: participant <dbl>, lex <chr>, spell <chr>, resp <chr>,
## #   rt <dbl>, prev.rt <dbl>, rt.raw <dbl>

This clearly returns an empty data frame as there are no rows where all values are NA. As a related example, let us say we wanted to select those rows where all values were numerically less than a specified threshold such as 500. We might try this:

filter_all(blp_df, all_vars(. < 500))

## # A tibble: 0 x 7
## # … with 7 variables: participant <dbl>, lex <chr>, spell <chr>, resp <chr>,
## #   rt <dbl>, prev.rt <dbl>, rt.raw <dbl>

This returns an empty data frame because the comparison inside all_vars is also applied to the character columns, where it is not a meaningful numerical test: R converts 500 to the string "500" and compares strings, so the test fails for values like N or staud. We therefore need to apply all_vars(. < 500) only to the numeric columns. This can be accomplished with filter_if as follows:

filter_if(blp_df, is.numeric, all_vars(. < 500))

## # A tibble: 69 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          97 W     soda      W       436     447    436
##  2          81 N     fugate    N       425     403    425
##  3          82 W     kitty     W       431     476    431
##  4          66 N     freethyme N       494     491    494
##  5          19 N     jontage   N       413     471    413
##  6          44 W     snows     W       437     432    437
##  7           3 W     gleam     W       361     370    361
##  8          85 N     hirrs     N       485     467    485
##  9         103 N     tevs      N       489     491    489
## 10           4 W     midship   W       470     483    470
## # … with 59 more rows

Alternatively, we could apply all_vars(. < 500) only to the columns whose names begin or end with rt. This can be done with filter_at, where we select the appropriate columns with the same matches('^rt|rt$') that we used above.

filter_at(blp_df, vars(matches('^rt|rt$')), all_vars(. < 500))

## # A tibble: 69 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          97 W     soda      W       436     447    436
##  2          81 N     fugate    N       425     403    425
##  3          82 W     kitty     W       431     476    431
##  4          66 N     freethyme N       494     491    494
##  5          19 N     jontage   N       413     471    413
##  6          44 W     snows     W       437     432    437
##  7           3 W     gleam     W       361     370    361
##  8          85 N     hirrs     N       485     467    485
##  9         103 N     tevs      N       489     491    489
## 10           4 W     midship   W       470     483    470
## # … with 59 more rows

These filter_all, filter_if, and filter_at actions can now be accomplished by using filter itself with the help of functions like if_any or if_all. For example, to find all rows with missing values we can do the following:

filter(blp_df, if_any(everything(), ~is.na(.)))

## # A tibble: 179 x 7
##    participant lex   spell      resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>      <chr> <dbl>   <dbl>  <dbl>
##  1          37 W     nothings   N        NA     552    712
##  2          28 W     stelae     N        NA     678    497
##  3          85 W     forewarned N        NA     525    350
##  4         105 N     pamps      N        NA     884   1494
##  5          27 W     outgrowth  N        NA     633   1014
##  6          89 W     chards     N        NA     545    754
##  7          63 N     shrudule   N        NA       0   2553
##  8          73 W     chiggers   N        NA     726    654
##  9          73 N     bunding    W        NA     978   1279
## 10          22 W     aitches    N        NA     521    665
## # … with 169 more rows

Here, the first argument of if_any selects the columns to which the function given as the second argument, in this case the anonymous function ~is.na(.), is applied. In other words, we apply ~is.na(.) to the values of all variables on all rows, and then keep those rows where any returned value is true. We can use something other than everything() to accomplish what filter_at or filter_if accomplished above. For example, to keep those rows where the values of all the numeric columns are less than 500, we can do the following, which replaces the filter_if example above:

filter(blp_df, if_all(where(is.numeric), ~. < 500))

## # A tibble: 69 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          97 W     soda      W       436     447    436
##  2          81 N     fugate    N       425     403    425
##  3          82 W     kitty     W       431     476    431
##  4          66 N     freethyme N       494     491    494
##  5          19 N     jontage   N       413     471    413
##  6          44 W     snows     W       437     432    437
##  7           3 W     gleam     W       361     370    361
##  8          85 N     hirrs     N       485     467    485
##  9         103 N     tevs      N       489     491    489
## 10           4 W     midship   W       470     483    470
## # … with 59 more rows

Alternatively, to keep those rows where the values of all the columns whose names begin or end with rt are less than 500, we can do the following, which replaces the filter_at example above:

filter(blp_df, if_all(matches('^rt|rt$'), ~. < 500))

## # A tibble: 69 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          97 W     soda      W       436     447    436
##  2          81 N     fugate    N       425     403    425
##  3          82 W     kitty     W       431     476    431
##  4          66 N     freethyme N       494     491    494
##  5          19 N     jontage   N       413     471    413
##  6          44 W     snows     W       437     432    437
##  7           3 W     gleam     W       361     370    361
##  8          85 N     hirrs     N       485     467    485
##  9         103 N     tevs      N       489     491    489
## 10           4 W     midship   W       470     483    470
## # … with 59 more rows

As some further examples, the following two commands select those rows where any or all, respectively, of the columns from rt to rt.raw have a value less than the median of that column:

filter(blp_df, if_any(rt:rt.raw, ~. < median(., na.rm = TRUE)))

## # A tibble: 694 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud     N       977     511    977
##  2           9 N     dinbuss   N       565     765    565
##  3          47 N     snilling  N       562     496    562
##  4         103 N     gancens   N       572     656    572
##  5          73 W     journals  W       538    1505    538
##  6          24 W     apache    W       626     546    626
##  7          11 W     flake     W       566     717    566
##  8          96 N     sarves    N       555     806    555
##  9          37 W     nothings  N        NA     552    712
## 10          52 N     chuespies N       427     539    427
## # … with 684 more rows

filter(blp_df, if_all(rt:rt.raw, ~. < median(., na.rm = TRUE)))

## # A tibble: 251 x 7
##    participant lex   spell     resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>     <chr> <dbl>   <dbl>  <dbl>
##  1          47 N     snilling  N       562     496    562
##  2          52 N     chuespies N       427     539    427
##  3           3 N     bromble   N       523     502    523
##  4          36 W     outposts  W       560     461    560
##  5          24 W     owl       W       470     535    470
##  6          97 W     soda      W       436     447    436
##  7          18 N     tesslier  N       560     477    560
##  8          81 N     fugate    N       425     403    425
##  9          29 N     placker   N       542     558    542
## 10          82 W     kitty     W       431     476    431
## # … with 241 more rows
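The difference between if_any and if_all, and how each treats missing values, is easier to see on a very small example. The following sketch uses a made-up three-row tibble, toy_df, which is an assumption introduced for illustration, and assumes dplyr 1.0.4 or later, where if_any and if_all are available.

```r
library(dplyr)

# Hypothetical data: row "c" has a missing value in x.
toy_df <- tibble(
  id = c("a", "b", "c"),
  x  = c(1, 5, NA),
  y  = c(2, 6, 1)
)

# Rows where at least one of x, y is below 3.
# Row "c" is kept because NA | TRUE is TRUE.
any_small <- filter(toy_df, if_any(c(x, y), ~ .x < 3))

# Rows where both x and y are below 3.
# Row "c" is dropped because NA & TRUE is NA, and filter drops NA.
all_small <- filter(toy_df, if_all(c(x, y), ~ .x < 3))

# Rows with no missing values in any column, by negating if_any.
complete_rows <- filter(toy_df, !if_any(everything(), is.na))

any_small$id
all_small$id
complete_rows$id
```

Negating if_any in this way gives the complement of the missing-value filter shown earlier, that is, the rows in which no variable is missing.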

Replacing `mutate_(all|if|at)` with `mutate` and `across`

The mutate verb either modifies existing variables in the data frame or else creates new ones. For example, we can create a new variable acc that takes the value of TRUE whenever lex and resp have the same value as follows¹:

mutate(blp_df, acc = lex == resp)

## # A tibble: 1,000 x 8
##    participant lex   spell    resp     rt prev.rt rt.raw acc  
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl> <lgl>
##  1          20 N     staud    N       977     511    977 TRUE 
##  2           9 N     dinbuss  N       565     765    565 TRUE 
##  3          47 N     snilling N       562     496    562 TRUE 
##  4         103 N     gancens  N       572     656    572 TRUE 
##  5          45 W     filled   W       659     981    659 TRUE 
##  6          73 W     journals W       538    1505    538 TRUE 
##  7          24 W     apache   W       626     546    626 TRUE 
##  8          11 W     flake    W       566     717    566 TRUE 
##  9          32 W     reliefs  W       922    1471    922 TRUE 
## 10          96 N     sarves   N       555     806    555 TRUE 
## # … with 990 more rows

We can also modify an existing variable with mutate. For example, we can change the rt variable from millisecond units to seconds as follows:

mutate(blp_df, rt = rt / 1000)

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N     0.977     511    977
##  2           9 N     dinbuss  N     0.565     765    565
##  3          47 N     snilling N     0.562     496    562
##  4         103 N     gancens  N     0.572     656    572
##  5          45 W     filled   W     0.659     981    659
##  6          73 W     journals W     0.538    1505    538
##  7          24 W     apache   W     0.626     546    626
##  8          11 W     flake    W     0.566     717    566
##  9          32 W     reliefs  W     0.922    1471    922
## 10          96 N     sarves   N     0.555     806    555
## # … with 990 more rows

If, on the other hand, we want to change the units of rt, prev.rt, and rt.raw from milliseconds to seconds, we previously would have used mutate_at. For example, the following selects the variables rt to rt.raw and applies the divide by 1000 function to them:

mutate_at(blp_df, vars(rt:rt.raw), ~./1000)

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N     0.977   0.511  0.977
##  2           9 N     dinbuss  N     0.565   0.765  0.565
##  3          47 N     snilling N     0.562   0.496  0.562
##  4         103 N     gancens  N     0.572   0.656  0.572
##  5          45 W     filled   W     0.659   0.981  0.659
##  6          73 W     journals W     0.538   1.50   0.538
##  7          24 W     apache   W     0.626   0.546  0.626
##  8          11 W     flake    W     0.566   0.717  0.566
##  9          32 W     reliefs  W     0.922   1.47   0.922
## 10          96 N     sarves   N     0.555   0.806  0.555
## # … with 990 more rows

We now accomplish this using across inside mutate:

mutate(blp_df, across(rt:rt.raw, ~./1000))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N     0.977   0.511  0.977
##  2           9 N     dinbuss  N     0.565   0.765  0.565
##  3          47 N     snilling N     0.562   0.496  0.562
##  4         103 N     gancens  N     0.572   0.656  0.572
##  5          45 W     filled   W     0.659   0.981  0.659
##  6          73 W     journals W     0.538   1.50   0.538
##  7          24 W     apache   W     0.626   0.546  0.626
##  8          11 W     flake    W     0.566   0.717  0.566
##  9          32 W     reliefs  W     0.922   1.47   0.922
## 10          96 N     sarves   N     0.555   0.806  0.555
## # … with 990 more rows
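By default, across overwrites the columns it is applied to. If we would rather keep the original columns and add the converted ones alongside them, across has a .names argument for this. The following is a minimal sketch on a made-up tibble, toy_df, which is an assumption introduced for illustration. The "_sec" suffix is likewise just an illustrative choice, and the {.col} placeholder assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of blp_df.
toy_df <- tibble(
  spell   = c("staud", "dinbuss"),
  rt      = c(977, 565),
  prev.rt = c(511, 765),
  rt.raw  = c(977, 565)
)

# Keep rt, prev.rt, rt.raw in milliseconds and add new columns in seconds.
# "{.col}" is replaced by the name of each selected column.
with_seconds <- mutate(
  toy_df,
  across(rt:rt.raw, ~ .x / 1000, .names = "{.col}_sec")
)

names(with_seconds)
```

The result should contain the four original columns followed by rt_sec, prev.rt_sec, and rt.raw_sec.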

The use of across inside mutate can also replace mutate_if and mutate_all. For example, if we want to convert all the character vectors to factor variables, we would have previously used mutate_if, as in the following:

mutate_if(blp_df, is.character, as.factor)

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <fct> <fct>    <fct> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

Or, if we wanted to convert all variables to factor variables, we would have used mutate_all, as in the following:

mutate_all(blp_df, as.factor)

## # A tibble: 1,000 x 7
##    participant lex   spell    resp  rt    prev.rt rt.raw
##    <fct>       <fct> <fct>    <fct> <fct> <fct>   <fct> 
##  1 20          N     staud    N     977   511     977   
##  2 9           N     dinbuss  N     565   765     565   
##  3 47          N     snilling N     562   496     562   
##  4 103         N     gancens  N     572   656     572   
##  5 45          W     filled   W     659   981     659   
##  6 73          W     journals W     538   1505    538   
##  7 24          W     apache   W     626   546     626   
##  8 11          W     flake    W     566   717     566   
##  9 32          W     reliefs  W     922   1471    922   
## 10 96          N     sarves   N     555   806     555   
## # … with 990 more rows

Now, we can accomplish this using across inside mutate:

# convert character vectors to factors
mutate(blp_df, across(where(is.character), as.factor))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp     rt prev.rt rt.raw
##          <dbl> <fct> <fct>    <fct> <dbl>   <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977
##  2           9 N     dinbuss  N       565     765    565
##  3          47 N     snilling N       562     496    562
##  4         103 N     gancens  N       572     656    572
##  5          45 W     filled   W       659     981    659
##  6          73 W     journals W       538    1505    538
##  7          24 W     apache   W       626     546    626
##  8          11 W     flake    W       566     717    566
##  9          32 W     reliefs  W       922    1471    922
## 10          96 N     sarves   N       555     806    555
## # … with 990 more rows

# convert all vectors to factors
mutate(blp_df, across(everything(), as.factor))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp  rt    prev.rt rt.raw
##    <fct>       <fct> <fct>    <fct> <fct> <fct>   <fct> 
##  1 20          N     staud    N     977   511     977   
##  2 9           N     dinbuss  N     565   765     565   
##  3 47          N     snilling N     562   496     562   
##  4 103         N     gancens  N     572   656     572   
##  5 45          W     filled   W     659   981     659   
##  6 73          W     journals W     538   1505    538   
##  7 24          W     apache   W     626   546     626   
##  8 11          W     flake    W     566   717     566   
##  9 32          W     reliefs  W     922   1471    922   
## 10 96          N     sarves   N     555   806     555   
## # … with 990 more rows
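The equivalence of the old and new approaches can be checked directly. The following is a minimal sketch, assuming a small made-up tibble named toy_df (hypothetical, standing in for blp_df) so that it runs without the data file:

```r
library(dplyr)

# Hypothetical toy data standing in for blp_df
toy_df <- tibble(
  participant = c(20, 9, 47),
  lex = c("N", "N", "W"),
  rt = c(977, 565, 562)
)

# Superseded variant and its across() replacement
old_way <- mutate_if(toy_df, is.character, as.factor)
new_way <- mutate(toy_df, across(where(is.character), as.factor))

# Compare the two results as plain data frames
all.equal(as.data.frame(old_way), as.data.frame(new_way))

# Inspect the column classes: only the character column is converted
sapply(new_way, class)
```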

As another example, if we wish to rescale all the numeric variables so that they have a mean of zero and standard deviation of one, we previously would have used mutate_if:

mutate_if(blp_df, is.numeric, ~as.vector(scale(.)))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp       rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr>   <dbl>   <dbl>  <dbl>
##  1     -0.946  N     staud    N      1.77   -0.588   0.567
##  2     -1.30   N     dinbuss  N     -0.381   0.414  -0.301
##  3     -0.0790 N     snilling N     -0.396  -0.647  -0.308
##  4      1.72   N     gancens  N     -0.344  -0.0162 -0.286
##  5     -0.143  W     filled   W      0.111   1.27   -0.103
##  6      0.756  W     journals W     -0.522   3.33   -0.358
##  7     -0.817  W     apache   W     -0.0616 -0.450  -0.173
##  8     -1.23   W     flake    W     -0.375   0.224  -0.299
##  9     -0.561  W     reliefs  W      1.49    3.20    0.451
## 10      1.49   N     sarves   N     -0.433   0.576  -0.322
## # … with 990 more rows

Now, we use across inside mutate:

mutate(blp_df, across(where(is.numeric), ~as.vector(scale(.))))

## # A tibble: 1,000 x 7
##    participant lex   spell    resp       rt prev.rt rt.raw
##          <dbl> <chr> <chr>    <chr>   <dbl>   <dbl>  <dbl>
##  1     -0.946  N     staud    N      1.77   -0.588   0.567
##  2     -1.30   N     dinbuss  N     -0.381   0.414  -0.301
##  3     -0.0790 N     snilling N     -0.396  -0.647  -0.308
##  4      1.72   N     gancens  N     -0.344  -0.0162 -0.286
##  5     -0.143  W     filled   W      0.111   1.27   -0.103
##  6      0.756  W     journals W     -0.522   3.33   -0.358
##  7     -0.817  W     apache   W     -0.0616 -0.450  -0.173
##  8     -1.23   W     flake    W     -0.375   0.224  -0.299
##  9     -0.561  W     reliefs  W      1.49    3.20    0.451
## 10      1.49   N     sarves   N     -0.433   0.576  -0.322
## # … with 990 more rows
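The as.vector wrapper is used in these examples because scale returns a one-column matrix rather than a plain vector. A short base R sketch of this, using the first five rt values printed above as an illustrative vector x (the names x, z_matrix, z_vector, and z_manual are ours):

```r
# First five rt values from the printed tibble, used only for illustration
x <- c(977, 565, 562, 572, 659)

z_matrix <- scale(x)             # one-column matrix with scaling attributes
z_vector <- as.vector(scale(x))  # plain numeric vector

is.matrix(z_matrix)
is.matrix(z_vector)

# The same standardization written out by hand
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_vector, z_manual)
```

Without as.vector, each rescaled column in the resulting data frame would be a matrix column, which is usually not what we want.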

As a final example, let’s say we want to recode the N and W in the variables lex and resp to nonword and word, respectively. Previously, we would have used mutate_at as follows:

mutate_at(blp_df,
          vars(c(lex, resp)),
          ~recode(., 'W' = 'word', 'N' = 'nonword')
)

## # A tibble: 1,000 x 7
##    participant lex     spell    resp       rt prev.rt rt.raw
##          <dbl> <chr>   <chr>    <chr>   <dbl>   <dbl>  <dbl>
##  1          20 nonword staud    nonword   977     511    977
##  2           9 nonword dinbuss  nonword   565     765    565
##  3          47 nonword snilling nonword   562     496    562
##  4         103 nonword gancens  nonword   572     656    572
##  5          45 word    filled   word      659     981    659
##  6          73 word    journals word      538    1505    538
##  7          24 word    apache   word      626     546    626
##  8          11 word    flake    word      566     717    566
##  9          32 word    reliefs  word      922    1471    922
## 10          96 nonword sarves   nonword   555     806    555
## # … with 990 more rows

Now, we use mutate with across:

mutate(blp_df,
       across(c(lex, resp), 
              ~recode(., 'W' = 'word', 'N' = 'nonword'))
)

## # A tibble: 1,000 x 7
##    participant lex     spell    resp       rt prev.rt rt.raw
##          <dbl> <chr>   <chr>    <chr>   <dbl>   <dbl>  <dbl>
##  1          20 nonword staud    nonword   977     511    977
##  2           9 nonword dinbuss  nonword   565     765    565
##  3          47 nonword snilling nonword   562     496    562
##  4         103 nonword gancens  nonword   572     656    572
##  5          45 word    filled   word      659     981    659
##  6          73 word    journals word      538    1505    538
##  7          24 word    apache   word      626     546    626
##  8          11 word    flake    word      566     717    566
##  9          32 word    reliefs  word      922    1471    922
## 10          96 nonword sarves   nonword   555     806    555
## # … with 990 more rows

Replacing `summarize_(all|if|at)` with `summarize` and `across`

The summarize (or summarise) function can be used to calculate summary statistics of variables. For example, to calculate the mean and standard deviation of rt, and the median and MAD of rt.raw, we would do the following:

summarise(blp_df, 
          avg_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          median_raw = median(rt.raw, na.rm = TRUE),
          mad_raw = mad(rt.raw, na.rm = TRUE)
)

## # A tibble: 1 x 4
##   avg_rt sd_rt median_raw mad_raw
##    <dbl> <dbl>      <dbl>   <dbl>
## 1   638.  191.        605    153.

The summarize_all variant could be used to apply a summarization function to all variables. For example, to calculate the number of distinct elements in each variable, we would previously have used summarize_all:

summarise_all(blp_df, n_distinct)

## # A tibble: 1 x 7
##   participant   lex spell  resp    rt prev.rt rt.raw
##         <int> <int> <int> <int> <int>   <int>  <int>
## 1          78     2   990     2   421     493    516

Now, we use summarize with across:

summarize(blp_df, across(everything(), n_distinct))

## # A tibble: 1 x 7
##   participant   lex spell  resp    rt prev.rt rt.raw
##         <int> <int> <int> <int> <int>   <int>  <int>
## 1          78     2   990     2   421     493    516

The use of across in summarize also replaces summarize_if and summarize_at. For example, to calculate the mean of the variables that begin or end with rt, we would previously have used summarize_at as follows:

summarise_at(blp_df,
             vars(matches('^rt|rt$')),
             ~mean(., na.rm = TRUE))

## # A tibble: 1 x 3
##      rt prev.rt rt.raw
##   <dbl>   <dbl>  <dbl>
## 1  638.    660.   708.

Now, we use across inside summarize:

summarize(blp_df,
          across(matches('^rt|rt$'),
                 ~mean(., na.rm = TRUE))
)

## # A tibble: 1 x 3
##      rt prev.rt rt.raw
##   <dbl>   <dbl>  <dbl>
## 1  638.    660.   708.
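To see why exactly these three columns are selected, the regular expression can be checked against the column names directly. This is a small base R sketch, where col_names is simply a copy of the blp_df column names shown in the printed tibble:

```r
# Column names of blp_df, copied from the printed tibble
col_names <- c("participant", "lex", "spell", "resp",
               "rt", "prev.rt", "rt.raw")

# '^rt' matches names that begin with rt; 'rt$' matches names that end with rt
selected <- col_names[grepl("^rt|rt$", col_names)]
selected
# → "rt" "prev.rt" "rt.raw"
```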

Likewise, to calculate the mean of all numeric variables, we previously would have used summarize_if as follows:

summarise_if(blp_df, is.numeric, ~mean(., na.rm = TRUE))

## # A tibble: 1 x 4
##   participant    rt prev.rt rt.raw
##         <dbl> <dbl>   <dbl>  <dbl>
## 1        49.5  638.    660.   708.

Now, we use across inside summarize:

summarize(blp_df, across(where(is.numeric),
                         ~mean(., na.rm = TRUE))
)

## # A tibble: 1 x 4
##   participant    rt prev.rt rt.raw
##         <dbl> <dbl>   <dbl>  <dbl>
## 1        49.5  638.    660.   708.

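Note that if we want to apply multiple summarization functions to the variables selected by across, we can provide a named list of functions. In the following example, we calculate the mean and the median of three selected variables, and each output column is named by joining the variable name and the name of the function in the list with an underscore: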

summarize(blp_df,
          across(matches('^rt|rt$'),
                 list(avg = ~mean(., na.rm = TRUE),
                      median = ~median(., na.rm = TRUE))
          )
)

## # A tibble: 1 x 6
##   rt_avg rt_median prev.rt_avg prev.rt_median rt.raw_avg rt.raw_median
##    <dbl>     <dbl>       <dbl>          <dbl>      <dbl>         <dbl>
## 1   638.       588        660.            594       708.           605
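The naming of the output columns can be changed with the .names argument of across, which takes a glue-style template built from {.col} and {.fn}. The following is a minimal sketch, assuming a hypothetical tibble toy_df holding the first three rows of two blp_df columns; summary_fns, default_names, and custom_names are names introduced here for illustration:

```r
library(dplyr)

# Hypothetical toy data: first three rows of two blp_df columns
toy_df <- tibble(
  rt = c(977, 565, 562),
  prev.rt = c(511, 765, 496)
)

# Named list of summarization functions
summary_fns <- list(avg = ~mean(., na.rm = TRUE),
                    median = ~median(., na.rm = TRUE))

# Default naming: column name, underscore, function name
default_names <- summarize(toy_df, across(everything(), summary_fns))
names(default_names)

# Custom naming: function name first, then the column name
custom_names <- summarize(toy_df,
                          across(everything(), summary_fns,
                                 .names = "{.fn}.{.col}"))
names(custom_names)
```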

Rowwise operations using `c_across`

dplyr version 1.0 introduced the function c_across, which allows us to perform operations across rows. As an example, let us say we want to calculate, for each row, the mean of the three columns that begin or end with rt. We accomplish this by selecting the three columns inside c_across, which we place inside the mean function. However, it is necessary to first apply the rowwise function to the data frame to ensure that the mean operation is applied across rows:

rowwise(blp_df) %>% 
  mutate(rt_avg = mean(c_across(rt:rt.raw)))

## # A tibble: 1,000 x 8
## # Rowwise: 
##    participant lex   spell    resp     rt prev.rt rt.raw rt_avg
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>  <dbl>
##  1          20 N     staud    N       977     511    977   822.
##  2           9 N     dinbuss  N       565     765    565   632.
##  3          47 N     snilling N       562     496    562   540 
##  4         103 N     gancens  N       572     656    572   600 
##  5          45 W     filled   W       659     981    659   766.
##  6          73 W     journals W       538    1505    538   860.
##  7          24 W     apache   W       626     546    626   599.
##  8          11 W     flake    W       566     717    566   616.
##  9          32 W     reliefs  W       922    1471    922  1105 
## 10          96 N     sarves   N       555     806    555   639.
## # … with 990 more rows
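The role of rowwise can be seen by comparing the result with and without it. The following is a minimal sketch, assuming a hypothetical tibble toy_df that holds the first three rows of the three rt columns; it also uses ungroup to drop the rowwise grouping afterwards, and checks the result against base R's rowMeans:

```r
library(dplyr)

# Hypothetical toy data: first three rows of the three rt columns
toy_df <- tibble(
  rt = c(977, 565, 562),
  prev.rt = c(511, 765, 496),
  rt.raw = c(977, 565, 562)
)

# Without rowwise(), mean() sees whole columns and returns one grand mean,
# which mutate() recycles to every row
no_rowwise <- mutate(toy_df, rt_avg = mean(c(rt, prev.rt, rt.raw)))

# With rowwise(), each row is treated as its own group, giving one mean per
# row; ungroup() then removes the rowwise grouping
with_rowwise <- toy_df %>%
  rowwise() %>%
  mutate(rt_avg = mean(c_across(rt:rt.raw))) %>%
  ungroup()

no_rowwise$rt_avg
with_rowwise$rt_avg

# Base R check of the per-row means
all.equal(with_rowwise$rt_avg, unname(rowMeans(toy_df)))
```

For a simple mean, rowMeans is a reasonable alternative; rowwise with c_across is more general because any function of a vector can be placed around c_across.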

In this example, we selected the appropriate columns using a range of column names. We could use any other way of selecting variables, including where. For example, here we calculate, for each row, the maximum value across all numeric variables in the data frame:

rowwise(blp_df) %>% 
  mutate(max_val = max(c_across(where(is.numeric))))

## # A tibble: 1,000 x 8
## # Rowwise: 
##    participant lex   spell    resp     rt prev.rt rt.raw max_val
##          <dbl> <chr> <chr>    <chr> <dbl>   <dbl>  <dbl>   <dbl>
##  1          20 N     staud    N       977     511    977     977
##  2           9 N     dinbuss  N       565     765    565     765
##  3          47 N     snilling N       562     496    562     562
##  4         103 N     gancens  N       572     656    572     656
##  5          45 W     filled   W       659     981    659     981
##  6          73 W     journals W       538    1505    538    1505
##  7          24 W     apache   W       626     546    626     626
##  8          11 W     flake    W       566     717    566     717
##  9          32 W     reliefs  W       922    1471    922    1471
## 10          96 N     sarves   N       555     806    555     806
## # … with 990 more rows

The blp_df data is from a lexical decision task. The lex variable indicates if the string is a real word or not. The resp variable indicates if the participant responded that the string is a word or not. Therefore, whenever lex and resp are identical, the participant was accurate in their response.
