This post explains how we can use across, where(), rename_with(), etc., to perform actions that were previously accomplished with the _if, _at, _all variants of the dplyr verbs.
Author
Mark Andrews
Published
October 27, 2020
The dplyr package describes itself as a “grammar of data manipulation” and its main features are its “verbs” like select, filter, mutate, and so on. Prior to version 1.0 of dplyr, the verbs had variants with the suffixes _if, _at, _all. These extended the basic functionality of each of the dplyr verbs. As of 1.0, however, these suffixed variants have been superseded by another set of functions, often used in combination, including across, where, rename_with, if_any, if_all, and so on. Why the developers of dplyr chose to make these changes is described here. In practice, this means that if you already use dplyr, you can continue using the suffixed variants, but across, where, etc., are now the recommended way to accomplish what those variants did. If, on the other hand, you are new to dplyr, it is not worthwhile learning _if, _at, _all, other than to know that they were the previous way of doing things that are now accomplished with the new functions.
In this post, I explain how we can use across, where, rename_with, etc., to perform actions that were previously accomplished with the suffixed variants of dplyr verbs. In addition, we will cover a related function, c_across, which is used for rowwise operations in data frames that were previously very awkward to accomplish in other ways. While I briefly summarize the main dplyr verbs in this post, I will assume that readers are already familiar, at least to a minimal extent, with these verbs.
In this coverage, we will use the data frame blp_df (the csv file of which is here):
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Replacing select_if with select and where
The select verb allows us to select columns in a variety of different ways such as their name, or position, or what characters their names begin or end with, etc. For example, here we select the columns that start with the string rt:
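A minimal sketch of such a selection follows. It assumes a small stand-in data frame, toy_df (a hypothetical name), built from five rows copied from the blp_df printouts in this post, so that it runs without the csv file:

```r
library(dplyr)

# Stand-in for blp_df: five rows copied from the printouts in this post
# (an assumption made only so the sketch is self-contained).
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Select the columns whose names start with "rt"
rt_cols <- select(toy_df, starts_with("rt"))
names(rt_cols)  # "rt" "rt.raw" (prev.rt does not *start* with rt)
```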
We can also select columns according to the properties of the values of the variable. For example, we can select those variables whose values are character strings, or numeric, or logical, etc. Previously, this was accomplished using select_if, whereby we selected a column if its values meet certain conditions. For example, to select the columns that are character vectors, we previously did the following:
select_if(blp_df, is.character)
# A tibble: 1,000 × 3
lex spell resp
<chr> <chr> <chr>
1 N staud N
2 N dinbuss N
3 N snilling N
4 N gancens N
5 W filled W
6 W journals W
7 W apache W
8 W flake W
9 W reliefs W
10 N sarves N
# ℹ 990 more rows
Now, we can accomplish this with where() as follows:
select(blp_df, where(is.character))
# A tibble: 1,000 × 3
lex spell resp
<chr> <chr> <chr>
1 N staud N
2 N dinbuss N
3 N snilling N
4 N gancens N
5 W filled W
6 W journals W
7 W apache W
8 W flake W
9 W reliefs W
10 N sarves N
# ℹ 990 more rows
When using where like this, we can also use custom functions like the following:
# return TRUE if `x` is numeric and its mean > 700.
has_high_mean <- function(x) {
  is.numeric(x) && (mean(x, na.rm = TRUE) > 700)
}
select(blp_df, where(has_high_mean))
As we can see, we select those columns for which the custom function returns TRUE.
Note that we can also use anonymous functions, such as purrr lambdas as described here, inside where. For example, we can accomplish the previous command with an anonymous function as follows:
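As a sketch of the anonymous-function form, the following applies the same test (numeric with a mean above 700) through a purrr-style lambda. The data frame toy_df is a hypothetical stand-in made of five rows copied from the blp_df printouts in this post:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# `.` stands for each column in turn; && short-circuits for non-numeric columns
high_mean_cols <- select(
  toy_df,
  where(~ is.numeric(.) && mean(., na.rm = TRUE) > 700)
)
names(high_mean_cols)  # "rt" "prev.rt" for these five rows
```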
Replacing rename_all, rename_at, rename_if with rename_with
The rename verb allows us to rename columns. For example, if we want to rename participant to subject, we do the following:
rename(blp_df, subject = participant)
# A tibble: 1,000 × 7
subject lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
We can also use a renaming function that can be applied to all names. For example, we could use a function to convert all names to uppercase, or a function that replaces any occurrence of a dot with an underscore. Previously, this was accomplished with rename_all.
rename_all(blp_df, toupper) # convert all names to uppercase
# A tibble: 1,000 × 7
PARTICIPANT LEX SPELL RESP RT PREV.RT RT.RAW
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
rename_all(blp_df, ~str_replace_all(., '\\.', '_')) # replace dots with underscores
# A tibble: 1,000 × 7
participant lex spell resp rt prev_rt rt_raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
We can now accomplish this by using rename_with instead of rename_all.
rename_with(blp_df, toupper)
# A tibble: 1,000 × 7
PARTICIPANT LEX SPELL RESP RT PREV.RT RT.RAW
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
rename_with(blp_df, ~str_replace_all(., '\\.', '_'))
# A tibble: 1,000 × 7
participant lex spell resp rt prev_rt rt_raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Admittedly, this looks as though we have just renamed the function rename_all to rename_with. However, rename_with is more versatile than rename_all, and replaces other suffixed variants of rename such as rename_at and rename_if. As an example, let us say we want to replace the string rt (reaction time) with time in the column names so that, for example, prev.rt is replaced with prev.time, etc. We could try rename_with as follows:
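A sketch of such a call, assuming str_replace_all from stringr (as used earlier in the post) and a hypothetical stand-in data frame toy_df built from five rows of the blp_df printouts:

```r
library(dplyr)
library(stringr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Replace "rt" with "time" in *every* column name
renamed_all <- rename_with(toy_df, ~ str_replace_all(., "rt", "time"))
names(renamed_all)
```

The same call with blp_df in place of toy_df should give output like the following: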
# A tibble: 1,000 × 7
patimeicipant lex spell resp time prev.time time.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Clearly, this is not ideal as it replaced the rt in participant too, leading to patimeicipant. In order to apply a renaming function to selected columns only, previously we would have used rename_at, as in the following example:
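A sketch of the rename_at form, again on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts) and assuming str_replace_all from stringr:

```r
library(dplyr)
library(stringr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style: rename only the columns picked out by vars()
renamed_at <- rename_at(toy_df,
                        vars(matches("^rt|rt$")),
                        ~ str_replace_all(., "rt", "time"))
names(renamed_at)
```

The same call with blp_df in place of toy_df should give output like the following: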
# A tibble: 1,000 × 7
participant lex spell resp time prev.time time.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Here, the second line vars(matches('^rt|rt$')) selects those variables that match the regular expression ^rt|rt$, which selects only those names that begin or end with rt. The rename_with function can accomplish this too as follows:
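A sketch of the rename_with form, where the column selection is passed as the third argument. As before, toy_df is a hypothetical stand-in built from five rows of the blp_df printouts:

```r
library(dplyr)
library(stringr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Renaming function first, then the columns it should be applied to
renamed_with <- rename_with(toy_df,
                            ~ str_replace_all(., "rt", "time"),
                            matches("^rt|rt$"))
names(renamed_with)
```

The same call with blp_df in place of toy_df should give output like the following: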
# A tibble: 1,000 × 7
participant lex spell resp time prev.time time.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
As another example, consider renaming those columns whose values have certain characteristics. For example, we could rename the columns that are character vectors by converting their names to uppercase. Previously, we would have accomplished this with rename_if as follows:
rename_if(blp_df, is.character, toupper)
# A tibble: 1,000 × 7
participant LEX SPELL RESP rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Now, we can accomplish this with rename_with and where as follows:
rename_with(blp_df, toupper, where(is.character))
# A tibble: 1,000 × 7
participant LEX SPELL RESP rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Replacing filter_(all|if|at) with filter and if_any or if_all
The filter verb selects rows of a data frame according to certain criteria. For example, the following gives us those observations where lex has the value of N and resp has the value of W and rt.raw is less than or equal to 500.
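A sketch of such a filter, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Conditions separated by commas are combined with a logical AND
fast_errors <- filter(toy_df, lex == "N", resp == "W", rt.raw <= 500)
fast_errors$spell  # only "cown" meets all three conditions here
```

The same call with blp_df in place of toy_df should give output like the following: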
# A tibble: 5 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 28 N cown W NA 680 498
2 17 N beeched W NA 450 469
3 29 N conforn W NA 495 497
4 35 N blear W NA 592 461
5 89 N stumming W NA 571 442
The filter_all variant of filter was previously used to filter those rows where a function returns true for any or all of the columns. For example, let’s say we want to obtain those rows where any of the variables have missing values. This was previously accomplished with filter_all with the help of any_vars as follows:
filter_all(blp_df, any_vars(is.na(.)))
# A tibble: 179 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 37 W nothings N NA 552 712
2 28 W stelae N NA 678 497
3 85 W forewarned N NA 525 350
4 105 N pamps N NA 884 1494
5 27 W outgrowth N NA 633 1014
6 89 W chards N NA 545 754
7 63 N shrudule N NA 0 2553
8 73 W chiggers N NA 726 654
9 73 N bunding W NA 978 1279
10 22 W aitches N NA 521 665
# ℹ 169 more rows
If, on the other hand, we wanted to obtain those rows where all values are missing values, we could use all_vars as in the following:
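A sketch of the all_vars form, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Keep rows where *every* column is NA (superseded filter_all style)
all_missing <- filter_all(toy_df, all_vars(is.na(.)))
nrow(all_missing)  # 0: no row is missing in every column
```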
This clearly returns an empty data frame as there are no rows where all values are NA. As a related example, let us say we wanted to select those rows where all values were numerically less than a specified threshold such as 500. We might try this:
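A sketch of that attempt, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# The condition is applied to every column, including the character ones
below_500_everywhere <- filter_all(toy_df, all_vars(. < 500))
nrow(below_500_everywhere)
```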
This returns an empty data frame because the condition inside all_vars is also applied to the character columns, where comparing a string with 500 is a string comparison that does not come out as true for these values. We therefore need to apply all_vars(. < 500) only to the numeric columns. This can be accomplished with filter_if as follows:
filter_if(blp_df, is.numeric, all_vars(. < 500))
# A tibble: 69 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 97 W soda W 436 447 436
2 81 N fugate N 425 403 425
3 82 W kitty W 431 476 431
4 66 N freethyme N 494 491 494
5 19 N jontage N 413 471 413
6 44 W snows W 437 432 437
7 3 W gleam W 361 370 361
8 85 N hirrs N 485 467 485
9 103 N tevs N 489 491 489
10 4 W midship W 470 483 470
# ℹ 59 more rows
Alternatively, we could apply all_vars(. < 500) only to the columns that begin or end with rt. This can be done with filter_at where we select the appropriate columns with the matches('^rt|rt$') we also used above.
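A sketch of the filter_at form, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style: condition applied only to the columns chosen in vars()
fast_at <- filter_at(toy_df, vars(matches("^rt|rt$")), all_vars(. < 500))
fast_at$spell  # only "soda" has rt, prev.rt and rt.raw all below 500
```

The same call with blp_df in place of toy_df should give output like the following: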
# A tibble: 69 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 97 W soda W 436 447 436
2 81 N fugate N 425 403 425
3 82 W kitty W 431 476 431
4 66 N freethyme N 494 491 494
5 19 N jontage N 413 471 413
6 44 W snows W 437 432 437
7 3 W gleam W 361 370 361
8 85 N hirrs N 485 467 485
9 103 N tevs N 489 491 489
10 4 W midship W 470 483 470
# ℹ 59 more rows
These filter_all, filter_if, filter_at actions can now be accomplished by using filter itself with the help of functions like if_any or if_all. For example, to find all rows with missing values we can do the following:
filter(blp_df, if_any(everything(), ~is.na(.)))
# A tibble: 179 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 37 W nothings N NA 552 712
2 28 W stelae N NA 678 497
3 85 W forewarned N NA 525 350
4 105 N pamps N NA 884 1494
5 27 W outgrowth N NA 633 1014
6 89 W chards N NA 545 754
7 63 N shrudule N NA 0 2553
8 73 W chiggers N NA 726 654
9 73 N bunding W NA 978 1279
10 22 W aitches N NA 521 665
# ℹ 169 more rows
Here, the first argument of if_any is a column selection, in this case everything(), and the second is the function to apply to each selected column, in this case the anonymous function ~is.na(.). In other words, we apply ~is.na(.) to the values of all variables on all rows, and then keep those rows where any returned value is true. We can use a selection other than everything() to accomplish what was done by filter_at or filter_if above. For example, to keep those rows where the values of the numeric columns are all less than 500, we can do the following, which replaces the filter_if example above:
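A sketch of the if_all form with where(is.numeric), on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Keep rows where every numeric column is below 500
fast_numeric <- filter(toy_df, if_all(where(is.numeric), ~ . < 500))
fast_numeric$spell
```

The same call with blp_df in place of toy_df should give output like the following: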
# A tibble: 69 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 97 W soda W 436 447 436
2 81 N fugate N 425 403 425
3 82 W kitty W 431 476 431
4 66 N freethyme N 494 491 494
5 19 N jontage N 413 471 413
6 44 W snows W 437 432 437
7 3 W gleam W 361 370 361
8 85 N hirrs N 485 467 485
9 103 N tevs N 489 491 489
10 4 W midship W 470 483 470
# ℹ 59 more rows
Alternatively, to keep those rows where the values of the columns that begin or end with rt are all less than 500, we can do the following, which replaces the filter_at example above:
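A sketch of the if_all form with matches(), on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Keep rows where every column beginning or ending with "rt" is below 500
fast_rt <- filter(toy_df, if_all(matches("^rt|rt$"), ~ . < 500))
fast_rt$spell
```

The same call with blp_df in place of toy_df should give output like the following: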
# A tibble: 69 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 97 W soda W 436 447 436
2 81 N fugate N 425 403 425
3 82 W kitty W 431 476 431
4 66 N freethyme N 494 491 494
5 19 N jontage N 413 471 413
6 44 W snows W 437 432 437
7 3 W gleam W 361 370 361
8 85 N hirrs N 485 467 485
9 103 N tevs N 489 491 489
10 4 W midship W 470 483 470
# ℹ 59 more rows
As some further examples, the following two commands select those rows where any or all, respectively, of the columns from rt to rt.raw have a value less than the median of that column:
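A sketch of the two commands, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The use of na.rm = TRUE in median is an assumption made here because rt contains missing values:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Rows where at least one of rt, prev.rt, rt.raw is below its column median
any_below <- filter(toy_df, if_any(rt:rt.raw, ~ . < median(., na.rm = TRUE)))

# Rows where all three are below their respective column medians
all_below <- filter(toy_df, if_all(rt:rt.raw, ~ . < median(., na.rm = TRUE)))

any_below$spell
all_below$spell
```

The same calls with blp_df in place of toy_df should give output like the following two data frames: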
# A tibble: 694 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 73 W journals W 538 1505 538
6 24 W apache W 626 546 626
7 11 W flake W 566 717 566
8 96 N sarves N 555 806 555
9 37 W nothings N NA 552 712
10 52 N chuespies N 427 539 427
# ℹ 684 more rows
# A tibble: 251 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 47 N snilling N 562 496 562
2 52 N chuespies N 427 539 427
3 3 N bromble N 523 502 523
4 36 W outposts W 560 461 560
5 24 W owl W 470 535 470
6 97 W soda W 436 447 436
7 18 N tesslier N 560 477 560
8 81 N fugate N 425 403 425
9 29 N placker N 542 558 542
10 82 W kitty W 431 476 431
# ℹ 241 more rows
Replacing mutate_(all|if|at) with mutate and across
The mutate verb either modifies existing variables in the data frame or else creates new ones. For example, we can create a new variable acc that takes the value of TRUE whenever lex and resp have the same value as follows (see the note at the end of this post):
mutate(blp_df, acc = lex == resp)
# A tibble: 1,000 × 8
participant lex spell resp rt prev.rt rt.raw acc
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 20 N staud N 977 511 977 TRUE
2 9 N dinbuss N 565 765 565 TRUE
3 47 N snilling N 562 496 562 TRUE
4 103 N gancens N 572 656 572 TRUE
5 45 W filled W 659 981 659 TRUE
6 73 W journals W 538 1505 538 TRUE
7 24 W apache W 626 546 626 TRUE
8 11 W flake W 566 717 566 TRUE
9 32 W reliefs W 922 1471 922 TRUE
10 96 N sarves N 555 806 555 TRUE
# ℹ 990 more rows
We can also modify an existing variable with mutate. For example, we can change the rt variable from millisecond units to seconds as follows:
mutate(blp_df, rt = rt / 1000)
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 0.977 511 977
2 9 N dinbuss N 0.565 765 565
3 47 N snilling N 0.562 496 562
4 103 N gancens N 0.572 656 572
5 45 W filled W 0.659 981 659
6 73 W journals W 0.538 1505 538
7 24 W apache W 0.626 546 626
8 11 W flake W 0.566 717 566
9 32 W reliefs W 0.922 1471 922
10 96 N sarves N 0.555 806 555
# ℹ 990 more rows
If, on the other hand, we want to change the units of rt, prev.rt, and rt.raw from milliseconds to seconds, we previously would have used mutate_at. For example, the following selects the variables rt to rt.raw and applies the divide by 1000 function to them:
mutate_at(blp_df, vars(rt:rt.raw), ~./1000)
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 0.977 0.511 0.977
2 9 N dinbuss N 0.565 0.765 0.565
3 47 N snilling N 0.562 0.496 0.562
4 103 N gancens N 0.572 0.656 0.572
5 45 W filled W 0.659 0.981 0.659
6 73 W journals W 0.538 1.50 0.538
7 24 W apache W 0.626 0.546 0.626
8 11 W flake W 0.566 0.717 0.566
9 32 W reliefs W 0.922 1.47 0.922
10 96 N sarves N 0.555 0.806 0.555
# ℹ 990 more rows
We now accomplish this using across inside mutate:
mutate(blp_df, across(rt:rt.raw, ~./1000))
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 N staud N 0.977 0.511 0.977
2 9 N dinbuss N 0.565 0.765 0.565
3 47 N snilling N 0.562 0.496 0.562
4 103 N gancens N 0.572 0.656 0.572
5 45 W filled W 0.659 0.981 0.659
6 73 W journals W 0.538 1.50 0.538
7 24 W apache W 0.626 0.546 0.626
8 11 W flake W 0.566 0.717 0.566
9 32 W reliefs W 0.922 1.47 0.922
10 96 N sarves N 0.555 0.806 0.555
# ℹ 990 more rows
The use of across inside mutate can also replace mutate_if and mutate_all. For example, if we want to convert all the character vectors to factor variables, we would have previously used mutate_if, as in the following:
mutate_if(blp_df, is.character, as.factor)
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Or if we wanted to convert all variables to factor variables, we would have done mutate_all, as in the following:
mutate_all(blp_df, as.factor)
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
Now, we can accomplish this using across inside mutate:
# convert character vectors to factors
mutate(blp_df, across(where(is.character), as.factor))
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
# convert all vectors to factors
mutate(blp_df, across(everything(), as.factor))
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 20 N staud N 977 511 977
2 9 N dinbuss N 565 765 565
3 47 N snilling N 562 496 562
4 103 N gancens N 572 656 572
5 45 W filled W 659 981 659
6 73 W journals W 538 1505 538
7 24 W apache W 626 546 626
8 11 W flake W 566 717 566
9 32 W reliefs W 922 1471 922
10 96 N sarves N 555 806 555
# ℹ 990 more rows
As another example, if we wish to rescale all the numeric variables so that they have a mean of zero and standard deviation of one, we previously would have used mutate_if:
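A sketch of the superseded form and its across equivalent, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). Wrapping scale in as.vector is an assumption made here so that each column stays a plain numeric vector:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style
z_old <- mutate_if(toy_df, is.numeric, ~ as.vector(scale(.)))

# Current style: across() plus where() inside mutate()
z_new <- mutate(toy_df, across(where(is.numeric), ~ as.vector(scale(.))))
```

The same two calls with blp_df in place of toy_df should give identical output, like the following two data frames: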
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 -0.946 N staud N 1.77 -0.588 0.567
2 -1.30 N dinbuss N -0.381 0.414 -0.301
3 -0.0790 N snilling N -0.396 -0.647 -0.308
4 1.72 N gancens N -0.344 -0.0162 -0.286
5 -0.143 W filled W 0.111 1.27 -0.103
6 0.756 W journals W -0.522 3.33 -0.358
7 -0.817 W apache W -0.0616 -0.450 -0.173
8 -1.23 W flake W -0.375 0.224 -0.299
9 -0.561 W reliefs W 1.49 3.20 0.451
10 1.49 N sarves N -0.433 0.576 -0.322
# ℹ 990 more rows
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 -0.946 N staud N 1.77 -0.588 0.567
2 -1.30 N dinbuss N -0.381 0.414 -0.301
3 -0.0790 N snilling N -0.396 -0.647 -0.308
4 1.72 N gancens N -0.344 -0.0162 -0.286
5 -0.143 W filled W 0.111 1.27 -0.103
6 0.756 W journals W -0.522 3.33 -0.358
7 -0.817 W apache W -0.0616 -0.450 -0.173
8 -1.23 W flake W -0.375 0.224 -0.299
9 -0.561 W reliefs W 1.49 3.20 0.451
10 1.49 N sarves N -0.433 0.576 -0.322
# ℹ 990 more rows
As a final example, let’s say we want to recode the N and W in the variables lex and resp to nonword and word, respectively. Previously, we would have used mutate_at as follows:
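A sketch of the mutate_at form and an across equivalent, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). Using dplyr's recode for the recoding step is an assumption; any function mapping N to nonword and W to word would do:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style
recoded_old <- mutate_at(toy_df, vars(lex, resp),
                         ~ recode(., N = "nonword", W = "word"))

# Current style: across() inside mutate()
recoded_new <- mutate(toy_df, across(c(lex, resp),
                                     ~ recode(., N = "nonword", W = "word")))
```

Either call with blp_df in place of toy_df should give output like the following: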
# A tibble: 1,000 × 7
participant lex spell resp rt prev.rt rt.raw
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 20 nonword staud nonword 977 511 977
2 9 nonword dinbuss nonword 565 765 565
3 47 nonword snilling nonword 562 496 562
4 103 nonword gancens nonword 572 656 572
5 45 word filled word 659 981 659
6 73 word journals word 538 1505 538
7 24 word apache word 626 546 626
8 11 word flake word 566 717 566
9 32 word reliefs word 922 1471 922
10 96 nonword sarves nonword 555 806 555
# ℹ 990 more rows
Replacing summarize_(all|if|at) with summarize and across
The summarize (or summarise) function can be used to calculate summary statistics of variables. For example, to calculate the mean and standard deviation of the rt, and the median and MAD of rt.raw, we would do the following:
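A sketch of such a summary, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The output column names and the use of na.rm = TRUE for rt, which contains missing values, are assumptions:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# One row of summary statistics; each argument names a new summary column
rt_summary <- summarize(toy_df,
                        rt_mean    = mean(rt, na.rm = TRUE),
                        rt_sd      = sd(rt, na.rm = TRUE),
                        raw_median = median(rt.raw),
                        raw_mad    = mad(rt.raw))
rt_summary
```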
The summarize_all variant could be used to apply a summarization function to all variables. For example, to calculate the number of distinct elements in each variable, we would previously have used summarize_all:
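A sketch of the summarize_all form and its across equivalent with everything(), on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post):

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style
distinct_old <- summarize_all(toy_df, n_distinct)

# Current style: across(everything(), ...) inside summarize()
distinct_new <- summarize(toy_df, across(everything(), n_distinct))
distinct_new
```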
The use of across in summarize also replaces summarize_if and summarize_at. For example, to calculate the mean of the variables that begin or end with rt, we would previously have used summarize_at as follows:
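A sketch of the summarize_at form and its across equivalent, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The na.rm = TRUE argument is an assumption made here because rt contains missing values:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Superseded style
means_old <- summarize_at(toy_df, vars(matches("^rt|rt$")),
                          ~ mean(., na.rm = TRUE))

# Current style: across() inside summarize()
means_new <- summarize(toy_df, across(matches("^rt|rt$"),
                                      ~ mean(., na.rm = TRUE)))
means_new
```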
Note that if we want to apply multiple summarization functions to the variables selected by across, we can provide a list of functions as in the following example where we calculate the mean and the median of three selected variables:
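A sketch of passing a named list of functions to across, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The list names avg and med are arbitrary labels chosen for this example; by default across joins the column name and the list name with an underscore:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# Each selected column is summarized by each function in the list
multi <- summarize(toy_df, across(rt:rt.raw,
                                  list(avg = ~ mean(., na.rm = TRUE),
                                       med = ~ median(., na.rm = TRUE))))
names(multi)
```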
dplyr version 1.0 introduced the function c_across, which allows us to perform operations across rows. As an example, let us say we want to calculate the mean for each row of the three columns that begin or end with rt. We accomplish this by selecting the three columns inside c_across, which is put inside the mean function. However, it is necessary to first apply the rowwise function to the data frame to ensure that the mean operation is applied across rows:
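A sketch of the rowwise pattern, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The three columns are selected here with the range rt:rt.raw, and the new column is called rt_avg to match the output shown below:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# rowwise() makes mutate() operate one row at a time;
# c_across() gathers the selected columns of that row into one vector
row_means <- toy_df %>%
  rowwise() %>%
  mutate(rt_avg = mean(c_across(rt:rt.raw)))
row_means$rt_avg  # NA for the row whose rt is missing
```

The same pipeline with blp_df in place of toy_df should give output like the following: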
# A tibble: 1,000 × 8
# Rowwise:
participant lex spell resp rt prev.rt rt.raw rt_avg
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 20 N staud N 977 511 977 822.
2 9 N dinbuss N 565 765 565 632.
3 47 N snilling N 562 496 562 540
4 103 N gancens N 572 656 572 600
5 45 W filled W 659 981 659 766.
6 73 W journals W 538 1505 538 860.
7 24 W apache W 626 546 626 599.
8 11 W flake W 566 717 566 616.
9 32 W reliefs W 922 1471 922 1105
10 96 N sarves N 555 806 555 639.
# ℹ 990 more rows
In this example, we selected the appropriate columns using a range of column names. We could use any other way of selecting variables, including with where. For example, here we calculate the maximum values across all numeric variables in the data frame:
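A sketch of a rowwise maximum over the numeric columns, on the hypothetical stand-in toy_df (five rows copied from the blp_df printouts in this post). The new column name row_max is an assumption chosen for this example:

```r
library(dplyr)

# Stand-in for blp_df (assumption: five rows copied from the printouts above)
toy_df <- tibble(
  participant = c(20, 45, 32, 28, 97),
  lex         = c("N", "W", "W", "N", "W"),
  spell       = c("staud", "filled", "reliefs", "cown", "soda"),
  resp        = c("N", "W", "W", "W", "W"),
  rt          = c(977, 659, 922, NA, 436),
  prev.rt     = c(511, 981, 1471, 680, 447),
  rt.raw      = c(977, 659, 922, 498, 436)
)

# where(is.numeric) picks participant, rt, prev.rt and rt.raw for each row
row_maxima <- toy_df %>%
  rowwise() %>%
  mutate(row_max = max(c_across(where(is.numeric))))
row_maxima$row_max  # NA wherever a row contains a missing value
```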
Note: The blp_df data is from a lexical decision task. The lex variable indicates if the string is a real word or not. The resp variable indicates if the participant responded that the string is a word or not. Therefore, whenever lex and resp are identical, the participant was accurate in their response.