sawtooth <- function(x){
  if (x < 0){
    0
  } else if (x < 1/3) {
    3 * x
  } else if (x < 2/3) {
    3 * x - 1
  } else if (x <= 1) {
    3 * x - 2
  } else {
    0
  }
}
Iterations in R with lapply (etc) and purrr functionals
This post describes how to perform iterations using functionals in base R and in the purrr package. The base R functionals include, in particular, lapply and its variants such as sapply, vapply, and mapply. In the purrr package, the principal functional is map, of which there are numerous variants.
A common feature of all programming languages is the ability to perform iterations, whereby a code statement is executed repeatedly. In most programming languages, though not all, the standard way to do this is with a loop, usually a so-called for-loop. R allows us to write for-loops, and these are commonly used. However, R also provides us with various functionals, which are functions that take functions as input. In many cases, functionals can replace for-loops with single short statements that are often far easier to read, write, and understand. In base R, functionals are provided by the *apply family, including lapply, sapply, vapply, etc. More recently, the purrr package provides replacements for these *apply functionals that are arguably simultaneously more powerful and easier to use than the *apply family.
The ultimate aim of this post is to introduce the purrr functionals, which are powerful tools that every serious R user should know about and use. To properly introduce purrr functionals, it is necessary, and generally useful in any case, to also know about the *apply functionals they aim to replace. And to understand the *apply functionals, it is necessary to understand for-loops.
for-loops
As a simple but non-trivial example of for-loops in R, let us consider how we would use the sawtooth function defined at the top of this post.
This function implements a piecewise linear map: for any value \(x \in \mathbb{R}\), it maps it in a piecewise linear manner, to a value between 0 and 1. It is illustrated in the following figure.
Suppose we had the following vector of 1000 elements to which we wished to apply the sawtooth function.
N <- 1000
x <- seq(-0.1, 1.1, length.out = N)
Although it may seem like the obvious choice, we cannot do the following, because each if ... else statement in the function's body expects a single TRUE or FALSE condition rather than a vector of them.
sawtooth(x) # this won't work
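To see why the call fails, here is a minimal sketch using a hypothetical helper, `is_negative`, that has the same if/else structure as sawtooth. It assumes R 4.2.0 or later, where a condition of length greater than one is an error; earlier versions only issue a warning and use the first element, so the sketch catches both.

```r
# Hypothetical helper with the same if/else structure as sawtooth
is_negative <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

# Works for a single value
is_negative(-0.5)

# Fails (or warns, before R 4.2.0) for a vector, because `x < 0`
# is then a logical vector rather than a single TRUE or FALSE
outcome <- tryCatch(
  is_negative(c(-0.5, 0.5)),
  error = function(e) "error",
  warning = function(w) "warning"
)
outcome
```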
In order to apply sawtooth to all elements of x, we could in principle apply sawtooth to each element of x, one at a time, as follows.
# Create a vector of 0's of same length as x
# This can also be done with `y <- vector('double', N)`
y <- numeric(N)

y[1] <- sawtooth(x[1])
y[2] <- sawtooth(x[2])
y[3] <- sawtooth(x[3])
...
y[N] <- sawtooth(x[N]) # where N = 1000
It should be obvious that we want to avoid this code duplication at all costs. Instead, we can create a for-loop as follows.
y <- numeric(N)
for (i in 1:N) {
  y[i] <- sawtooth(x[i])
}
Essentially, this loop repeatedly executes the statement y[i] <- sawtooth(x[i]). On the first iteration, i takes the value of 1. On the second iteration, i takes the value of 2, and so on, until the final iteration, where i takes the value of N. In other words, for each value of i from 1 to N, we execute y[i] <- sawtooth(x[i]).
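One caveat with a sequence like 1:N, which is not an issue in the example above where N is 1000, is that it counts down when N is 0. The following small sketch illustrates why seq_len(N), or seq_along(x), is often suggested as a safer way to build the loop sequence. The variable `count` is introduced here purely for illustration.

```r
N <- 0

# 1:N yields c(1, 0) when N is 0, so a loop over it would run twice
length(1:N)

# seq_len(N) yields an empty sequence, so the loop body is skipped
count <- 0
for (i in seq_len(N)) {
  count <- count + 1
}
count
```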
The general form of a for loop is as follows.
for (<var> in <sequence>) {
<code body>
}
The for loop iteration begins with the for keyword followed by a round bracketed expression of the form (<var> in <sequence>), where <var> is what we’ll call the loop variable and <sequence> is (usually) a vector or list of items. After the round brackets is some code enclosed by {}. For each value in the <sequence>, the <var> is set to this value and the <code body> is executed. We could write this in pseudo-code as follows.
for each value in <sequence>
set <var> equal to this value
execute <code body>
In other words, the for loop executes the <code body> for each value in <sequence>, setting <var> to this value on each iteration.
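As a small illustration of this general form, the sketch below loops first over the values of a list directly, and then over its positions. The list `samples` and the variables `total` and `sums` are made up for this example.

```r
# A list whose elements have different lengths
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# Loop over the values themselves: `s` is set to each element in turn
total <- 0
for (s in samples) {
  total <- total + sum(s)
}
total

# Loop over positions with seq_along() when results must be stored
sums <- numeric(length(samples))
for (i in seq_along(samples)) {
  sums[i] <- sum(samples[[i]])
}
sums
```

Looping over positions is the pattern used with sawtooth above, because the result of each iteration has to be written into a specific slot of the output vector.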
From for-loops to apply functionals
Functionals are functions that take a function as input and return a vector. They play an important role in programming in R, and can replace many, though not all, for-loop constructions. There are many functionals in the base R language. Here, we will look at the most useful or widely used ones.
lapply
One of the most widely used functionals in base R is the lapply function, which takes two required arguments, a vector or list and a function, and then applies the function to each element in the vector or list and returns a new list. As an example, instead of using a for-loop, we could use lapply to apply the sawtooth function to each element of a vector x. Here’s the original for loop.
y <- numeric(N)
for (i in 1:N) {
  y[i] <- sawtooth(x[i])
}
We can replace this entirely using lapply as follows.
y <- lapply(x, sawtooth)
The lapply function returns a list with as many elements as there are elements in x. Should we prefer that y be a vector rather than a list, we can unlist it, but there are other options too for returning a vector from a functional, as we will see shortly.
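The following sketch shows the list-then-unlist step on a small input. It uses a hypothetical function `square` as a stand-in for sawtooth, since any function of a single value behaves the same way here.

```r
# Hypothetical stand-in for sawtooth: any function of a single value works
square <- function(v) v^2

x_small <- c(1, 2, 3)

# lapply returns a list with one element per input element
y_list <- lapply(x_small, square)
is.list(y_list)

# unlist flattens the list of length-one results into a numeric vector
y_vec <- unlist(y_list)
y_vec
```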
The general form of the lapply function is as follows.
returned_list <- lapply(<vector or list>, <function>)
Sometimes, we may need to use a function that takes arguments in lapply. As an example, let us imagine that we have a list of vectors and wish to calculate the trimmed mean of each one, ignoring missing values. The trimmed mean is the mean computed after a certain proportion of the highest and lowest elements have been removed. The trimmed mean of a single vector, where we remove 5% of values at each of the upper and lower extremes and also ignore missing values, is calculated as follows.
s <- c(10, 20, NA, 125, 35, 15)
mean(s, trim = 0.05, na.rm = T)
[1] 41
To use the trim = 0.05 and na.rm = T arguments when we use lapply, we supply them as optional arguments after the function name as follows.
data_vectors <- list(x = c(NA, 10, 11, 12, 1001, -20),
                     y = c(5, 10, 7, 2500, 6),
                     z = c(2, 4, 1000, 8, 5, 7, NA)
)
lapply(data_vectors, mean, trim = 0.05, na.rm = T)
$x
[1] 202.8
$y
[1] 505.6
$z
[1] 171
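Passing extra arguments after the function name is shorthand for wrapping the call in an anonymous function. The sketch below checks that the two forms give the same result on the data from this example. The names `via_dots` and `via_anon` are introduced only for the comparison.

```r
data_vectors <- list(x = c(NA, 10, 11, 12, 1001, -20),
                     y = c(5, 10, 7, 2500, 6),
                     z = c(2, 4, 1000, 8, 5, 7, NA))

# Extra arguments passed through lapply's `...`
via_dots <- lapply(data_vectors, mean, trim = 0.05, na.rm = TRUE)

# The same call written with an explicit anonymous function
via_anon <- lapply(data_vectors,
                   function(v) mean(v, trim = 0.05, na.rm = TRUE))

identical(via_dots, via_anon)
```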
Note that because data frames (and tibbles) are essentially lists of vectors, lapply can be easily used to apply a function to all columns of a data frame.
library(tidyverse)
data_df <- tibble(x = rnorm(10),
                  y = rnorm(10),
                  z = rnorm(10))
lapply(data_df, mean) %>%
  unlist()
x y z
0.24504010 -0.43967492 -0.05399682
The above calculation can also be accomplished with dplyr’s summarise, except that summarise returns a tibble.
data_df %>%
  summarise(across(everything(), mean))
# A tibble: 1 × 3
x y z
<dbl> <dbl> <dbl>
1 0.245 -0.440 -0.0540
sapply & vapply
As we’ve seen, lapply always returns a list. Sometimes, this returned list is a list of single values or a list of vectors of the same length and type. In these cases, it would be preferable to convert these lists to vectors or matrices. We saw this in the case of using sawtooth with lapply above, where we mentioned that we could manually convert the returned list into a vector using unlist. Two variants of lapply, namely sapply and vapply, can facilitate these conversions. The sapply function works like lapply but will attempt to simplify the list to a vector or a matrix if possible. In the following example, we use sapply to apply sawtooth to each element of x.
y <- sapply(x, sawtooth)
head(y, 5)
[1] 0 0 0 0 0
Here, because the list that would have been returned by lapply, had we used it here, is a list of length N of single numeric values (or numeric vectors of length 1), sapply can produce and return a numeric vector of length N.
If the result of lapply is a list of numeric vectors of the same length, then this list can be simplified to a matrix. In the following example, the list returned by lapply is like this.
data_df <- tibble(x = rnorm(100),
                  y = rnorm(100),
                  z = rnorm(100))
lapply(data_df, quantile)
$x
0% 25% 50% 75% 100%
-2.31932737 -0.70960382 -0.05431694 0.72976278 2.13348636
$y
0% 25% 50% 75% 100%
-2.82300012 -0.48106048 0.04061442 0.61636766 2.13286973
$z
0% 25% 50% 75% 100%
-2.4662529 -0.6690109 -0.1111398 0.5306826 2.1873352
If we use sapply here instead, the result is a matrix.
sapply(data_df, quantile)
x y z
0% -2.31932737 -2.82300012 -2.4662529
25% -0.70960382 -0.48106048 -0.6690109
50% -0.05431694 0.04061442 -0.1111398
75% 0.72976278 0.61636766 0.5306826
100% 2.13348636 2.13286973 2.1873352
Note that in some cases, sapply cannot simplify the list. In the following example, the list returned by lapply has elements of different lengths, so sapply returns the list unchanged.
X <- list(x = rnorm(2),
          y = rnorm(3),
          z = rnorm(4)
)
sapply(X, function(x) x^2)
$x
[1] 0.0503656 0.1913169
$y
[1] 1.7629691 0.2194715 1.7954867
$z
[1] 2.7736662 0.2904671 0.9987699 0.1166150
The vapply function is a safer version of sapply because it specifies the nature of the returned values of each application of the function. For example, we know the default returned value of quantile will be a numeric vector of length 5. We can specify this as the FUN.VALUE argument to vapply.
vapply(data_df, quantile, FUN.VALUE=numeric(5))
x y z
0% -2.31932737 -2.82300012 -2.4662529
25% -0.70960382 -0.48106048 -0.6690109
50% -0.05431694 0.04061442 -0.1111398
75% 0.72976278 0.61636766 0.5306826
100% 2.13348636 2.13286973 2.1873352
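What makes vapply safer becomes visible when a result does not match the declared FUN.VALUE. The following sketch, using a small made-up list `X`, shows that vapply stops with an error in that case, where sapply would have quietly returned a list.

```r
X <- list(x = c(1, 2), y = c(1, 2, 3))

# Each result is declared to be a single integer, and it is
vapply(X, length, FUN.VALUE = integer(1))

# Here the results have lengths 2 and 3, which contradicts numeric(1),
# so vapply stops with an error rather than returning a list
outcome <- tryCatch(
  vapply(X, function(v) v^2, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
outcome
```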
mapply & Map
The functionals lapply, sapply, and vapply take a function and apply it to each element in a vector or list. In other words, each element in the vector or list is supplied as the argument to the function. While we may, as we described above, have other arguments to the function set to fixed values for each function application, we cannot use lapply, sapply, or vapply to apply functions to two or more lists or vectors at the same time. Consider the following function.
power <- function(x, k) x^k
If we had a vector of x values, and set k to e.g. 5, we could do the following.
x <- c(2, 3, 4, 5)
sapply(x, power, k=5)
[1] 32 243 1024 3125
However, if we had the vector of x values and a vector of k values, and wished to apply each element of x and the corresponding element of k to power, we would need to use mapply, as in the following example.
x <- c(2, 3, 4, 5)
k <- c(2, 3, 2, 2)
mapply(power, x = x, k = k)
[1] 4 27 16 25
This mapply is therefore equivalent to the following for loop.
for (i in seq_along(x)){
power(x[i], k[i])
}
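The loop as written computes each value but does not keep it. A fuller sketch of the equivalence, assuming we store the results in a preallocated vector (here called `result`, a name chosen for this example), might look as follows.

```r
power <- function(x, k) x^k
x <- c(2, 3, 4, 5)
k <- c(2, 3, 2, 2)

# Preallocate the output, then fill it one element at a time
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- power(x[i], k[i])
}
result

# Same values as the mapply call
all.equal(result, mapply(power, x = x, k = k))
```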
With mapply, we can iterate over any number of lists of input arguments simultaneously. As an example, the random number generator function rnorm takes 3 arguments: n, mean, and sd. In the following, we apply rnorm to each set of corresponding values in three vectors of these arguments.
n <- c(2, 3, 5)
mu <- c(10, 100, 200)
sigma <- c(1, 10, 10)
mapply(rnorm, n = n, mean = mu, sd = sigma)
[[1]]
[1] 8.941373 9.958401
[[2]]
[1] 87.07172 91.06136 105.41909
[[3]]
[1] 209.6964 197.5501 208.6018 193.2416 194.9996
As we can see, we effectively execute rnorm(n=2, mean=10, sd=1), rnorm(n=3, mean=100, sd=10), and so on.
The Map function works just like mapply, with minor differences, such as never simplifying the results.
set.seed(101)
Map(rnorm, n = n, mean = mu, sd = sigma)
[[1]]
[1] 9.673964 10.552462
[[2]]
[1] 93.25056 102.14359 103.10769
[[3]]
[1] 211.7397 206.1879 198.8727 209.1703 197.7674
As we can see, Map and mapply are identical in their usage and in what they do. However, by default, mapply will attempt to simplify its output like sapply, if possible. We saw this with the use of the power function with mapply above. If we replace mapply with Map in that example, no simplification is applied, and we obtain a list as output.
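The difference can be checked directly with the power example. This sketch relies on the SIMPLIFY argument of mapply, which is not used elsewhere in this post; the names `simplified` and `as_list` are introduced for illustration.

```r
power <- function(x, k) x^k
x <- c(2, 3, 4, 5)
k <- c(2, 3, 2, 2)

# mapply simplifies to a numeric vector by default
simplified <- mapply(power, x = x, k = k)
simplified

# Map always returns a list
as_list <- Map(power, x = x, k = k)
is.list(as_list)

# Turning simplification off makes mapply behave like Map
identical(as_list, mapply(power, x = x, k = k, SIMPLIFY = FALSE))
```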
Functionals with purrr
The purrr package in the tidyverse provides functionals like those just covered, but which are consistent with one another in terms of how they are used, and also with how other tidyverse functions are used. In addition, purrr provides additional functional tools beyond those in base R. We can load purrr with library(purrr), but it is also loaded by library(tidyverse).
map
One of the main tools in purrr is map and its variants. It is very similar to lapply. It takes a list (or vector) and a function, and applies the function to each element of the list. To re-use an example from above, in the following, we apply sawtooth to each element in a vector x and return the results in a list.
y <- map(x, sawtooth)
This is identical to how we use lapply. As with lapply, we can supply arguments for the function being applied as optional arguments. For example, the trimmed mean example above where we used lapply can also be done with map.
map(data_vectors, mean, trim = 0.05, na.rm = T)
$x
[1] 202.8
$y
[1] 505.6
$z
[1] 171
When the list that is returned by map or its variants can be simplified to a vector, we can use the map_dbl, map_int, map_lgl, and map_chr variants of map to simplify the list to a vector of doubles, integers, logicals, or characters, respectively. In the following example, we apply functions that produce integers, logicals, doubles, or characters, to each element of a list of vectors.
data_lst <- list(x = c(1, 2, 3),
                 y = c(1L, 2L, 3L, 4L),
                 z = rnorm(100)
)
map_int(data_lst, length)
x y z
3 4 100
map_lgl(data_lst, is_integer)
x y z
FALSE TRUE FALSE
map_dbl(data_lst, mean)
x y z
2.00000000 2.50000000 -0.03547806
map_chr(data_lst, class)
x y z
"numeric" "integer" "numeric"
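The typed variants are strict about what they return. As a rough sketch, assuming purrr is installed and using a made-up list `vals`, the following shows map_dbl succeeding and map_int raising an error when the results are non-integer doubles. The exact error message differs between purrr versions, so only the fact of the error is captured.

```r
library(purrr)

vals <- list(a = 1.5, b = 2.5)

# map_dbl succeeds because each result is a single double
doubled <- map_dbl(vals, function(v) v * 2)
doubled

# map_int will not turn 1.5 or 2.5 into integers, so it raises an error
outcome <- tryCatch(
  map_int(vals, function(v) v),
  error = function(e) "error"
)
outcome
```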
purrr style anonymous functions
We saw above that we can use anonymous functions in *apply family functionals. This can be done in purrr functionals like the map family too, as we see in the following example.
map_dbl(data_lst,
function(x) mean(log(abs(x)))
)
x y z
0.5972532 0.7945135 -0.6096516
However, purrr provides syntactic sugar to allow us to rewrite this as follows.
map_dbl(data_lst,
~ mean(log(abs(.)))
)
x y z
0.5972532 0.7945135 -0.6096516
In other words, in place of the function(x), we have ~, and in place of the anonymous function’s input variable we have a single dot (.).
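For comparison, here is a sketch of three ways to write the same anonymous function. Two of them go beyond what is covered above and should be read as assumptions about the reader's setup: the `.x` pronoun, which purrr accepts in formulas as an alternative to the dot, and the `\(x)` shorthand, which requires R 4.1.0 or later.

```r
library(purrr)

data_lst <- list(x = c(1, 2, 3), y = c(1L, 2L, 3L, 4L))

# Full anonymous function
full <- map_dbl(data_lst, function(x) mean(log(abs(x))))

# purrr formula shorthand, using .x in place of the dot
tilde <- map_dbl(data_lst, ~ mean(log(abs(.x))))

# Base R lambda shorthand (assumes R >= 4.1.0)
lambda <- map_dbl(data_lst, \(x) mean(log(abs(x))))

all.equal(full, tilde)
all.equal(full, lambda)
```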
map2 & pmap
When we have two sets of input arguments, we can use map2, and when we have any number of sets of input arguments, we can use pmap. Both of these functions also have the _lgl, _int, _dbl, _chr, and other variants that we saw with map.
As an example of a map2 function, we’ll use the power function that takes two input arguments.
x <- c(2, 3, 4, 5)
k <- c(2, 3, 2, 2)
map2_dbl(x, k, power)
[1] 4 27 16 25
As an example of a pmap function, we can reimplement the rnorm based sampler that we originally wrote with mapply. The first argument to pmap is a list whose length is the number of arguments being passed to the function. If we want to iterate over different values of the n, mean, and sd arguments for rnorm, as we did in the example above, we’d set up a list like the following.
args <- list(n = c(2, 3, 5),
             mean = c(10, 100, 200),
             sd = c(1, 10, 10))
We then use pmap as follows.
pmap(args, rnorm)
[[1]]
[1] 9.432079 9.953257
[[2]]
[1] 98.43019 116.02242 107.68654
[[3]]
[1] 192.2837 193.6932 191.6972 194.0889 209.8109
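Because the list passed to pmap is named, its elements are matched to the arguments of rnorm by name rather than by position. The sketch below, which assumes purrr is installed and uses an arbitrary seed, checks this by reordering the list and confirming that the draws are unchanged. The names `samples` and `reordered` are introduced for the comparison.

```r
library(purrr)

args <- list(n = c(2, 3, 5),
             mean = c(10, 100, 200),
             sd = c(1, 10, 10))

set.seed(101)  # arbitrary seed, only so the two runs are comparable
samples <- pmap(args, rnorm)
lengths(samples)

# Reordering the list elements does not change the result,
# because each element is passed to rnorm by its name
set.seed(101)
reordered <- pmap(args[c("sd", "n", "mean")], rnorm)
identical(samples, reordered)
```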
Citation
@online{andrews2021,
author = {Andrews, Mark},
title = {Iterations in {R} with Lapply (Etc) and Purrr Functionals},
date = {2021-07-06},
url = {https://www.mjandrews.org/notes/functionals/},
langid = {en}
}