Bootstrap Confidence Intervals
Bootstrapping is any one of a set of inter-related methods for statistical inference that use a finite sample from a population as a proxy for the population itself. Using the sample as a stand-in for the population in order to infer properties of that population is the reason it is called bootstrapping. Bootstrapping was first proposed in Efron (1979).
The most widely used and familiar bootstrapping methods are extremely simple, both conceptually and practically, as we will see. There are, however, almost countless variants of bootstrapping methods, and there are many books devoted solely to describing these methods in detail (see, for example, Efron and Tibshirani 1993; Davison and Hinkley 1997; Chernick 2008). Here, we will describe how bootstrapping can be used to calculate confidence intervals. We will focus on the simplest methods, but these also happen to be the most widely used and familiar in practice. As we explain below, the key feature of these methods is that we repeatedly resample from a sample, calculate statistics using these repeated samples, and then use various properties of the distribution of these statistics to calculate confidence intervals.
Background: Estimators, sampling distributions, and confidence intervals
To understand bootstrapped confidence intervals, it helps to clarify some background details and context related to frequentist statistical inference.
Let us assume we have observed \(n\) values \[ y_{1:n} = y_1, y_2 \ldots y_i \ldots y_n, \] where each \(y_i\) is a value on the real line, i.e. \(y_i \in \mathbb{R}\) for all \(i \in 1\ldots n\). For simplicity, let us also assume that these \(n\) values are drawn independently from some unknown probability distribution over \(\mathbb{R}\), which we will denote by \(f\).
Suppose that we wish to estimate the mean of the probability distribution \(f\), which we will denote by \(\theta\). The obvious estimator of \(\theta\) is the sample mean of \(y_{1:n}\), which we will denote by \(\hat{\theta}\). Of course, the value of the sample mean \(\hat{\theta}\) is not always going to be exactly the same as the population mean \(\theta\). For every given sample of \(n\) values from \(f\), the sample mean \(\hat{\theta}\) will be slightly different. Sometimes it will be higher, and sometimes lower, than the population mean \(\theta\). The sampling distribution of \(\hat{\theta}\) tells us how much \(\hat{\theta}\) varies across repeated samples of \(n\) values from \(f\). One important property of the sampling distribution is its standard deviation, which is known as the standard error. The standard error is the key to many types of inferential statistics, including the confidence interval. For example, if the sampling distribution is normally distributed, which will be the case in general as \(n \to \infty\), the 95% confidence interval for \(\theta\) is given by \[ \hat{\theta} \pm 1.96 \times \mathrm{se}, \] where \(\mathrm{se}\) is the standard deviation of the sampling distribution of the sample mean \(\hat{\theta}\), and where the 1.96 multiplier is due to the 2.5th and 97.5th percentiles in a normal distribution being 1.96 standard deviations below and above, respectively, the mean. More generally, in this normally distributed sampling distribution case, the \(1-\alpha\) confidence interval is \[ \hat{\theta} \pm z_{1-\alpha/2} \times \mathrm{se}, \] where \(z_{1-\alpha/2}\) is the value below which lies \(1-\alpha/2\) of the mass of a standard normal distribution. This method of calculating the confidence interval is sometimes known as the standard interval method.
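As a minimal sketch of this standard interval formula in R, using purely illustrative values for the estimate and standard error (these numbers are not taken from any analysis below), we can obtain the \(z_{1-\alpha/2}\) multiplier with qnorm and form the interval as follows:
# purely illustrative values; not taken from any data set used below
estimate <- -4.8    # a hypothetical estimate of the population mean
std_error <- 0.3    # a hypothetical standard error
qnorm(0.975)        # the z multiplier for a 95% interval; approximately 1.96
estimate + c(-1, 1) * qnorm(0.975) * std_error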
As a concrete illustration of this, let us assume that the probability distribution \(f\) is an exponentially modified Gaussian distribution, also known as an exGaussian distribution. This distribution is roughly like a skewed normal distribution. More precisely, the exGaussian distribution is the distribution of the sum of two independent random variables, one of which is normally distributed, and the other of which has an exponential distribution. It has three parameters, which we will denote by \(\mu\), \(\sigma\), and \(\lambda\). These are, respectively, the mean and standard deviation of the normally distributed variable, and the rate parameter of the exponential distribution. A number of R packages provide functions for its probability density function, and one such package is emg, where the density function is demg. In the next figure, we plot the density function of the exGaussian when \(\mu = -5\), \(\sigma = 1.5\), and \(\lambda = 5\), and also show the mean of this distribution. This is produced by the following code:
set.seed(10101) # for reproducibility
library(tidyverse)
theme_set(theme_classic())

mu <- -5; vsigma <- 1.5; lambda <- 5

exgauss_df <- tibble(x = seq(-10, 15, length.out = 1000),
                     p = emg::demg(x, mu, vsigma, lambda))

# the population mean
theta <- mu + 1/lambda

# density at the mean
dtheta <- emg::demg(theta, mu, vsigma, lambda)

ggplot(exgauss_df,
       mapping = aes(x = x, y = p)
) + geom_line() +
  geom_segment(x = theta,
               xend = theta,
               y = 0,
               yend = dtheta,
               colour = 'red', size = 0.5, linetype = 'dashed')
The sampling distribution of the sample mean for the exGaussian with parameters \(\mu = -5\), \(\sigma = 1.5\), \(\lambda = 5\) can be approximated to an arbitrary degree of accuracy by repeatedly sampling \(n\) values from this distribution and calculating the sample mean of each sample. The following code draws samples of size \(n=25\) from this exGaussian 10,000 times, and calculates the sample mean in each case.
mean_sampling_distribution <- replicate(
  10000,
  emg::remg(25, mu, vsigma, lambda) %>% mean()
)
The histogram of this sampling distribution is shown in the next figure, where we also show the population mean by the dashed red vertical line.
tibble(mean_sampling_distribution) %>%
ggplot(aes(x = mean_sampling_distribution)) +
geom_histogram(bins = 50, colour = 'white', fill = 'grey60') +
geom_vline(xintercept = theta, colour = 'red', linetype = 'dashed')
The standard deviation of this sampling distribution, i.e. the standard error, can be calculated simply by calculating the sample standard deviation of the 10,000 sample means:
se <- sd(mean_sampling_distribution)
se
[1] 0.3033802
Now that we know the standard error \(\mathrm{se}\), if we observe a particular sample of size \(n=25\) from the exGaussian with parameters \(\mu = -5\), \(\sigma = 1.5\), \(\lambda = 5\), then, assuming that the sampling distribution is approximately normal, the mean of this sample, \(\hat{\theta}\), gives us the approximate 95% confidence interval for \(\theta\) as follows: \[ \hat{\theta} \pm 1.96 \times \mathrm{se}. \] For example, we can obtain a sample of \(n=25\) values from this exGaussian as follows:
y <- emg::remg(25, mu, vsigma, lambda)
The mean of this sample is
mean(y)
[1] -5.186576
and so the approximate 95% confidence interval based on this sample is \[ (-5.78, -4.59), \] which we can calculate in R as follows:
mean(y) + c(-1, 1) * 1.96 * se
[1] -5.781202 -4.591951
Obviously, the procedure we just followed to calculate approximate confidence intervals for \(\theta\) cannot be followed in practice. In practice, we do not know the population probability distribution \(f\), and so we cannot numerically calculate the sampling distribution of an estimator like the sample mean. We can still sometimes rely on theoretical results to estimate the standard error, or the sampling distribution more generally. For example, if the parameter of interest is the mean of a normal distribution, and the estimator is the sample mean, well known theoretical results tell us that the standard error is \(\sigma/\sqrt{n}\), where \(\sigma\) is the standard deviation of the normal distribution, and this can be approximated by \(s/\sqrt{n}\), where \(s\) is the sample standard deviation. However, theoretical results like this are not always available. For example, in the case of the exGaussian distribution, and for many other distributions, we do not have an analytic form for the sampling distribution of the sample mean.
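As an aside, and purely for comparison with the simulation based interval above, the following sketch applies the \(s/\sqrt{n}\) approximation to the sample y. Because the exGaussian is not a normal distribution, this is only a rough approximation here.
# normal-theory approximation: the standard error of the sample mean is s/sqrt(n)
n <- length(y)
se_normal_theory <- sd(y) / sqrt(n)
mean(y) + c(-1, 1) * qnorm(0.975) * se_normal_theory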
Bootstrapping
In cases where we do not have an analytic form for the sampling distribution of an estimator, we can use bootstrapping. In bootstrapping, we use the empirical distribution of the data, which we will denote by \(\hat{f}\), as an approximation to the population distribution \(f\). The empirical distribution is simply the probability distribution of the data itself. In other words, for a sample of \(n\) values, the empirical distribution puts a point mass of value \(1/n\) at each of the \(n\) values. For the sample y of \(n=25\) values that we obtained above, the plot of the empirical distribution is as follows:
tibble(y, p = 1/length(y)) %>%
ggplot() +
geom_segment(aes(x = y, xend = y, y = 0, yend = p)) +
geom_point(aes(x = y, y = p))
With \(\hat{f}\) as our approximation to \(f\), we can now repeatedly sample \(n\) values from \(\hat{f}\), just as we did from \(f\) above, and calculate the sample mean \(\hat{\theta}\) of each sample. This set of values, which we call the bootstrap distribution, is then the approximation to the sampling distribution of the sample mean. Sampling \(n\) values from the empirical distribution \(\hat{f}\) is identical to randomly sampling with replacement \(n\) values from the data itself. In R, to randomly sample with replacement \(n = 25\) values from the data vector y, we can simply use the function sample with the replace = TRUE optional argument as follows:
# sample n values, where n is the length of the vector `y`,
# from the empirical distribution of `y`
sample(y, replace = TRUE)
[1] -6.431721 -2.737858 -3.225437 -6.518722 -6.037151 -6.431721 -5.956708
[8] -6.242245 -2.592939 -2.132212 -9.698413 -6.037151 -5.484444 -6.242245
[15] -3.225437 -5.010289 -5.484444 -5.956708 -6.882356 -5.545034 -5.545034
[22] -5.637776 -7.271790 -3.225437 -7.271790
We can repeat this sampling 10,000 times, as we did above when we sampled from the population distribution, calculating the sample mean in each case, as follows:
bootstrap_distribution_of_mean <- replicate(
  10000,
  sample(y, replace = TRUE) %>% mean()
)
The histogram of this bootstrap distribution is shown in the next figure, where we also show the population mean \(\theta\) with a dashed red vertical line and the sample mean \(\hat{\theta}\) with a dashed blue vertical line. This is produced by the following code:
tibble(bootstrap_distribution_of_mean) %>%
ggplot(aes(x = bootstrap_distribution_of_mean)) +
geom_histogram(bins = 50, colour = 'white', fill = 'grey60') +
geom_vline(xintercept = theta, colour = 'red', linetype = 'dashed') +
geom_vline(xintercept = mean(y), colour = 'blue', linetype = 'dashed')
Note how the bootstrap distribution is not centred around \(\theta\), but is instead centred around \(\hat{\theta}\). This is different to the true sampling distribution above, which, at least in this example, is centred around \(\theta\). This, in fact, turns out to be an important result, as we will see below.
We can now use this bootstrap distribution of the estimator \(\hat{\theta}\) to calculate confidence intervals for \(\theta\). There are, in general, many ways to do this, but the two simplest are the standard interval method and the percentile method. The percentile method is arguably the most widely used bootstrapping method of any kind, so much so that it is often seen as the definition of bootstrapping in general.
Standard interval method
The standard deviation of the bootstrap distribution, which is our bootstrap based approximation of the standard error, is calculated simply as follows:
boot_se <- sd(bootstrap_distribution_of_mean)
boot_se
[1] 0.3333728
and so the bootstrapped confidence interval, using the standard interval method described above, is calculated as follows:
mean(y) + c(-1, 1) * 1.96 * sd(bootstrap_distribution_of_mean)
[1] -5.839987 -4.533166
As we can see, this is very similar to the confidence interval calculated above where we used the standard deviation of the true, and in general unknown, sampling distribution.
Percentile method
We can, however, also calculate the bootstrap confidence interval just from the percentiles of the bootstrap distribution. This is known as the percentile method for calculating the bootstrap confidence interval (see, for example, Efron and Tibshirani 1993, chap. 13; Efron and Hastie 2016, chap. 11). For example, the 95% bootstrap confidence interval calculated using the percentile method is obtained by finding the 2.5th and 97.5th percentiles of the bootstrap distribution. Using our bootstrap distribution, the percentile method 95% bootstrap confidence interval is calculated as follows:
quantile(bootstrap_distribution_of_mean,
probs = c(0.025, 0.975))
2.5% 97.5%
-5.851427 -4.537473
Clearly, this is very similar to the previous method that used the bootstrapped standard error.
The percentile method is very simple and very generally applicable. For any estimator of interest, we can create the bootstrap distribution of that estimator by, as we just demonstrated, repeatedly sampling from the empirical distribution and calculating the estimator for each sample; we then find the \(\alpha/2\) and \(1-\alpha/2\) percentiles of the bootstrap distribution, which give us the \(1-\alpha\) bootstrap confidence interval for the parameter that we are estimating.
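This recipe is easily wrapped in a small helper function. The following is just a sketch, and percentile_ci is a hypothetical name rather than a function from any package; it simply packages the resample, estimate, and quantile steps just described.
# a sketch of the percentile method for an arbitrary estimator;
# `percentile_ci` is a hypothetical helper function, not from any package
percentile_ci <- function(y, estimator, n_boot = 10000, level = 0.95) {
  # the bootstrap distribution: the estimator applied to resamples of y
  boot_dist <- replicate(n_boot, estimator(sample(y, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_dist, probs = c(alpha/2, 1 - alpha/2))
}

# for example, a 95% percentile interval for the mean of `y`
percentile_ci(y, mean)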
For example, suppose we want confidence intervals for the median, the 10% trimmed mean, the standard deviation, the skewness, or any other parameter or property of the population. We can proceed by first repeatedly sampling from the empirical distribution \(\hat{f}\):
bootstrap_samples <- replicate(
  10000,
  sample(y, replace = TRUE),
  simplify = FALSE
)
Note that by specifying simplify = FALSE as an optional argument to replicate, the resulting bootstrap_samples R object is a list with 10,000 elements, each of which is a vector of 25 values. To obtain the 95% confidence interval for, for example, the median, we can use map_dbl to apply the function median to all 10,000 vectors of 25 values, which returns a vector of 10,000 values that can then be passed to quantile:
bootstrap_samples %>% map_dbl(median) %>% quantile(probs = c(0.025, 0.975))
2.5% 97.5%
-5.956708 -4.754244
We can perform this calculation for other estimators by replacing the function median with other functions. We can also put these functions in a list and iterate over them with map_dfr, applying each one to the bootstrap samples and calculating the percentile interval, as follows:
map_dfr(
list(median = median,
trim_mean = ~mean(., trim = 0.1),
stdev = sd,
skewness = moments::skewness),
~map_dbl(bootstrap_samples, .) %>%
quantile(probs = c(0.025, 0.975)),
.id = 'estimator'
)
# A tibble: 4 × 3
estimator `2.5%` `97.5%`
<chr> <dbl> <dbl>
1 median -5.96 -4.75
2 trim_mean -5.82 -4.47
3 stdev 1.18 2.19
4 skewness -0.859 1.03
Why does the percentile method work?
The percentile method is, in general, a relatively crude bootstrapping method. Although it is generally better than the standard interval method, there are many other bootstrapping methods that are more precise (see, for example, Efron 1987; Diciccio and Romano 1988; DiCiccio and Efron 1996). Nonetheless, it is extremely simple, both conceptually and practically. Moreover, in practice, a relatively crude approximation may well be sufficient, and if it can be achieved almost effortlessly in a very wide range of problems, then the method is of considerable practical value. It is interesting to consider, therefore, why this very simple method works at all.
Let us begin by assuming, for simplicity, that the sampling distribution of \(\hat{\theta}\) is normally distributed with a mean of \(\theta\). In other words, we have \[ \hat{\theta} \sim N(\theta, \tau^2), \] where \(\tau\) is the standard error. As mentioned above, from this we can construct the \(1-\alpha\) confidence interval for \(\theta\) as follows: \[ \hat{\theta} \pm z_{1-\alpha/2} \times \tau. \] The values \(\hat{\theta} - z_{1-\alpha/2} \tau\) and \(\hat{\theta} + z_{1-\alpha/2} \tau\) are, respectively, the \(\alpha/2\) and the \(1-\alpha/2\) percentiles of the normal distribution centred at \(\hat{\theta}\) whose standard deviation is \(\tau\). If the bootstrap distribution of \(\hat{\theta}\) is approximately normal, centred at \(\hat{\theta}\), and with a standard deviation of approximately \(\tau\), as was the case in the example above, then the \(\alpha/2\) and the \(1-\alpha/2\) percentiles of this distribution will be approximately equal to the confidence interval limits \(\hat{\theta} - z_{1-\alpha/2} \times \tau\) and \(\hat{\theta} + z_{1-\alpha/2} \times \tau\).
In general, we cannot assume that the sampling distribution of \(\hat{\theta}\) is normally distributed. However, suppose there is an invertible, monotonically increasing function \(m\) such that \(\phi = m(\theta)\) and \(\hat{\phi} = m(\hat{\theta})\), and suppose that \(\hat{\phi} \sim N(\phi, \tau_{\phi}^2)\). Then \(\hat{\phi} \pm z_{1-\alpha/2} \times \tau_\phi\) is the \(1-\alpha\) confidence interval for \(\phi\). If the bootstrap distribution of \(\hat{\phi}\) is approximately normal, centred at \(\hat{\phi}\), and with a standard deviation of approximately \(\tau_\phi\), then the \(\alpha/2\) and \(1-\alpha/2\) percentiles of this distribution, which we will denote by \(\hat{\phi}_{\alpha/2}\) and \(\hat{\phi}_{1-\alpha/2}\) respectively, will be approximately equal to the \(1-\alpha\) confidence interval for \(\phi\). Given that, by the definition of the confidence interval, \(P(\hat{\phi}_{\alpha/2} \leq \phi \leq \hat{\phi}_{1-\alpha/2}) = 1-\alpha\), it follows that \(P(m^{-1}(\hat{\phi}_{\alpha/2}) \leq \theta \leq m^{-1}(\hat{\phi}_{1-\alpha/2})) = 1-\alpha\), and \(m^{-1}(\hat{\phi}_{\alpha/2})\) and \(m^{-1}(\hat{\phi}_{1-\alpha/2})\) are the \(\alpha/2\) and the \(1-\alpha/2\) percentiles of the bootstrap distribution of \(\hat{\theta} = m^{-1}(\hat{\phi})\).
This means that we do not need to know the function \(m\); we just need to assume that such an \(m\) exists, at least approximately. In other words, if we know or can assume that some function \(m\) exists such that \(m(\hat{\theta})\) is normally distributed, or approximately normally distributed, around \(m(\theta)\), then the \(\alpha/2\) and the \(1-\alpha/2\) percentiles of the bootstrap distribution of \(\hat{\theta}\) will provide the approximate \(1-\alpha\) confidence interval for \(\theta\).
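The property being exploited here is that percentiles are, to a close approximation, invariant under monotonic transformations. As a small illustration of this, added here and reusing the bootstrap_samples list from above, the percentile interval for the sample standard deviation agrees, up to interpolation error in quantile, with the back-transformed percentile interval for its logarithm:
# percentile intervals are (approximately) invariant under monotonic
# transformations, which is why we never need to know the function m itself
boot_sd <- map_dbl(bootstrap_samples, sd)  # bootstrap distribution of the sd
boot_log_sd <- log(boot_sd)                # bootstrap distribution of log(sd)

# the direct percentile interval for the sd ...
quantile(boot_sd, probs = c(0.025, 0.975))
# ... versus the back-transformed percentile interval for log(sd)
exp(quantile(boot_log_sd, probs = c(0.025, 0.975)))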
Bootstrapping in practice in R
In the examples above we used a combination of replicate and sample to make our bootstrap samples. This is already very simple, but there are functions in various R packages that can make things even easier. One such function is bootstrap in the modelr package, which is especially useful when working with data frames. The modelr package is part of the tidyverse package collection, so when tidyverse is installed, so too is modelr. It is not, however, loaded automatically by library(tidyverse), so we must load it explicitly in the usual manner:
library(modelr)
The modelr::bootstrap function can be used to draw bootstrap samples from a data frame. For example, if we want to draw 10 bootstrap samples from the cars data set, we could do the following:
bootstrap(cars, n = 10)
# A tibble: 10 × 2
strap .id
<list> <chr>
1 <resample [50 x 2]> 01
2 <resample [50 x 2]> 02
3 <resample [50 x 2]> 03
4 <resample [50 x 2]> 04
5 <resample [50 x 2]> 05
6 <resample [50 x 2]> 06
7 <resample [50 x 2]> 07
8 <resample [50 x 2]> 08
9 <resample [50 x 2]> 09
10 <resample [50 x 2]> 10
A tibble is returned with a list column named strap. Each element of this column is a single bootstrap sample of cars. We can iterate over the elements of the strap column in many ways, but perhaps the easiest, or at least the most tidyverse, way is to use the map functionals from purrr, which we also used in some examples above. For example, if we wanted to calculate bootstrap confidence intervals for \(R^2\) in a linear model predicting dist from speed, we would fit lm(dist ~ speed) to each bootstrap sample and then extract the \(R^2\) value from that model. This could be done as follows, where we use 10,000 bootstrap samples:
get_rsq <- function(m) summary(m)$r.sq

boot_rsq <- bootstrap(cars, n = 10000) %>%
  {map_dbl(.$strap,
           ~lm(dist ~ speed, data = .) %>% get_rsq()
  )}
The histogram of this bootstrap distribution, with the lower and upper bounds of the 95% percentile confidence interval shown by red vertical lines, is in the next figure, which can be produced by the following code.
tibble(boot_rsq) %>%
ggplot(aes(x = boot_rsq)) +
geom_histogram(bins = 50, colour = 'white', fill = 'grey60') +
geom_vline(xintercept = quantile(boot_rsq, probs = 0.025), colour = 'red') +
geom_vline(xintercept = quantile(boot_rsq, probs = 0.975), colour = 'red') +
labs(x = latex2exp::TeX('$R^2$'))
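If we want the numerical values of those bounds, rather than just the figure, the 95% percentile interval for \(R^2\) can be obtained directly from the bootstrap distribution in the same way as in the earlier examples:
# 95% percentile bootstrap confidence interval for R^2
quantile(boot_rsq, probs = c(0.025, 0.975))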
Given that we might like to obtain the bootstrap distribution for many different quantities in any lm model, we can make a function that takes an lm model and refits it to bootstrap samples of its data:
bootstrap_lm <- function(model, n = 10000){
  boot <- bootstrap(model$model, n)
  map(boot$strap, ~lm(formula(model), data = .))
}
In this, we rely on the fact that the data used in an lm model is available as the model element of the lm object. Likewise, the formula used in the lm model is available by applying the function formula to the lm object. Thus, to obtain bootstrapped versions of any lm model, we can do the following, for example:
M <- lm(dist ~ speed, data = cars)
bootstrapped_M <- bootstrap_lm(M)
With this, using purrr::map and related functionals, we can calculate the bootstrap distribution of any function of the model. For example, if we wanted the bootstrap distribution of the residual standard deviation, we could do the following, which uses the sigma function to extract the residual standard deviation from an lm model.
bootstrapped_sigma <- map_dbl(bootstrapped_M, sigma)
tibble(bootstrapped_sigma) %>%
ggplot(aes(x = bootstrapped_sigma)) +
geom_histogram(bins = 50, colour = 'white', fill = 'grey60') +
geom_vline(xintercept = quantile(bootstrapped_sigma, probs = 0.025), colour = 'red') +
geom_vline(xintercept = quantile(bootstrapped_sigma, probs = 0.975), colour = 'red') +
labs(x = latex2exp::TeX('Residual standard deviation: $\\sigma$'))
We can then also use bootstrapped_M to calculate the bootstrap confidence intervals for any set of functions in a list. As an example, in the following code, we calculate the bootstrap confidence intervals for the residual standard deviation, \(R^2\), adjusted \(R^2\), and the AIC:
map_dfr(
list(residual_sd = sigma,
r2 = ~summary(.)$r.sq,
adj_r2 = ~summary(.)$adj.r.sq,
aic = AIC),
~map_dbl(bootstrapped_M, .) %>%
quantile(probs = c(0.025, 0.975)),
.id = 'estimator'
)
# A tibble: 4 × 3
estimator `2.5%` `97.5%`
<chr> <dbl> <dbl>
1 residual_sd 11.2 18.5
2 r2 0.489 0.780
3 adj_r2 0.478 0.775
4 aic 388. 438.
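Finally, it is worth noting that the more refined bootstrap intervals mentioned earlier, such as Efron's bias-corrected and accelerated (BCa) interval, are available ready-made in the boot package. The following is a minimal sketch, assuming the boot package is installed, that compares the percentile and BCa intervals for the mean of the sample y used above.
# compare the percentile and BCa intervals for the mean of `y` using `boot`
library(boot)
boot_out <- boot(y,
                 statistic = function(data, indices) mean(data[indices]),
                 R = 10000)
boot.ci(boot_out, conf = 0.95, type = c("perc", "bca"))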