---
output:
  html_document:
    df_print: paged
    code_download: TRUE
    toc: true
    toc_depth: 1
editor_options:
  chunk_output_type: console
---
# Bootstrapping with R
```{r}
if (!requireNamespace("boot", quietly = TRUE)) install.packages("boot") # install only if missing
library(tidyverse)
library(boot)
```
## Bootstrap Resampling
### **Resampling approach to statistical inference**
- **Resampling** is the creation of new samples based on an observed sample.
- **Bootstrap resampling** is *random sampling with replacement.*
- The **core assumption of the bootstrap** is that the randomness in your data, and therefore the statistical uncertainty in your answer, arises from the process of sampling. While the bootstrap isn't explicitly designed for anything else, it actually provides a pretty good approximation for *other* common forms of randomness as well, such as experimental randomization, measurement error, or intrinsic variability of some natural process (e.g. your heart rate).
Most of the time, we can't feasibly take repeated samples from the same random process that generated our data, to see how our estimate changes from one sample to the next. But we can repeatedly take *resamples from the sample itself*, and apply our estimator afresh to each notional sample. The variability of the estimates across all these resamples can then be used to approximate our estimator's true sampling distribution.
In short, bootstrapping infers results for a population from results computed on a collection of random resamples of the observed sample, each drawn with replacement.
Image Source: *Data Science in R: A Gentle Introduction by J.G. Scott*
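To make the idea concrete, here is a minimal sketch on simulated data. The population, its parameters and all variable names below are made up for illustration and are not part of the NHANES example. Because we control the population, we can compare the spread of the estimate across genuinely new samples with the spread across resamples of a single sample.

```{r}
set.seed(123)
n <- 200

# Hypothetical population: sleep-like values with mean 7 and sd 1.3 (made-up numbers)
draw_sample <- function() rnorm(n, mean = 7, sd = 1.3)

# "True" sampling variability: draw many fresh samples and recompute the mean each time
means_from_new_samples <- replicate(1000, mean(draw_sample()))

# Bootstrap: take ONE sample and resample it with replacement, keeping the size n
one_sample <- draw_sample()
means_from_resamples <- replicate(1000, mean(sample(one_sample, size = n, replace = TRUE)))

# The two standard deviations should be of similar size
sd(means_from_new_samples)
sd(means_from_resamples)
```

In practice only the second calculation is available, since we rarely get to draw fresh samples from the population.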
### **Key Properties of a Bootstrap**
Each block of `N` resampled data points is called a "bootstrap sample." To bootstrap, we write a computer program that repeatedly resamples our original sample and recomputes our estimate for each bootstrap sample. For this to work, every bootstrap sample must satisfy two **key properties**:
1. Each bootstrap sample must be of the same size (N) as the original sample. Remember, we have to approximate the randomness in our data-generating process, and the sample size is an absolutely fundamental part of that process.
2. Each bootstrap sample must be taken **with replacement** from the original sample. The intuition here is that each bootstrap sample will have its own random pattern of duplicates and omissions compared with the original sample, creating *synthetic* sampling variability that approximates *true* sampling variability.
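The two properties can be seen on a tiny made-up vector (the values and names below are purely illustrative):

```{r}
set.seed(1)
toy_sample <- c(2, 4, 6, 8, 10)

# Property 1: same size as the original. Property 2: drawn with replacement.
toy_resample <- sample(toy_sample, size = length(toy_sample), replace = TRUE)
toy_resample

# How often each original value appears in the resample
# (0 = omitted, 2 or more = duplicated)
toy_counts <- table(factor(toy_resample, levels = toy_sample))
toy_counts
```

Rerunning the `sample()` call without resetting the seed gives a different pattern of duplicates and omissions each time.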
## DIY Bootstrap
Import the data in *NHANES_sleep.csv*. This file contains a sliver of data from the National Health and Nutrition Examination Survey, known as NHANES.
```{r}
NHANES_sleep <- read.csv("NHANES_sleep.csv")
View(NHANES_sleep)
names(NHANES_sleep)
```
The `NHANES_sleep` file contains information on people's gender, age, self-reported race/ethnicity, and home ownership status. It also has a few pieces of health information: 1) the self-reported number of hours of sleep each study participant usually gets at night on weekdays or workdays; 2) whether the respondent has smoked 100 or more cigarettes in their life (yes or no); and 3) the self-reported number of days per month on which the participant felt down, depressed or hopeless.
### Example: sample mean
The first question we'll address is: how well are Americans sleeping, on average?
```{r}
hist(NHANES_sleep$SleepHrsNight)
```
```{r}
sample_mean <- mean(NHANES_sleep$SleepHrsNight, na.rm = TRUE)
sample_mean
```
However, this is just a survey and we clearly have some uncertainty in generalizing this number to the wider American population.
How much? To get a rough idea, let's take a single bootstrap sample to simulate the randomness of our data-generating process, like this:
```{r}
resample <- sample(NHANES_sleep$SleepHrsNight, size=nrow(NHANES_sleep), replace = TRUE)
resample_mean <- mean(resample)
resample_mean
```
```{r}
sampling_error <- resample_mean - sample_mean # bootstrap estimate minus original estimate
sampling_error
sampling_error*60 # time in minutes (approx)
```
This difference represents a sampling error - or more precisely, it represents a *bootstrap* sampling error, which is an approximation to an *actual* sampling error.
So we've already learned something useful: our survey result of 6.88 hours per night could easily differ from the true population average by about 3 minutes, just because of the uncertainty inherent to sampling.
If I run the code above multiple times, I get a slightly different mean every time due to sampling randomness. So let's repeat the resampling over several iterations to get a fuller picture of the bootstrap sampling error:
```{r}
iterations <- 100
resampled_means_vector <- numeric(iterations) # one slot per resample
for (i in seq_len(iterations)) {
  resample <- sample(NHANES_sleep$SleepHrsNight,
                     size = nrow(NHANES_sleep), # size should be identical to original data
                     replace = TRUE)            # sample with replacement
  resample_mean <- mean(resample)
  resampled_means_vector[i] <- resample_mean
}
hist(resampled_means_vector)
mean(resampled_means_vector)
```
This histogram represents our *bootstrap sampling distribution*, which is designed to approximate the *true* sampling distribution.
Now, let's re-calculate the sampling error:
```{r}
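errors <- resampled_means_vector - sample_mean
hist(errors)
mean(errors) # average error; close to zero because the errors scatter on both sides
mean(errors)*60 # time in minutes (approx)
sd(errors)*60 # typical size of the error in minutes (the bootstrap standard error)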
```
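The spread of the bootstrap distribution is what quantifies our uncertainty. As a sketch, assuming simulated sleep-like data in place of the survey (the vector `sim_sleep` and its parameters are invented for illustration), the same resampling can be written compactly with `replicate()`, and the spread summarised by a standard error and a simple percentile interval:

```{r}
set.seed(2024)
sim_sleep <- round(rnorm(500, mean = 7, sd = 1.3)) # hypothetical data, not NHANES

# 1000 bootstrap means: resample with replacement at the original sample size
sim_boot_means <- replicate(
  1000,
  mean(sample(sim_sleep, size = length(sim_sleep), replace = TRUE))
)

# Bootstrap standard error: the standard deviation of the bootstrap estimates
sim_se <- sd(sim_boot_means)
sim_se

# Simple percentile interval: the middle 95% of the bootstrap means
sim_interval <- quantile(sim_boot_means, probs = c(0.025, 0.975))
sim_interval
```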
### EXERCISE 1
Provide an estimate of the mean age of female respondents in the NHANES survey by bootstrap resampling 100 times.
```{r}
# subset the data for female respondents
gender_f <-
# calculate the sample size for bootstrapping
size <-
# run the bootstrap
iterations <- 100
resampled_means_vector <- numeric(iterations)
for (i in seq_len(iterations)) {
  # your code here
  resample <- sample() # fill with appropriate input parameters
  resample_mean <- mean(resample)
  resampled_means_vector[i] <- resample_mean
}
# bootstrap resample results
hist(resampled_means_vector)
mean(resampled_means_vector)
```
## Using the `{boot}` package
The `{boot}` package provides a convenient and fast (optionally parallel) way to calculate bootstrapped estimates. The main arguments of the `boot()` function are as follows:
```
?boot
boot(data,      # The data as a vector, matrix or data frame
     statistic, # A function that takes the data and a vector of indices and returns your statistic
     R)         # The number of bootstrap replicates
```
### Boot estimator function
To use it, you have to first create a function which estimates the statistic of interest. For example, if the statistic of interest were the mean of a vector:
```{r}
my_boot_statistic <- function(data, indices) {
  return(mean(data[indices]))
}
boot_results <- boot(data = NHANES_sleep$SleepHrsNight,
                     statistic = my_boot_statistic,
                     R = 100)
```
Note - the function `my_boot_statistic` has a very specific structure: it takes only the data and a vector of randomly generated indices. You don't create those indices yourself; they are supplied by `boot()`.
The R package `boot` repeatedly calls your estimation function, and each time the bootstrap sample is specified by an integer vector of indices into the original data. This saves on memory because `boot()` only has to generate indices rather than store a copy of the data for every replicate.
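To see what `boot()` does with the indices, here is a small sketch on a made-up vector (`toy_data` and `toy_mean` are illustrative names). Calling the statistic with the indices in their original order gives the plain sample estimate, and a resampled index vector gives one bootstrap replicate:

```{r}
library(boot)
set.seed(42)
toy_data <- c(5, 6, 7, 8, 9, 7, 6) # hypothetical values

toy_mean <- function(data, indices) mean(data[indices])

# Indices in their original order reproduce the ordinary sample mean
toy_mean(toy_data, seq_along(toy_data))

# One bootstrap replicate by hand: indices drawn with replacement
toy_indices <- sample(seq_along(toy_data), replace = TRUE)
toy_indices
toy_mean(toy_data, toy_indices)

# boot() repeats the second step R times and stores the results
toy_boot <- boot(data = toy_data, statistic = toy_mean, R = 200)
toy_boot$t0
dim(toy_boot$t)
```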
### EXERCISE 2
Write an estimator function to use with `boot()` that calculates the correlation between sleep hours and age from the NHANES survey. Hint: use the `cor()` function to get the correlation for two vectors.
```{r}
my_boot_statistic <- function() {
}
```
### Boot result object
Let's look at the `boot_results` object generated as output by the `boot()` function:
```{r}
str(boot_results)
```
- `t0` stores the sample estimate of your statistic (e.g. mean, median, sd, etc.), computed on the original data
- `t` stores the bootstrap estimates of the statistic, one row for each of the `R` replicates
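These two components are enough to reproduce the summary that is printed for a boot object. A sketch on simulated data (the object names are hypothetical): the reported bias is the mean of `t` minus `t0`, and the reported standard error is the standard deviation of `t`.

```{r}
library(boot)
set.seed(99)
sim_ages <- sample(18:80, size = 150, replace = TRUE) # made-up ages

sim_mean_stat <- function(data, indices) mean(data[indices])
sim_boot <- boot(data = sim_ages, statistic = sim_mean_stat, R = 500)

sim_boot$t0                                # estimate from the original data
head(sim_boot$t)                           # first few bootstrap estimates (a matrix with R rows)
sim_bias <- mean(sim_boot$t) - sim_boot$t0 # bootstrap estimate of bias
sim_bias
sim_std_error <- sd(sim_boot$t)            # bootstrap standard error
sim_std_error
sim_boot                                   # printed summary, for comparison
```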
Let's plot the results of the bootstrapped samples' estimates:
```{r}
# method 1
hist(boot_results$t)
# method 2
plot(boot_results)
```
`{boot}` has its own `plot()` method, which plots a histogram of the statistic estimated from each resample and also shows how those estimates compare to a Normal distribution using a quantile-quantile (Q-Q) plot.
Let's see where this Q-Q diagnostic plot might be helpful.
```{r}
hist(NHANES_sleep$Age)
# estimate mean age with bootstrap resamples
my_boot_statistic <- function(data, indices) {
  return(mean(data[indices]))
}
# bootstrap with 25 resamples
boot_results <- boot(data = NHANES_sleep$Age,
                     statistic = my_boot_statistic,
                     R = 25)
plot(boot_results)
```
```{r}
# bootstrap with 250 resamples
boot_results <- boot(data = NHANES_sleep$Age,
                     statistic = my_boot_statistic,
                     R = 250)
plot(boot_results)
```
A "good" number of resamples (`R`) depends on the structure of the underlying data and the original data size. The diagnostic plots help you make sure that you have enough resamples to make reasonable estimates of your statistic. There is no limit to the number of resamples - other than computing power and time. Of course, at some point increasing the number of resamples further does not significantly impact your estimate.
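One way to check whether `R` is large enough is to rerun the bootstrap a few times for each `R` and see how much the bootstrap standard error itself moves around. A sketch on simulated skewed data (all names and values are made up):

```{r}
library(boot)
set.seed(11)
sim_skewed <- rexp(200, rate = 1 / 40) # hypothetical right-skewed data

sim_mean_stat <- function(data, indices) mean(data[indices])

# Standard error from one bootstrap run with a given number of resamples
boot_se <- function(R) sd(boot(sim_skewed, sim_mean_stat, R = R)$t)

# Repeat each setting 20 times and measure how variable the answer is
se_spread <- sapply(c(25, 250, 2500), function(R) sd(replicate(20, boot_se(R))))
names(se_spread) <- c("R=25", "R=250", "R=2500")
se_spread
```

The spread should shrink as `R` grows; once it is small relative to the precision you need, more resamples add little.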
### EXERCISE 3
Let's use the `iris` dataset this time. Compute a bootstrapped estimate of the correlation coefficient between `Sepal.Length` and `Petal.Length` for species "setosa". Use the diagnostic plots to make sure you have enough resamples.
```{r}
View(iris)
names(iris)
```
Hint: Break the problem down into steps. Use the `cor()` function to compute a correlation coefficient.
```{r}
# step 1 - define your statistic of interest
my_boot_statistic <- function() {
}
# step 2 - boot strap with R resamples
boot_results <- boot(data = ,
statistic = ,
R = )
# step 3 - plot the diagnostic plots
plot(boot_results)
# step 4 - change the R value if needed and plot results again
boot_results <- boot(data = ,
statistic = ,
R = )
plot(boot_results)
```
### Confidence intervals with `boot.ci()`
`{boot}` contains a very convenient function called `boot.ci()` to calculate confidence intervals for your statistic.
Since we already have the resamples saved in `boot_results`, let's compute 95% confidence intervals for the correlation coefficient for `Sepal.Length` and `Petal.Length` for the species "setosa".
```{r}
boot.ci(boot.out = boot_results, # the boot function output
        conf = 0.95,             # (1-alpha) or level of confidence
        type = "all")            # method(s) chosen to calculate the C.I.
```
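Note - For studentized confidence intervals to work, the statistic function needs to return both the statistic and an estimate of its variance. Without the variance, `type = "all"` leaves out the studentized interval and issues a warning.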
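As a sketch of what that looks like for a sample mean (simulated data, hypothetical names): the statistic returns the mean first and an estimate of its variance second, here the usual `var(x) / n`. By default `boot.ci()` takes the second element as the variance.

```{r}
library(boot)
set.seed(5)
sim_hours <- rnorm(120, mean = 7, sd = 1.3) # made-up data

mean_with_variance <- function(data, indices) {
  x <- data[indices]
  c(mean(x), var(x) / length(x)) # estimate, then estimated variance of the estimate
}

sim_boot_var <- boot(data = sim_hours, statistic = mean_with_variance, R = 999)

# Studentized interval, using the variance returned as the second element
sim_ci_stud <- boot.ci(sim_boot_var, conf = 0.95, type = "stud")
sim_ci_stud
```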
You can also specify a single method, e.g. "perc" for the percentile method.
```{r}
ci <- boot.ci(boot.out = boot_results, # the boot function output
              conf = 0.95,             # (1-alpha) or level of confidence
              type = "perc")           # method chosen to calculate the C.I.
str(ci)
```
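The percentile method is simple enough to check by hand: it takes the matching quantiles of the bootstrap estimates. A sketch on simulated data (hypothetical names); the two results agree closely but not exactly, because `boot.ci()` interpolates between order statistics in its own way:

```{r}
library(boot)
set.seed(21)
sim_x <- rnorm(100)
sim_df <- data.frame(x = sim_x, y = 0.5 * sim_x + rnorm(100)) # made-up correlated data

sim_cor_stat <- function(data, indices) {
  resampled <- data[indices, ] # resample whole rows so x and y stay paired
  cor(resampled$x, resampled$y)
}

sim_cor_boot <- boot(data = sim_df, statistic = sim_cor_stat, R = 1000)

sim_ci <- boot.ci(sim_cor_boot, conf = 0.95, type = "perc")
sim_ci$percent[4:5]                                            # lower and upper endpoints from boot.ci()
sim_by_hand <- quantile(sim_cor_boot$t, probs = c(0.025, 0.975)) # the same idea computed directly
sim_by_hand
```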
### EXERCISE 4
Estimate the mean and 90% confidence intervals for the *difference* in sleep hours between male and female respondents in the `NHANES_sleep` dataset.
```{r}
# step 1 - define your statistic of interest
my_boot_statistic <- function(){
}
# step 2 - get the boot result object
# step 3 - calculate the confidence intervals using the boot results object
```
## When not to bootstrap
The exercise above shows you how to calculate more complicated statistics with bootstrap. This is the power of bootstrapping - you can tailor your function to mimic your experiments or data generation processes very closely!
However, there are several scenarios where the bootstrap procedure can fail:
1. Very small samples: with fewer than about 10 observations, bootstrap estimates are generally not reliable.
2. Distributions that have infinite second moments (e.g. a heavy-tailed Zipf distribution).
3. Estimates of extreme values, such as the minimum or maximum.
4. Data from unstable AR (auto-regressive) processes.
Let's see an example where we attempt to bootstrap the minimum value of sleep hours in `NHANES_sleep`:
```{r}
hist(NHANES_sleep$SleepHrsNight)
# estimate the minimum sleep hours with bootstrap resamples
my_boot_statistic <- function(data, indices) {
  resampled_data <- data[indices]
  return(min(resampled_data))
}
# bootstrap with 100 resamples
boot_results <- boot(data = NHANES_sleep$SleepHrsNight,
                     statistic = my_boot_statistic,
                     R = 100)
plot(boot_results)
```
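As you can see, the diagnostic plots show that bootstrapping did not work well for estimating the minimum: a resample can never contain a value smaller than the sample minimum, so the bootstrap estimates pile up on just a few values instead of spreading out like a sampling distribution.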
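One reason the minimum behaves badly can be checked numerically. In this sketch on simulated continuous data (names and parameters are invented), no bootstrap minimum falls below the sample minimum, and a large share of replicates simply reproduce it, since any resample that happens to include the smallest observation returns that value:

```{r}
library(boot)
set.seed(3)
sim_sample <- rnorm(50, mean = 7, sd = 1.3) # hypothetical data

sim_min_stat <- function(data, indices) min(data[indices])
sim_min_boot <- boot(data = sim_sample, statistic = sim_min_stat, R = 1000)

# A resample can only contain observed values, so it never goes below the sample minimum
never_below <- all(sim_min_boot$t >= sim_min_boot$t0)
never_below

# Share of replicates that are exactly equal to the sample minimum
share_equal <- mean(sim_min_boot$t == sim_min_boot$t0)
share_equal

# Number of distinct values among the 1000 bootstrap minima
length(unique(as.vector(sim_min_boot$t)))
```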
## Answers to exercises
### EXERCISE 1
Provide an estimate of the mean age of female respondents in the NHANES survey by bootstrap resampling 100 times
```{r}
# subset the data for female respondents
gender_f <- NHANES_sleep[NHANES_sleep$Gender=="female",]
# calculate the sample size for bootstrapping
size <- nrow(gender_f)
# run the bootstrap
iterations <- 100
resampled_means_vector <- numeric(iterations)
for (i in seq_len(iterations)) {
  # resample the ages with replacement, keeping the original sample size
  resample <- sample(gender_f$Age, size, replace = TRUE)
  resample_mean <- mean(resample)
  resampled_means_vector[i] <- resample_mean
}
# bootstrap resample results
hist(resampled_means_vector)
mean(resampled_means_vector)
```
### EXERCISE 2
Write an estimator function to use with `boot()` that calculates the correlation between sleep hours and age from the NHANES survey. Hint: use the `cor()` function.
```{r}
my_boot_statistic <- function(data, indices) {
  resampled_data <- data[indices,]
  corr_value <- cor(resampled_data$SleepHrsNight, resampled_data$Age)
  return(corr_value)
}
# Bonus: let's see this function in action
results <- boot(data = NHANES_sleep,
                statistic = my_boot_statistic,
                R = 100)
results
```
### EXERCISE 3
Let's use the `iris` dataset this time. Compute a bootstrapped estimate of the correlation coefficient between `Sepal.Length` and `Petal.Length` for the species "setosa".
```{r}
# step 1 - define your statistic of interest - we did something similar in exercise 2
my_boot_statistic <- function(data, indices) {
  resampled_data <- data[indices,]
  corr_value <- cor(resampled_data$Sepal.Length, resampled_data$Petal.Length)
  return(corr_value)
}
# step 2 - bootstrap with R resamples
boot_results <- boot(data = iris[iris$Species=="setosa",],
                     statistic = my_boot_statistic,
                     R = 100)
# step 3 - plot the diagnostic plots
plot(boot_results)
boot_results
# step 4 - change the R value if needed and plot results again
boot_results <- boot(data = iris[iris$Species=="setosa",],
                     statistic = my_boot_statistic,
                     R = 1000)
plot(boot_results)
boot_results
```
### EXERCISE 4
Estimate the mean and 90% confidence intervals for the *difference* in sleep hours between male and female respondents in the `NHANES_sleep` dataset.
```{r}
# step 1 - define your statistic of interest
my_boot_statistic <- function(data, indices){
  resampled_data <- data[indices,]
  mean_sleephrs_male <- mean(resampled_data[resampled_data$Gender=="male",]$SleepHrsNight)
  mean_sleephrs_female <- mean(resampled_data[resampled_data$Gender=="female",]$SleepHrsNight)
  sleephrs_diff <- mean_sleephrs_male - mean_sleephrs_female # named to avoid shadowing base::diff()
  return(sleephrs_diff)
}
# step 2 - get the boot result object
boot_results <- boot(NHANES_sleep, my_boot_statistic, R = 1000)
boot_results$t0 # estimated mean difference (male - female) in the original sample
# step 3 - calculate the confidence intervals using the boot results object
boot.ci(boot_results, conf = 0.90, type = "perc")
```