-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-sim.qmd
847 lines (591 loc) · 30.5 KB
/
09-sim.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
# Probability & Simulation {#sim .incomplete-chapter}
<div class="right meme"><img src="images/memes/sim.jpg"
alt="Morpheus from The Matrix. Top text: What if I told you; Bottom text: Ur in a simulation, inside a simulation, inside another simulation, inside a taco, inside a taco cat, inside a Taco Bell, inside an Arby's. Inside another simulation" /></div>
## Intended Learning Outcomes {#sec-ilo-sim - .ilo}
- [ ] Generate and plot data randomly sampled from common distributions
- [ ] Generate related variables from a multivariate distribution
- [ ] Define the following statistical terms: [p-value](#p-value), [alpha](#alpha), [power](#power), smallest effect size of interest ([SESOI](#sesoi)), [false positive](#false-pos) (type I error), [false negative](#false-neg) (type II error), confidence interval ([CI](#conf-inf))
- [ ] Test sampled distributions against a null hypothesis
- [ ] Calculate power using iteration and a sampling function
## Setup {#sec-setup-sim -}
1. Open your `reprores` project
1. Create a new quarto file called `09-sim.qmd`
1. Update the YAML header
1. Replace the setup chunk with the one below:
```{r}
#| echo: fenced
#| label: setup
#| include: false
library(tidyverse) # for data wrangling
# devtools::install_github("debruine/faux")
library(faux) # data simulation
library(plotly) # create a 3D plot to visualise correlations
# MASS::mvrnorm() is used without loading MASS
set.seed(8675309) # makes sure random numbers are reproducible
```
Simulating data is a very powerful way to test your understanding of statistical concepts. We are going to use `r glossary("simulation", "simulations")` to learn the basics of `r glossary("probability")`.
## Univariate Distributions
First, we need to understand some different ways data might be distributed and how to simulate data from these distributions. A `r glossary("univariate")` distribution is the distribution of a single variable.
### Uniform Distribution {#uniform}
The `r glossary("uniform distribution")` is the simplest distribution. All numbers in the range have an equal probability of being sampled.
::: {.try data-latex=""}
Take a minute to think of things in your own research that are uniformly distributed.
:::
#### Continuous distribution
`runif(n, min=0, max=1)`
Use `runif()` to sample from a continuous uniform distribution.
```{r runif}
u <- runif(100000, min = 0, max = 1)
# plot to visualise
ggplot() +
geom_histogram(aes(u), binwidth = 0.05, boundary = 0,
fill = "white", colour = "black")
```
#### Discrete
`sample(x, size, replace = FALSE, prob = NULL)`
Use `sample()` to sample from a `r glossary("discrete")` distribution.
You can use `sample()` to simulate events like rolling dice or choosing from a deck of cards. The code below simulates rolling a 6-sided die 10000 times. We set `replace` to `TRUE` so that each event is independent. See what happens if you set `replace` to `FALSE`.
```{r sample-replace, fig.cap = "Distribution of dice rolls."}
rolls <- sample(1:6, 10000, replace = TRUE)
# plot the results
ggplot() +
geom_histogram(aes(rolls), binwidth = 1,
fill = "white", color = "black")
```
You can also use sample to sample from a list of named outcomes.
```{r sample-list}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
sample(pet_types, 10, replace = TRUE)
```
Ferrets are a much less common pet than cats and dogs, so our sample isn't very realistic. You can set the probabilities of each item in the list with the `prob` argument.
```{r sample-prob}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob <- c(0.3, 0.4, 0.1, 0.1, 0.1)
sample(pet_types, 10, replace = TRUE, prob = pet_prob)
```
### Binomial Distribution {#binomial}
The `r glossary("binomial distribution")` is useful for modelling binary data, where each observation can have one of two outcomes, like success/failure, yes/no or head/tails.
`rbinom(n, size, prob)`
The `rbinom` function will generate a random binomial distribution.
* `n` = number of observations
* `size` = number of trials
* `prob` = probability of success on each trial
Coin flips are a typical example of a binomial distribution, where we can assign heads to 1 and tails to 0.
```{r rbinom-fair}
# 20 individual coin flips of a fair coin
rbinom(20, 1, 0.5)
```
```{r rbinom-bias}
# 20 individual coin flips of a baised (0.75) coin
rbinom(20, 1, 0.75)
```
You can generate the total number of heads in 1 set of 20 coin flips by setting `size` to 20 and `n` to 1.
```{r rbinom-size}
rbinom(1, 20, 0.75)
```
You can generate more sets of 20 coin flips by increasing the `n`.
```{r rbinom-n}
rbinom(10, 20, 0.5)
```
You should always check your randomly generated data to check that it makes sense. For large samples, it's easiest to do that graphically. A histogram is usually the best choice for plotting binomial data.
```{r sim_flips}
flips <- rbinom(1000, 20, 0.5)
ggplot() +
geom_histogram(
aes(flips),
binwidth = 1,
fill = "white",
color = "black"
)
```
::: {.try data-latex=""}
Run the simulation above several times, noting how the histogram changes. Try changing the values of `n`, `size`, and `prob`.
:::
### Normal Distribution {#normal}
`rnorm(n, mean, sd)`
We can simulate a `r glossary("normal distribution")` of size `n` if we know the `mean` and standard deviation (`sd`). A density plot is usually the best way to visualise this type of data if your `n` is large.
```{r rnorm}
dv <- rnorm(1e5, 10, 2)
# proportions of normally-distributed data
# within 1, 2, or 3 SD of the mean
sd1 <- .6827
sd2 <- .9545
sd3 <- .9973
ggplot() +
geom_density(aes(dv), fill = "white") +
geom_vline(xintercept = mean(dv), color = "red") +
geom_vline(xintercept = quantile(dv, .5 - sd1/2), color = "darkgreen") +
geom_vline(xintercept = quantile(dv, .5 + sd1/2), color = "darkgreen") +
geom_vline(xintercept = quantile(dv, .5 - sd2/2), color = "blue") +
geom_vline(xintercept = quantile(dv, .5 + sd2/2), color = "blue") +
geom_vline(xintercept = quantile(dv, .5 - sd3/2), color = "purple") +
geom_vline(xintercept = quantile(dv, .5 + sd3/2), color = "purple") +
scale_x_continuous(
limits = c(0,20),
breaks = seq(0,20)
)
```
::: {.info data-latex=""}
Run the simulation above several times, noting how the density plot changes. What do the vertical lines represent? Try changing the values of `n`, `mean`, and `sd`.
:::
### Poisson Distribution {#poisson}
The `r glossary("Poisson distribution")` is useful for modelling events, like how many times something happens over a unit of time, as long as the events are independent (e.g., an event having happened in one time period doesn't make it more or less likely to happen in the next).
`rpois(n, lambda)`
The `rpois` function will generate a random Poisson distribution.
* `n` = number of observations
* `lambda` = the mean number of events per observation
Let's say we want to model how many texts you get each day for a whole. You know that you get an average of 20 texts per day. So we set `n = 365` and `lambda = 20`. Lambda is a `r glossary("parameter")` that describes the Poisson distribution, just like mean and standard deviation are parameters that describe the normal distribution.
```{r rpois}
texts <- rpois(n = 365, lambda = 20)
ggplot() +
geom_histogram(
aes(texts),
binwidth = 1,
fill = "white",
color = "black"
)
```
So we can see that over a year, you're unlikely to get fewer than 5 texts in a day, or more than 35 (although it's not impossible).
## Multivariate Distributions {#mvdist}
### Bivariate Normal {#bvn}
A `r glossary("bivariate normal")` distribution is two normally distributed vectors that have a specified relationship, or `r glossary("correlation")` to each other.
What if we want to sample from a population with specific relationships between variables? We can sample from a bivariate normal distribution using `mvrnorm()` from the `MASS` package.
::: {.warning data-latex=""}
Don't load MASS with the `library()` function because it will create a conflict with the `select()` function from dplyr and you will always need to preface it with `dplyr::`. Just use `MASS::mvrnorm()`.
:::
You need to know how many observations you want to simulate (`n`) the means of the two variables (`mu`) and you need to calculate a `r glossary("covariance matrix")` (`sigma`) from the correlation between the variables (`rho`) and their standard deviations (`sd`).
```{r bvn}
n <- 1000 # number of random samples
# name the mu values to give the resulting columns names
mu <- c(x = 10, y = 20) # the means of the samples
sd <- c(5, 6) # the SDs of the samples
rho <- 0.5 # population correlation between the two variables
# correlation matrix
cor_mat <- matrix(c( 1, rho,
rho, 1), 2)
# create the covariance matrix
sigma <- (sd %*% t(sd)) * cor_mat
# sample from bivariate normal distribution
bvn <- MASS::mvrnorm(n, mu, sigma)
```
Plot your sampled variables to check everything worked like you expect. It's easiest to convert the output of `mvnorm` into a tibble in order to use it in ggplot.
```{r graph-bvn}
bvn |>
as_tibble() |>
ggplot(aes(x, y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
geom_density2d()
```
### Multivariate Normal {#mvnorm}
You can generate more than 2 correlated variables, but it gets a little trickier to create the correlation matrix.
```{r mvn}
n <- 200 # number of random samples
mu <- c(x = 10, y = 20, z = 30) # the means of the samples
sd <- c(8, 9, 10) # the SDs of the samples
rho1_2 <- 0.5 # correlation between x and y
rho1_3 <- 0 # correlation between x and z
rho2_3 <- 0.7 # correlation between y and z
# correlation matrix
cor_mat <- matrix(c( 1, rho1_2, rho1_3,
rho1_2, 1, rho2_3,
rho1_3, rho2_3, 1), 3)
sigma <- (sd %*% t(sd)) * cor_mat
bvn3 <- MASS::mvrnorm(n, mu, sigma)
cor(bvn3) # check correlation matrix
```
You can use the `plotly` library to make a 3D graph.
```{r 3d-graph-mvn}
#set up the marker style
marker_style = list(
color = "#ff0000",
line = list(
color = "#444",
width = 1
),
opacity = 0.5,
size = 5
)
# convert bvn3 to a tibble, plot and add markers
bvn3 |>
as_tibble() |>
plot_ly(x = ~x, y = ~y, z = ~z, marker = marker_style) |>
add_markers()
```
### Faux {#sec-faux}
Alternatively, you can use the package [faux](https://debruine.github.io/faux/) to generate any number of correlated variables. It also has a function for checking the parameters of your new simulated data (`check_sim_stats()`).
```{r faux-rnorm-multi}
bvn3 <- rnorm_multi(
n = n,
vars = 3,
mu = mu,
sd = sd,
r = c(rho1_2, rho1_3, rho2_3),
varnames = c("x", "y", "z")
)
check_sim_stats(bvn3)
```
You can also use faux to simulate data for factorial designs. Set up the between-subject and within-subject factors as lists with the levels as (named) vectors. Means and standard deviations can be included as vectors or data frames. The function calculates sigma for you, structures your dataset, and outputs a plot of the design.
```{r faux-sim-design}
b <- list(pet = c(cat = "Cat Owners",
dog = "Dog Owners"))
w <- list(time = c("morning",
"noon",
"night"))
mu <- data.frame(
cat = c(10, 12, 14),
dog = c(10, 15, 20),
row.names = w$time
)
sd <- c(3, 3, 3, 5, 5, 5)
pet_data <- sim_design(
within = w,
between = b,
n = 100,
mu = mu,
sd = sd,
r = .5)
```
You can use the `check_sim_stats()` function, but you need to set the argument `between` to a vector of all the between-subject factor columns.
```{r}
check_sim_stats(pet_data, between = "pet")
```
See the [faux website](https://debruine.github.io/faux/) for more detailed tutorials.
## Statistical terms {#stat-terms}
Let's review some important statistical terms before we review tests of distributions.
### Effect {#effect}
The `r glossary("effect")` is some measure of your data. This will depend on the type of data you have and the type of statistical test you are using. For example, if you flipped a coin 100 times and it landed heads 66 times, the effect would be 66/100. You can then use the exact binomial test to compare this effect to the `r glossary("null effect")` you would expect from a fair coin (50/100) or to any other effect you choose. The `r glossary("effect size")` refers to the difference between the effect in your data and the null effect (usually a chance value).
<div class="right meme"><img src="images/memes/p-value.jpg"
alt="Top text: Don't know what a p-value is, Bottom text: at this point I'm too afraid to ask"/></div>
### P-value {#p-value}
The `r glossary("p-value")` of a test is the probability of seeing an effect at least as extreme as what you have, if the real effect was the value you are testing against (e.g., a null effect). So if you used a binomial test to test against a chance probability of 1/6 (e.g., the probability of rolling 1 with a 6-sided die), then a p-value of 0.17 means that you could expect to see effects at least as extreme as your data 17% of the time just by chance alone.
### Alpha {#alpha}
If you are using null hypothesis significance testing (`r glossary("NHST")`), then you need to decide on a cutoff value (`r glossary("alpha")`) for making a decision to reject the null hypothesis. We call p-values below the alpha cutoff `r glossary("significant")`. In psychology, alpha is traditionally set at 0.05, but there are good arguments for [setting a different criterion in some circumstances](http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html).
### False Positive/Negative {#false-pos}
The probability that a test concludes there is an effect when there is really no effect (e.g., concludes a fair coin is biased) is called the `r glossary("false positive")` rate (or `r glossary("Type I Error")` Rate). The `r glossary("alpha")` is the false positive rate we accept for a test. The probability that a test concludes there is no effect when there really is one (e.g., concludes a biased coin is fair) is called the `r glossary("false negative")` rate (or `r glossary("Type II Error")` Rate). The `r glossary("beta")` is the false negative rate we accept for a test.
::: {.info data-latex=""}
The false positive rate is not the overall probability of getting a false positive, but the probability of a false positive *under the null hypothesis*. Similarly, the false negative rate is the probability of a false negative *under the alternative hypothesis*. Unless we know the probability that we are testing a null effect, we can't say anything about the overall probability of false positives or negatives. If 100% of the hypotheses we test are false, then all significant effects are false positives, but if all of the hypotheses we test are true, then all of the positives are true positives and the overall false positive rate is 0.
:::
### Power and SESOI {#power}
`r glossary("Power")` is equal to 1 minus beta (i.e., the `r glossary("true positive")` rate), and depends on the effect size, how many samples we take (n), and what we set alpha to. For any test, if you specify all but one of these values, you can calculate the last. The effect size you use in power calculations should be the smallest effect size of interest (`r glossary("SESOI")`). See [@TOSTtutorial](https://doi.org/10.1177/2515245918770963) for a tutorial on methods for choosing an SESOI.
::: {.try data-latex=""}
Let's say you want to be able to detect at least a 15% difference from chance (50%) in a coin's fairness, and you want your test to have a 5% chance of false positives and a 10% chance of false negatives. What are the following values?
* alpha = `r fitb(c("0.05", ".05", "5%"))`
* beta = `r fitb(c("0.1", "0.10", ".1", ".10", "10%"))`
* false positive rate = `r fitb(c("0.05", ".05", "5%"))`
* false negative rate = `r fitb(c("0.1", "0.10", ".1", ".10", "10%"))`
* power = `r fitb(c("0.9", "0.90", ".9", ".90", "90%"))`
* SESOI = `r fitb(c("0.15", ".15", "15%"))`
:::
### Confidence Intervals {#conf-int}
The `r glossary("confidence interval")` is a range around some value (such as a mean) that has some probability of containing the parameter, if you repeated the process many times. Traditionally in psychology, we use 95% confidence intervals, but you can calculate CIs for any percentage.
::: {.info data-latex=""}
A 95% CI does *not* mean that there is a 95% probability that the true mean lies within this range, but that, if you repeated the study many times and calculated the CI this same way every time, you'd expect the true mean to be inside the CI in 95% of the studies. This seems like a subtle distinction, but can lead to some misunderstandings. See [@Morey2016](https://link.springer.com/article/10.3758/s13423-015-0947-8) for more detailed discussion.
:::
## Tests
### Exact binomial test {#exact-binom}
`binom.test(x, n, p)`
You can test a binomial distribution against a specific probability using the exact binomial test.
* `x` = the number of successes
* `n` = the number of trials
* `p` = hypothesised probability of success
Here we can test a series of 10 coin flips from a fair coin and a biased coin against the hypothesised probability of 0.5 (even odds).
```{r binom-test}
n <- 10
fair_coin <- rbinom(1, n, 0.5)
biased_coin <- rbinom(1, n, 0.6)
binom.test(fair_coin, n, p = 0.5)
binom.test(biased_coin, n, p = 0.5)
```
::: {.info data-latex=""}
Run the code above several times, noting the p-values for the fair and biased coins. Alternatively, you can [simulate coin flips](http://shiny.psy.gla.ac.uk/debruine/coinsim/) online and build up a graph of results and p-values.
* How does the p-value vary for the fair and biased coins?
* What happens to the confidence intervals if you increase n from 10 to 100?
* What criterion would you use to tell if the observed data indicate the coin is fair or biased?
* How often do you conclude the fair coin is biased (false positives)?
* How often do you conclude the biased coin is fair (false negatives)?
:::
#### Sampling function {#sampling-binom}
To estimate these rates, we need to repeat the sampling above many times. A `r glossary("function")` is ideal for repeating the exact same procedure over and over. Set the arguments of the function to variables that you might want to change. Here, we will want to estimate power for:
* different sample sizes (`n`)
* different effects (`bias`)
* different hypothesised probabilities (`p`, defaults to 0.5)
```{r sim_binom_test}
sim_binom_test <- function(n, bias, p = 0.5) {
# simulate 1 coin flip n times with the specified bias
coin <- rbinom(1, n, bias)
# run a binomial test on the simulated data for the specified p
btest <- binom.test(coin, n, p)
# return the p-value of this test
btest$p.value
}
```
Once you've created your function, test it a few times, changing the values.
```{r sbt-2}
sim_binom_test(100, 0.6)
```
#### Calculate power {#calc-power-binom}
Then you can use the `replicate()` function to run it many times and save all the output values. You can calculate the `r glossary("power")` of your analysis by checking the proportion of your simulated analyses that have a p-value less than your `r glossary("alpha")` (the probability of rejecting the null hypothesis when the null hypothesis is true).
```{r replicate-binom}
my_reps <- replicate(1e4, sim_binom_test(100, 0.6))
alpha <- 0.05 # this does not always have to be 0.05
mean(my_reps < alpha)
```
::: {.info data-latex=""}
`1e4` is just scientific notation for a 1 followed by 4 zeros (`10000`). When you're running simulations, you usually want to run a lot of them. It's a pain to keep track of whether you've typed 5 or 6 zeros (100000 vs 1000000) and this will change your running time by an order of magnitude.
:::
You can plot the distribution of p-values.
```{r plot-reps-binom}
ggplot() +
geom_histogram(
aes(my_reps),
binwidth = 0.05,
boundary = 0,
fill = "white",
color = "black"
)
```
### T-test {#sec-t-test}
`t.test(x, y, alternative, mu, paired)`
Use a t-test to compare the mean of one distribution to a null hypothesis (one-sample t-test), compare the means of two samples (independent-samples t-test), or compare pairs of values (paired-samples t-test).
You can run a one-sample t-test comparing the mean of your data to `mu`. Here is a simulated distribution with a mean of 0.5 and an SD of 1, creating an effect size of 0.5 SD when tested against a `mu` of 0. Run the simulation a few times to see how often the t-test returns a significant p-value (or run it in the [shiny app](http://shiny.psy.gla.ac.uk/debruine/normsim/)).
```{r sim-norm}
sim_norm <- rnorm(100, 0.5, 1)
t.test(sim_norm, mu = 0)
```
Run an independent-samples t-test by comparing two lists of values.
```{r t-test}
a <- rnorm(100, 0.5, 1)
b <- rnorm(100, 0.7, 1)
t_ind <- t.test(a, b, paired = FALSE)
t_ind
```
::: {.warning data-latex=""}
The `paired` argument defaults to `FALSE`, but it's good practice to always explicitly set it so you are never confused about what type of test you are performing.
:::
#### Sampling function {#sampling-t}
We can use the `names()` function to find out the names of all the t.test parameters and use this to just get one type of data, like the test statistic (e.g., t-value).
```{r names}
names(t_ind)
t_ind$statistic
```
If you want to run the simulation many times and record information each time, first you need to turn your simulation into a function.
```{r sim_t_ind}
sim_t_ind <- function(n, m1, sd1, m2, sd2) {
# simulate v1
v1 <- rnorm(n, m1, sd1)
#simulate v2
v2 <- rnorm(n, m2, sd2)
# compare using an independent samples t-test
t_ind <- t.test(v1, v2, paired = FALSE)
# return the p-value
return(t_ind$p.value)
}
```
Run it a few times to check that it gives you sensible values.
```{r run-sim_t_ind}
sim_t_ind(100, 0.7, 1, 0.5, 1)
```
#### Calculate power {#calc-power-t}
Now replicate the simulation 1000 times.
```{r reps-t}
my_reps <- replicate(1e4, sim_t_ind(100, 0.7, 1, 0.5, 1))
alpha <- 0.05
power <- mean(my_reps < alpha)
power
```
::: {.try data-latex=""}
Run the code above several times. How much does the power value fluctuate? How many replications do you need to run to get a reliable estimate of power?
:::
Compare your power estimate from simluation to a power calculation using `power.t.test()`. Here, `delta` is the difference between `m1` and `m2` above.
```{r power.t.test}
power.t.test(n = 100,
delta = 0.2,
sd = 1,
sig.level = alpha,
type = "two.sample")
```
You can plot the distribution of p-values.
```{r plot-reps-t}
ggplot() +
geom_histogram(
aes(my_reps),
binwidth = 0.05,
boundary = 0,
fill = "white",
color = "black"
)
```
::: {.try data-latex=""}
What do you think the distribution of p-values is
when there is no effect (i.e., the means are identical)? Check this yourself.
:::
::: {.warning data-latex=""}
Make sure the `boundary` argument is set to `0` for p-value histograms. See what happens with a null effect if `boundary` is not set.
:::
### Correlation {#correlation}
You can test if continuous variables are related to each other using the `cor()` function. Let's use `rnorm_multi()` to make a quick table of correlated values.
```{r cor}
dat <- rnorm_multi(
n = 100,
vars = 2,
r = -0.5,
varnames = c("x", "y")
)
cor(dat$x, dat$y)
```
::: {.try data-latex=""}
Set `n` to a large number like 1e6 so that the correlations are less affected by chance. Change the value of the **mean** for `a`, `x`, or `y`. Does it change the correlation between `x` and `y`? What happens when you increase or decrease the **sd**? Can you work out any rules here?
:::
`cor()` defaults to Pearson's correlations. Set the `method` argument to use Kendall or Spearman correlations.
```{r cor-spearman}
cor(dat$x, dat$y, method = "spearman")
```
#### Sampling function {#sampling-cor}
Create a function that creates two variables with `n` observations and `r` correlation. Use the function `cor.test()` to give you p-values for the correlation.
```{r sim-cor-test}
sim_cor_test <- function(n = 100, r = 0) {
dat <- rnorm_multi(
n = n,
vars = 2,
r = r,
varnames = c("x", "y")
)
ctest <- cor.test(dat$x, dat$y)
ctest$p.value
}
```
Once you've created your function, test it a few times, changing the values.
```{r sim-cor-test-run}
sim_cor_test(50, .5)
```
#### Calculate power {#calc-power-cor}
Now replicate the simulation 1000 times.
```{r reps-cor}
my_reps <- replicate(1e4, sim_cor_test(50, 0.5))
alpha <- 0.05
power <- mean(my_reps < alpha)
power
```
Compare to the value calcuated by the pwr package.
```{r pwr-cor}
pwr::pwr.r.test(n = 50, r = 0.5)
```
## Example
This example uses the [Growth Chart Data Tables](https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv) from the [US CDC](https://www.cdc.gov/growthcharts/zscore.htm). The data consist of height in centimeters for the z-scores of –2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, and 2 by sex (1=male; 2=female) and half-month of age (from 24.0 to 240.5 months).
### Load & wrangle
We have to do a little data wrangling first. Have a look at the data after you import it and relabel `Sex` to `male` and `female` instead of `1` and `2`. Also convert `Agemos` (age in months) to years. Relabel the column `0` as `mean` and calculate a new column named `sd` as the difference between columns `1` and `0`.
```{r load-height-data}
orig_height_age <- read_csv("https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv")
height_age <- orig_height_age |>
filter(Sex %in% c(1,2)) |>
mutate(
sex = recode(Sex, "1" = "male", "2" = "female"),
age = as.numeric(Agemos)/12,
sd = `1` - `0`
) |>
select(sex, age, mean = `0`, sd)
```
### Plot
Plot your new data frame to see how mean height changes with age for boys and girls.
```{r plot-height-means}
ggplot(height_age, aes(age, mean, color = sex)) +
geom_smooth(aes(ymin = mean - sd,
ymax = mean + sd),
stat="identity")
```
### Simulate a population
Simulate 50 random male heights and 50 random female heights for 20-year-olds using the `rnorm()` function and the means and SDs from the `height_age` table. Plot the data.
```{r sim-height-20}
age_filter <- 20
m <- filter(height_age, age == age_filter, sex == "male")
f <- filter(height_age, age == age_filter, sex == "female")
sim_height <- tibble(
male = rnorm(50, m$mean, m$sd),
female = rnorm(50, f$mean, f$sd)
) |>
gather("sex", "height", male:female)
ggplot(sim_height) +
geom_density(aes(height, fill = sex), alpha = 0.5) +
xlim(125, 225)
```
::: {.try data-latex=""}
Run the simulation above several times, noting how the density plot changes. Try changing the age you're simulating.
:::
### Analyse simulated data
Use the `sim_t_ind(n, m1, sd1, m2, sd2)` function we created above to generate one simulation with a sample size of 50 in each group using the means and SDs of male and female 14-year-olds.
```{r subset-height-14}
age_filter <- 14
m <- filter(height_age, age == age_filter, sex == "male")
f <- filter(height_age, age == age_filter, sex == "female")
sim_t_ind(50, m$mean, m$sd, f$mean, f$sd)
```
### Replicate simulation
Now replicate this 1e4 times using the `replicate()` function. This function will save the returned p-values in a list (`my_reps`). We can then check what proportion of those p-values are less than our alpha value. This is the power of our test.
```{r rep-height-14}
my_reps <- replicate(1e4, sim_t_ind(50, m$mean, m$sd, f$mean, f$sd))
alpha <- 0.05
power <- mean(my_reps < alpha)
power
```
### One-tailed prediction
This design has about 65% power to detect the sex difference in height (with a 2-tailed test). Modify the `sim_t_ind` function for a 1-tailed prediction.
You could just set `alternative` equal to "greater" in the function, but it might be better to add the `alt` argument to your function (giving it the same default value as `t.test`) and change the value of `alternative` in the function to `alt`.
```{r add-1tailed}
sim_t_ind <- function(n, m1, sd1, m2, sd2, alt = "two.sided") {
v1 <- rnorm(n, m1, sd1)
v2 <- rnorm(n, m2, sd2)
t_ind <- t.test(v1, v2, paired = FALSE, alternative = alt)
return(t_ind$p.value)
}
alpha <- 0.05
my_reps <- replicate(1e4, sim_t_ind(50, m$mean, m$sd, f$mean, f$sd, "greater"))
mean(my_reps < alpha)
```
### Range of sample sizes
What if we want to find out what sample size will give us 80% power? We can try trial and error. We know the number should be slightly larger than 50. But you can search more systematically by repeating your power calculation for a range of sample sizes.
::: {.info data-latex=""}
This might seem like overkill for a t-test, where you can easily look up sample size calculators online, but it is a valuable skill to learn for when your analyses become more complicated.
:::
Start with a relatively low number of replications and/or more spread-out samples to estimate where you should be looking more specifically. Then you can repeat with a narrower/denser range of sample sizes and more iterations.
```{r range-sample-sizes}
# make another custom function to return power
pwr_func <- function(n, reps = 100, alpha = 0.05) {
ps <- replicate(reps, sim_t_ind(n, m$mean, m$sd, f$mean, f$sd, "greater"))
mean(ps < alpha)
}
# make a table of the n values you want to check
power_table <- tibble(
n = seq(20, 100, by = 5)
) |>
# run the power function for each n
mutate(power = map_dbl(n, pwr_func))
# plot the results
ggplot(power_table, aes(n, power)) +
geom_smooth() +
geom_point() +
geom_hline(yintercept = 0.8)
```
Now we can narrow down our search to values around 55 (plus or minus 5) and increase the number of replications from 1e3 to 1e4.
```{r narrow-range-sample-sizes}
power_table <- tibble(
n = seq(50, 60)
) |>
mutate(power = map_dbl(n, pwr_func, reps = 1e4))
ggplot(power_table, aes(n, power)) +
geom_smooth() +
geom_point() +
geom_hline(yintercept = 0.8)
```
## Glossary {#sec-glossary-sim -}
```{r, echo = FALSE}
glossary_table()
```
## Further Resources {#sec-resources-sim -}
* [Distribution Shiny App](http://shiny.psy.gla.ac.uk/debruine/simulate/) (or run `reprores::app("simulate")`
* [Simulation tutorials](https://debruine.github.io/data-sim-workshops/) by Lisa DeBruine
* [Chapter 21: Iteration](http://r4ds.had.co.nz/iteration.html) of *R for Data Science*
* [Improving your statistical inferences](https://www.coursera.org/learn/statistical-inferences/) on Coursera (week 1)
* [Faux](https://debruine.github.io/faux/) package for data simulation
* [Simulation-Based Power-Analysis for Factorial ANOVA Designs](https://psyarxiv.com/baxsf) [@lakens_caldwell_2019]
* [Understanding mixed effects models through data simulation](https://psyarxiv.com/xp5cy/) [@debruine_barr_2019]