---
title: "Statistical Tests and Models"
output: html_document
date: "2024-06-05"
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(1234)
library(palmerpenguins)
library(tidyverse)
library(broom)
pgns <- penguins %>%
  drop_na() %>%
  mutate(
    species = factor(species),
    island = factor(island),
    sex = factor(sex),
    year = factor(year)
  )
```
```{r}
accessible <- TRUE
if (accessible) {
  mypal <- unname(palette.colors(palette = "Okabe-Ito"))
  options(ggplot2.discrete.colour = mypal)
  options(ggplot2.discrete.fill = mypal)
}
```
# Introduction
In part 2 of this workshop, we will learn how to run statistical tests and models in R. We will consider scenarios in which we want to test a hypothesis. For example:

- "Are two categorical variables related in some way, or are they independent?"
- "Are the means of a continuous variable equal across two groups?"
The type of test or model that you use depends on what kind of variables you are dealing with and what the hypothesis tests for.
### Note on Assumptions
Each of these tests makes a set of **assumptions** about your data, which need to be met for the test to give a valid result. In reality, your data often do not fully meet these assumptions, and serious violations can negatively impact your results by inflating the type I error rate (false positives) or reducing your ability to detect true effects.
Do not be discouraged, though: many of these tests are **robust** to mild violations of some assumptions. It is still important to check the assumptions in your data; the goal is not to confirm that every assumption is perfectly satisfied, but to look for serious violations. One way to check assumptions is with graphical methods, which is where our familiarity with data manipulation tools and `ggplot2` comes into play.
### Quick Note on Hypothesis Testing
Hypothesis testing is a procedure in which you make a statement about a population and put that statement to the test. Typically you define a null hypothesis ($H_0$) and an alternative hypothesis ($H_a$ or $H_1$) based on your research question. The null hypothesis specifies how your data are distributed in the population. You then use the observed data to measure the evidence **against** the null hypothesis.
One way we quantify this evidence is with a **p-value**: the probability, under the null hypothesis, of observing data at least as extreme as what we actually observed (assuming the model is correct and the sample is representative of the population).
As I present the statistical models and tests, I will also include the null and alternative hypotheses to illustrate what is being tested in these procedures.
### Simulation
For some of these illustrations I will use simulated data. A function we commonly use for simulation is `sample()`, which draws random samples from a set of values.
To illustrate the `sample()` function, imagine rolling a fair 6-sided die 5 times. To simulate this, I could write:
```{r}
set.seed(100)
sample(
  x = 1:6,        # defines a set of possible outcomes from a random experiment
  size = 5,       # number of times I would like to sample
  replace = TRUE  # TRUE indicates that I could get the same outcome in the sampling procedure
)
```
In the long run, the proportion of each outcome should be roughly equal to 1/6.
```{r}
x <- sample(x = 1:6, size = 1e6, replace = TRUE)
x <- factor(x, 1:6)
data.frame(x = x) %>%
  ggplot() +
  geom_bar(aes(x = x)) +
  geom_hline(yintercept = 1e6 * 1/6, color = "red", linewidth = 1.5, linetype = "dashed") +
  ggtitle("Count of Each Outcome", "Red Line at the Expected Count, 1e6 / 6")
```
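Beyond the plot, a quick numeric check is to tabulate the sample proportions; a one-line sketch using base R:
```{r}
# sample proportion of each face; each should be close to 1/6 (about 0.167)
prop.table(table(x))
```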
It does look close enough, so the die must be fair. Instead of eyeballing the plot, though, we can also quantitatively assess whether the die is fair by using a statistical test.
## Categorical Data Analysis
### Chi-Squared Test for Goodness-of-Fit
In the die simulation I performed above, we have one categorical variable: the *outcome of a roll*, which is one of 1, 2, 3, 4, 5, or 6.
The first test we will look at is the Chi-Squared Goodness-of-Fit test, which tests whether a categorical outcome follows a specified distribution.
Assumptions:

- The variable of interest is categorical
- Levels of the categorical variable are mutually exclusive
- The observations are independent of one another
- The expected count in each cell should not be too small (an old rule of thumb is \> 5)
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}$
$H_a: \text{At least one of } p_i \neq \frac{1}{6}$
```{r, eval=FALSE}
# read chisq.test documentation
?chisq.test
```
```{r}
x_table <- table(x)
print(x_table)
# Chi-Squared Goodness-of-Fit test
chisq.test(x = x_table, p = rep(1/6, 6))
```
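For contrast, here is a hypothetical sketch (not part of the original simulation) of a loaded die; the made-up probabilities below favor rolling a 6, so the test should now reject $H_0$:
```{r}
# simulate a loaded die -- these probabilities are invented for illustration
loaded <- sample(x = 1:6, size = 1000, replace = TRUE,
                 prob = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5))
chisq.test(x = table(factor(loaded, 1:6)), p = rep(1/6, 6))
```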
### Chi-Squared Test of Independence
If we want to check whether two categorical variables are independent, we can run the Chi-Squared Test of Independence. Perhaps I am more likely to flip heads when I flip a coin with my right hand than with my left hand. Or maybe these two things are independent, and the proportion of heads stays the same regardless of which hand I use.
Assumptions:

- The two variables are categorical
- Levels in each categorical variable are mutually exclusive
- The observations are independent of one another
- The expected count in each cell should not be too small (an old rule of thumb is \> 5)
To illustrate this statistical test, we will use a Covid-19 briefing [dataset](https://www.openintro.org/data/index.php?data=simpsons_paradox_covid) from the UK in 2021.
```{r}
covid <- read.table("https://www.openintro.org/data/tab-delimited/simpsons_paradox_covid.txt", header = TRUE, sep = "\t")
print(head(covid))
```
Each row represents an individual, with information about the person's age group (under 50 or 50 +), vaccination status, and whether the individual died or survived.
Let's try to see if there is a relationship between vaccine status and outcome.
```{r}
# create barplot
ggplot(covid) +
  geom_bar(aes(x = vaccine_status, fill = outcome), position = position_dodge(1)) +
  ggtitle("Number of Outcomes by Vaccination Status")
# raw counts might not be so useful -- we can instead calculate proportions
# and plot those
prop_covid <- covid %>%
  # get number of observations per vaccine_status-outcome combination
  count(vaccine_status, outcome) %>%
  # get proportion of each outcome per vaccination status
  group_by(vaccine_status) %>%
  mutate(prop_outcome = n / sum(n))
prop_covid
prop_covid %>%
  filter(outcome == "death") %>%
  ggplot() +
  # if I have a column for the bar's height, I can use geom_col()
  geom_col(aes(x = vaccine_status, y = prop_outcome)) +
  ggtitle("Proportion of Death Outcome by Vaccination Status")
```
Since these are both categorical variables, we can use the Chi-Squared Test of Independence.
$H_0: \text{Vaccination status and outcome are independent}$
$H_a: \text{Vaccination status and outcome are not independent}$
```{r}
chisq.test(x = covid$vaccine_status, y = covid$outcome)
```
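As a quick check of the expected-cell-count assumption, we can save the test object and inspect the expected counts it stores:
```{r}
covid_test <- chisq.test(x = covid$vaccine_status, y = covid$outcome)
# expected counts under independence; each should be comfortably above 5
covid_test$expected
```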
### Exercise 1
Using the same data set, let's look at one more thing. There is an `age_group` variable in the data set. Let's look at the proportion of deaths for each vaccination status within each age group.
```{r}
covid %>%
  count(age_group, vaccine_status, outcome) %>%
  group_by(age_group, vaccine_status) %>%
  mutate(proportion = n / sum(n)) %>%
  filter(outcome == "death") %>%
  ggplot() +
  facet_wrap(~age_group, scales = "free") +
  geom_col(aes(x = vaccine_status, y = proportion))
```
Then, try running `chisq.test` for each age group to test independence of `vaccine_status` and `outcome`.
```{r}
```
## Continuous Variable (Mean Comparison)
### One Sample T-Test
A one-sample t-test can be used on a continuous variable to test whether the population mean differs from a specific value.
Assumptions:

- The variable is continuous
- The observations are independent of one another
- The data is approximately normally distributed
- The dependent variable does not contain (spurious) extreme outliers
Suppose I am interested in testing if the mean body mass of `Adelie` penguins is 3500.
$H_0: \mu_{mass,Adelie} = 3500$
$H_a: \mu_{mass,Adelie} \neq 3500$
```{r}
pgns_adelie <- pgns %>%
  filter(species == "Adelie")
ggplot(pgns_adelie) +
  geom_boxplot(aes(y = body_mass_g))
ggplot(pgns_adelie) +
  geom_histogram(aes(x = body_mass_g), binwidth = 50)
# mu sets the mean under the null hypothesis (the default is 0)
t.test(pgns_adelie$body_mass_g, mu = 3500)
```
### Two Sample T-Test
A two-sample t-test is used to test whether the population means of two groups are equal.
Assumptions:

- The variable is continuous
- The observations are independent of one another
- The data in each group is approximately normally distributed
- The dependent variable does not contain extreme outliers
Suppose I am interested in testing whether the mean body mass of `Adelie` penguins differs from that of `Gentoo` penguins.
$H_0: \mu_{mass,Adelie} = \mu_{mass,Gentoo}$
$H_a: \mu_{mass,Adelie} \neq \mu_{mass,Gentoo}$
```{r}
# check the distribution of the data in each group
pgns %>%
  filter(species %in% c("Adelie", "Gentoo")) %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(show.legend = FALSE) +
  stat_summary(geom = "point", fun = mean, show.legend = FALSE, pch = "x", size = 5, color = "white")
pgns %>%
  filter(species %in% c("Adelie", "Gentoo")) %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(show.legend = FALSE) +
  stat_summary(geom = "point", fun = mean, show.legend = FALSE, pch = "x", size = 5, color = "white")
```
```{r}
t.test(
  body_mass_g ~ species,
  data = pgns %>% filter(species %in% c("Adelie", "Gentoo"))
)
```
The first argument passed to `t.test()` is a formula object, which we will cover shortly. For now, what you should know is that the variable on the left is the dependent variable and the variable on the right is the independent variable. In the example above, we are comparing the mean of body mass (dependent variable) across the species of penguin (independent variable). Note that by default `t.test()` runs Welch's t-test, which does not assume equal variances between the two groups.
### One-way ANOVA
If you have more than two groups in a categorical independent variable and you would like to test whether the group means are all equal, you can use ANOVA.
ANOVA (Analysis of Variance) is a family of models for comparing means across three or more groups. There are different flavors of ANOVA, each suitable for different experimental designs. If we are comparing the means of a continuous variable across groups defined by one independent variable, we can use **one-way ANOVA**.
This is an **omnibus test**: if any pair of means differs, the test will show statistical significance, but it will not tell you which pair (see the follow-up test after the ANOVA output below).
Assumptions:

- The dependent variable is continuous
- The observations are independent of one another
- The data in each group is approximately normally distributed
- The variance of the data in the different groups should be approximately equal
- The dependent variable does not contain extreme outliers
Instead of comparing the mean body mass between two species, we can extend the analysis to test whether the means across all three species are equal.
$H_0: \mu_{mass,Adelie} = \mu_{mass,Gentoo} = \mu_{mass,Chinstrap}$
$H_a: \text{At least one } \mu_i \text{ not equal}$
```{r}
pgns %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(show.legend = FALSE) +
  stat_summary(geom = "point", fun = mean, show.legend = FALSE, pch = "x", size = 5)
```
```{r}
my_aov <- aov(body_mass_g ~ species, data = pgns)
print(summary(my_aov))
```
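Since the omnibus test does not tell us which pair of means differs, a common follow-up (one option among several) is Tukey's Honest Significant Difference test, which base R provides for `aov` objects:
```{r}
# all pairwise mean comparisons, controlling the family-wise error rate
TukeyHSD(my_aov)
```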
### Exercise 2
Compare `body_mass_g` across `island`. Plot distributions to check assumptions, and then run `aov()`.
```{r}
```
### R Formula Object
There is one more piece of R syntax we should learn: the `formula` object. If you have done any modeling in R, you may have encountered it already.
This is what a formula object looks like:
```{r}
# a two-sided formula
lhs ~ rhs
```
The presence of `~` tells R that this is a formula object. The meaning of the formula depends on what the function receiving it wants to do with it. In the context of modeling, the formula usually represents the relationship between the dependent variable (left-hand side; `lhs`) and the independent variable(s) (right-hand side; `rhs`).
Most often you will use this syntax to specify the relationship between the dependent variable and the independent variables in statistical models such as linear regression (`lm()`) and generalized linear models (`glm()`).
This isn't always the case, though; a formula object can also be one-sided, with the left-hand side empty. For example, `ggplot2`'s `facet_wrap()` function accepts a one-sided formula to define the stratification variable for faceting:
```{r}
ggplot(pgns) +
  facet_wrap(~species) + # one-sided formula to stratify by species
  geom_point(aes(x = flipper_length_mm, y = body_mass_g))
```
There are special operators that you can use in a formula object to indicate particular relationships. For the remainder of this workshop we will be working with linear models, so let's look at the different operators and translate each into a linear model. (A quick way to verify these expansions yourself is shown after the list.)
- `+`: add variables
```{r}
y ~ x1 + x2 + x3
```
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3$
- `-`: remove variables. This is most commonly used to remove the intercept.
```{r}
y ~ -1 + x1 + x2
```
$y = \beta_1x_1 + \beta_2x_2$
- `*`: main effects plus interaction
```{r}
y ~ x1*x2
```
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2$
- `:`: interaction
```{r}
y ~ x1:x2
```
$y = \beta_0 + \beta_1x_1x_2$
NOTE: `y ~ x1*x2` is equivalent to `y ~ x1 + x2 + x1:x2`
- `^`: This one is tricky. On its own it does not mean raising a variable to a power; it indicates interactions up to the given order.
```{r}
y ~ (x1 + x2 + x3)^2
```
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_1x_2 + \beta_5x_1x_3 + \beta_6x_2x_3$
- `I()`: "As-**I**s". This tells R to evaluate the expression inside literally; you should use it for polynomial regression.
```{r}
y ~ x1 + I(x2^2)
```
$y = \beta_0 + \beta_1x_1 + \beta_2{x_2}^2$
- Alternatively, you can use `poly()` for polynomial regression, which automatically includes the lower-order terms of the variable.
```{r}
y ~ x1 + poly(x2, degree = 2)
```
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3{x_2}^2$
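If you are ever unsure which terms a formula expands to, one quick check (a sketch using base R and our penguin data) is to inspect the design matrix that the formula generates:
```{r}
# the column names show the expanded terms, including the interaction
head(model.matrix(body_mass_g ~ flipper_length_mm * sex, data = pgns))
```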
### Exercise 3
Write a formula that models `body_mass_g` as a sum of `species`, `sex`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`.
Extend the formula above by adding interaction between `species` and `sex`.
```{r}
```
## Linear Regression
Linear regression is used to model a linear relationship between a dependent variable and one or more predictors. For simplicity, I will call the dependent variable Y and the predictors X.
The mathematical formula for linear regression can be represented as
$Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \dots + \beta_pX_{p} + \epsilon$
where $Y$ is the outcome, $X_{k}$ is the $k$th predictor, and $\epsilon$ is the error term.
Assumptions:

- There is a linear relationship between the X's and the mean of Y
- Observations are independent of each other
- The variance of the error term is the same for any value of X (homoskedasticity)
- The error term is normally distributed with mean 0
First, let's try simple linear regression. Suppose I am modelling the relationship between flipper length and body mass.
```{r}
ggplot(pgns) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g)) +
  ggtitle("Body Mass (g) vs Flipper Length (mm)")
```
For linear regression, you can use the `lm()` function. Use `summary()` on the model output to get a summary of the results.
```{r}
mymodel <- lm(body_mass_g ~ flipper_length_mm, data = pgns)
summary(mymodel)
```
```{r}
ggplot(pgns) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_abline(intercept = mymodel$coefficients[1], slope = mymodel$coefficients[2], color = "red") +
  ggtitle("Body Mass (g) vs Flipper Length (mm)", "with Best Fit Line")
```
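As an aside, `ggplot2` can fit and draw the same line directly with `geom_smooth(method = "lm")`, which also shades a confidence band by default:
```{r}
ggplot(pgns, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") + # fits lm(y ~ x) behind the scenes
  ggtitle("Body Mass (g) vs Flipper Length (mm)", "with geom_smooth()")
```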
One way to check the model assumptions is to call the `plot()` function directly on the model output.
```{r}
plot(mymodel)
```
With the first plot (Residuals vs Fitted), you can check whether the linearity and constant-variance assumptions are being met.
The second plot (Q-Q Residuals) checks whether the error term (the residuals) is normally distributed; you want the points to lie roughly on the diagonal line.
The third plot (Scale-Location) also helps you diagnose non-constant variance.
The last plot (Residuals vs Leverage) is a little more complicated: here you look for points with high leverage (x-axis) and large absolute standardized residuals (y-axis). These are points that can strongly influence the fit of your linear model.
Extending the model to multiple predictors is straightforward. Let's try a more complicated model: I'll model body mass against species, island, sex, bill length, bill depth, and flipper length.
```{r}
mymodel2 <- lm(body_mass_g ~ species + island + sex + bill_length_mm + bill_depth_mm + flipper_length_mm, data = pgns)
summary(mymodel2)
```
Here are the diagnostic plots for this model:
```{r}
plot(mymodel2)
```
### Exercise 4
In the previous ANOVA example we saw significant differences in body mass between islands, but in the linear regression the island effects no longer look statistically significant. Can you make a plot that explains why this is the case?
Hint: since body mass is numeric and island and species are categorical, you could use a boxplot to show the distribution across the categories of one variable, and use fill to split by the second variable.
```{r}
```
### Exercise 5
After looking at this plot, I also suspect that there is an interaction between species and sex; in other words, the relationship between sex and body mass may depend on the species. Can you create a linear model that captures this behavior?
What p-values did you get for the interaction terms? Also create a diagnostic plot: do you think the model improved compared to the model without the interaction terms?
```{r}
ggplot(pgns) +
  geom_boxplot(aes(x = species, y = body_mass_g, fill = sex))
```
```{r}
```
## Bonus 1: Using `broom` Package
Instead of just looking at the printed model output, you might also want to retrieve some parts of the model as a data frame, for visualization or for comparison across models.
This is how that looks in base R:
- Getting coefficients, standard errors, and p-values
```{r}
summary(mymodel2)$coefficients
```
- Getting residuals
```{r}
head(residuals(mymodel2))
# or
head(mymodel2$residuals)
```
- Getting fitted values from the original data
```{r}
head(mymodel2$fitted.values)
```
There are several problems with this approach. It is a lot of syntax to memorize, and there is no guarantee that the names of these components are the same across different kinds of model objects. Also, you might want to refer back to the original data when looking at residuals and fitted values, so it would be nice to retrieve them all together. This is where the `broom` package shines: it takes the messy output of models and transforms it into a tidy table format.
There are 3 core functions in this package:

- `tidy`: constructs a table that summarizes the model's statistical findings.
- `augment`: adds columns to the original data set that was modeled (e.g. fitted values, residuals).
- `glance`: constructs a one-row summary of the entire model.
The `broom` package tidies models from popular modelling packages and supports most model objects that come with base R. You can check which methods are supported in [`vignette("available-methods")`](https://broom.tidymodels.org/articles/available-methods.html).
```{r}
tidy(mymodel2)
augment(mymodel2)
```
New columns start with `.` to avoid overwriting original columns.
`glance()` returns summary statistics that describe the fit of the model as a whole, such as the R-squared or the F-statistic.
```{r}
glance(mymodel2)
```
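Because `glance()` always returns a one-row tibble, comparing models is just a matter of stacking rows. For example, a quick sketch comparing our simple and extended models:
```{r}
# one row per model makes side-by-side comparison easy
bind_rows(
  simple   = glance(mymodel),
  extended = glance(mymodel2),
  .id = "model"
)
```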
`broom` allows me to explore the model more easily; for example, I can recreate the diagnostic plots and even add more information.
```{r}
mymodel2 %>%
  augment() %>%
  mutate(rn = row_number()) %>%
  ggplot(aes(x = .hat, y = .std.resid, color = sex)) +
  geom_point() +
  geom_text(aes(label = rn), vjust = 1.5) +
  ggtitle("Recreate Residuals vs Leverage")
```
## Bonus 2: Generalized Linear Model - Logistic Regression
If you have an outcome variable that is categorical (e.g. sex, species) or numeric but not continuous (e.g. a count), then a Generalized Linear Model (GLM) can be an option. Depending on the type of outcome variable you have, you can use a different class of GLM:

- For a binary outcome, consider **logistic regression**
- For a count outcome, consider **Poisson regression**
- For a continuous non-negative outcome, consider a GLM with a **Gamma distribution**
Let's consider a case in which we would like to predict `sex` in the penguins data set we've been using.
```{r}
table(pgns$sex)
```
Since `sex` is either male or female in this data set, we have a binary outcome, so we can use logistic regression.
We can use `glm()` to run logistic regression. The syntax is very similar to `lm()`, but now we also specify `family=` to tell R which type of regression to run and with which link function.
```{r}
logreg_model <- glm(
  formula = sex ~ species + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
  data = pgns,
  family = binomial(link = "logit")
)
summary(logreg_model)
```
We can see that the `broom` package functions also support logistic regression:
```{r}
logreg_model %>%
  broom::tidy()
```
Exponentiating these coefficients gives odds ratios, which can be more interpretable. A value greater than 1 means that an increase in that predictor makes the outcome more likely to occur; in this case, a higher probability of a penguin being male.
```{r}
logreg_model %>%
  broom::tidy(exponentiate = TRUE)
```
You can also use `augment()` to get the fitted values. A logistic regression model works on the log-odds scale, so that is the default scale for the fitted values.
```{r}
logreg_model %>%
  augment()
```
To get output on the probability scale, you can set `type.predict = "response"` inside the function. Using this probability, we can make a prediction rule; for example, if the probability is 0.50 or higher, we predict that the penguin is male.
```{r}
prediction_df <- logreg_model %>%
  # get fitted values
  augment(type.predict = "response") %>%
  # predict penguin sex
  mutate(predicted_sex = ifelse(.fitted >= 0.5, "male", "female")) %>%
  select(sex, predicted_sex, .fitted)
print(head(prediction_df))
table(predicted = prediction_df$predicted_sex, actual = prediction_df$sex)
```
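From the confusion table above we can also compute a single summary number; for example, a one-line sketch of overall accuracy:
```{r}
# fraction of penguins whose sex is predicted correctly
mean(prediction_df$predicted_sex == prediction_df$sex)
```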
You can learn more about other options by running `?augment.glm`.
In this section we learned how to predict categories using logistic regression. Predicting a categorical outcome is often referred to as a classification problem in machine learning.
One common way to assess a model for a binary classification problem is to calculate the Area Under the [Receiver Operating Characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC) curve, or AUROC.
There are many R packages that calculate the AUROC; here we will use the `pROC` package.
We can compute the AUROC and plot the ROC curve.
```{r, error = TRUE}
# install the pROC package if it is not already installed
if (!require("pROC")) install.packages("pROC")
library(pROC)
myROC <- roc(prediction_df$sex, prediction_df$.fitted)
print(myROC$auc)
plot(myROC)
```
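`pROC` can also suggest a classification threshold. As a quick sketch, `coords()` with `"best"` returns the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1):
```{r, error = TRUE}
# threshold maximizing Youden's J, with the resulting sensitivity/specificity
coords(myROC, "best", ret = c("threshold", "sensitivity", "specificity"))
```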
## Bonus 3: Mathematical Formulation of Logistic Regression
The following is a mathematical formulation of the logistic regression model:
$$ \log\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots $$
This looks very similar to linear regression except for the left-hand side. Here $p$ represents a probability: in our example, if we set female = 0 and male = 1, then $p$ is the probability that a penguin is male. The term inside the log, $\frac{p}{1-p}$, is the **odds**; a higher value indicates a higher probability of the event.
The function $f(x) = \log\big(\frac{x}{1-x}\big)$ is called the **logit**, and it is essentially what "links" our outcome of interest to the linear predictor.
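In R, the logit and its inverse are available as `qlogis()` and `plogis()`. A small sketch confirming that the model's linear predictor (the log odds) maps back to the fitted probabilities:
```{r}
lp <- head(logreg_model$linear.predictors) # log odds from the linear predictor
plogis(lp)                                 # inverse logit: back to probabilities
head(logreg_model$fitted.values)           # matches the fitted probabilities
```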
## Exercise Solutions
### Exercise 1
```{r}
covid_under_50 <- covid %>% filter(age_group == "under 50")
covid_over_50 <- covid %>% filter(age_group == "50 +")
chisq.test(x = covid_under_50$vaccine_status, y = covid_under_50$outcome)
chisq.test(x = covid_over_50$vaccine_status, y = covid_over_50$outcome)
```
### Exercise 2
```{r}
ggplot(pgns) +
  geom_boxplot(aes(x = island, y = body_mass_g))
# we observe heteroskedasticity
summary(aov(body_mass_g ~ island, data = pgns))
```
### Exercise 3
```{r}
body_mass_g ~ species + sex + bill_length_mm + bill_depth_mm + flipper_length_mm
body_mass_g ~ species*sex + bill_length_mm + bill_depth_mm + flipper_length_mm
```
### Exercise 4
```{r}
ggplot(pgns) + geom_boxplot(aes(x = species, y = body_mass_g, fill = island))
ggplot(pgns) + geom_boxplot(aes(x = island, y = body_mass_g, fill = species))
```
### Exercise 5
```{r}
mymodel3 <- lm(body_mass_g ~ species * sex + island + bill_length_mm + bill_depth_mm + flipper_length_mm, data = pgns)
summary(mymodel3)
plot(mymodel3)
```