-
Notifications
You must be signed in to change notification settings - Fork 1
/
1-r-tools.Rmd
535 lines (381 loc) · 19 KB
/
1-r-tools.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
---
title: "Review in R"
output: html_document
date: "2024-06-05"
editor_options:
chunk_output_type: console
---
```{r}
if (!require("palmerpenguins")) install.packages("palmerpenguins")
if (!require("gridExtra")) install.packages("gridExtra")
if (!require("VIM")) install.packages("VIM")
library(palmerpenguins)
library(tidyverse)
library(ggpubr)
library(VIM)
```
```{r}
# accessible color
accessible <- TRUE
if (accessible) {
options(ggplot2.discrete.colour = unname(palette.colors(palette = "Okabe-Ito")))
options(ggplot2.discrete.fill = unname(palette.colors(palette = "Okabe-Ito")))
}
```
# Introduction
Welcome to the `Statistical Modeling with R` workshop! In this workshop, we will go over different stages in statistical modeling with emphasis on coding in R. We will first review some of the core packages in the `tidyverse` framework, such as `dplyr` and `ggplot2`, which are used for data cleaning and data exploration. After that, we will move on to conducting statistical tests and creating statistical models in R.
This workshop will not focus on the theoretical groundwork of these statistical models. Rather, we will familiarize ourselves with the process of running statistical models R and learn to leverage tools from `tidyverse` to check assumptions and visualize the model output.
This markdown document is part 1 of this workshop, which is a review of `tidyverse`.
# Data
For the majority of this workshop, we will be working with the `penguins` data set from `palmerpenguins` package.
```{r}
head(penguins)
```
# Quick Review of R
Being able to utilize core functions of R as well as the data manipulation language in `tidyverse` is an essential skill for statistical analysis. Rarely do you start with the data set that is clean and ready for analysis, and often you will need to go back after modelling after evaluating the model fit.
## Data Manipulation with `dplyr` and `tidyr`
Piping operator `%>%` (or `|>`, which is available in base R as of R 4.1.0) is a syntax often used in tidyverse workflow. It allows you to chain multiple functions together, passing the output of one function to the next. While you could achieve the same functionality by enclosing one function within another, multiple levels of nesting can make the code difficult to read.
```{r, eval=FALSE}
# with piping
data %>%
func1() %>%
func2() %>%
func3()
# without piping
func3(func2(func1(data)))
```
You can select columns with `select()` function.
```{r}
penguins %>%
select(species, island, bill_length_mm) %>%
head()
```
Filter rows with the `filter()` function with filter criteria inside the function.
```{r}
penguins %>%
select(species, island, bill_length_mm) %>%
filter(species == "Adelie", island == "Torgersen", bill_length_mm > 45)
```
You can create new columns or replace existing ones with `mutate()` function.
```{r}
penguins %>%
select(species, sex) %>%
# create a new column to flag missing sex
mutate(missing_sex = is.na(sex)) %>%
head(10)
```
You can group data with `group_by()` function and summarize data with `summarize()` function.
```{r}
penguins %>%
group_by(species) %>%
summarize(
n = n(), # number of rows
mean_bill_length = mean(bill_length_mm), # mean of bill length
sd_bill_length = sd(bill_length_mm) # standard deviation of bill length
)
```
If I were to work naively with the data set, this would already show me that I'll have to do something with the missing values.
There are many other functions in these packages that are useful for data manipulation. There may be more functions that I use in this workshop that I have not covered here, but I will do my best to explain as I go along.
## Exercise 1
Select observations where `sex` is not missing and from `Biscoe` island.
```{r}
```
Bonus: Extend the previous result to count the number of rows from each species within `Biscoe`.
```{r}
```
## Data Visualization with ggplot2
Visualizing your data is a great way to understand the shape and trends in your data. Many statistical models and tests assume that your data is distributed in a certain way, and visualizing your data can help you determine if your data meets these assumptions.
For visualization, we will be using the `ggplot2` package. `ggplot2` is covered in our R Fundamentals workshop, which you can access through [this link](https://github.com/nuitrcs/R-fundamentals-summer-workshop). The basic idea of `ggplot2` is to iteratively add "layers" to change different components of the plot.
With `geom_*()` functions, you can choose the type of plot you would like to show (e.g. `geom_point()` for scatter plot, `geom_boxplot()` for boxplot). You can set axis scales with `xlim()`, `ylim()`, or `coord_cartesian()`. You can use `facet_grid()` or `facet_wrap()` to break down your plots into smaller subsets.
First let's create a simple scatter plot between two continuous variables.
```{r}
ggplot(penguins) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm))
```
You can add color to the points to represent different groups by specifying `color=` inside the `aes()` function.
```{r}
# color by species
ggplot(penguins) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
ggtitle("Bill Length vs Bill Depth", "by Species")
# color by island
ggplot(penguins) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = island)) +
ggtitle("Bill Length vs Bill Depth", "by Island")
```
Let's try faceting the plot by year and look at body mass across species.
```{r}
ggplot(penguins) +
facet_wrap(~species) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = island)) +
ggtitle("Bill Length vs Bill Depth", "by Island")
```
## Exercise 2
Plot the relationship between `bill_depth_mm` and `body_mass_g` with `geom_point()`.
Bonus: color the points by `sex` variable
```{r}
ggplot(penguins) +
# add your code here
```
# Cleaning and Summarizing Data
## Data types
Data can come in a number of different types, and this can influence the type of visualization and model that you use. Largely there are two types of variables: **continuous** (or quantitative) and **categorical** variable.
**Continuous variables** are numerical values that can change on a continuous scale. Example of continuous variables are weight, height, and price. In the `penguins` data, `bill_length_mm` and `bill_depth_mm` are examples of continuous variables.
**Categorical variables** are types of variables that can be split into distinct categories or labels that represent different groups. They can be ordered (Ordinal) or non-ordered (Nominal). In the `penguins` data, `species` and `island` are examples of categorical variables.
When you start working with a data, one of the first things you want to do is get an idea of which data is continuous or categorical. This is often clear from reading the documentation on the data set or taking a quick look at the data.
```{r}
glimpse(penguins)
```
In R, you'll often see that the continuous variables are `double` or `integer`. The categorical variables are often `factor` or `character`. The type is shown inside the angled brackets when using `glimpse()`.
For categorical variables, if you know what the possible categories are, it can be advantageous to convert the `character` data types to `factor`.
- This can potentially save lots of memory if you are dealing with lots of observations
- It protects you from inadvertently introducing a new category, which you might not have accounted for in your model.
- When visualizing, you can decide the ordering of the categories
Following is a common error that you might enconter when working with factors.
```{r}
x <- factor(c('a','a','a','b','b'), levels = c('a','b'))
print(x)
x[5] <- "c" # set 5th element to "c"
print(x)
```
Ultimately, which variable you would like to treat as category vs continuous depends on how you would like to characterize the data. For example, let's look at `year`.
```{r}
print(class(penguins$year))
unique(penguins$year)
```
Although the data type is `integer` and I could interpret this as a continuous data, we only have three distinct years, and instead of the trend over time, I might just be interested in characterizing the effect of each year. In this context, I might set `year` as factor instead of keeping it as `integer`.
## Missing Data
Dealing with missing data is an important part of statistical analysis. There can be different reasons for missing data, and depending on the mechanism it can impact the performance of your model. Sometimes it might not affect your model substantively. Other times it can bias your results and possibly invalidate your finding. Therefore it is important to understand why the data is missing and deal with it accordingly.
The subject on missing data is an important field of research in statistics and data science, and going in depth about exploring missingness and different methods for dealing with them warrants another workshop.
```{r}
# select any rows with missing data
penguins %>%
filter(if_any(everything(), is.na))
```
There are many great options for visualizing missingness pattern. You can choose to explore them yourself using `ggplot2`. There are many packages available [online](https://cran.r-project.org/web/views/MissingData.html#exploration) for this purpose.
Here is an example of missing data summary using `VIM` package.
```{r}
# using missing visualization from VIM package
plot(aggr(penguins, plot = FALSE), numbers = TRUE, prop = FALSE, cex.axis = 0.6)
```
In the penguins dataset, most of the missing data were in the sex of the penguins.
What you decide to do from here depends on several factors. In this case, I only have 11 penguins with missing sex out of 344, which is only 3% of the data. Since this is a small portion of data, it can be reasonable decide that it is not worthwhile to try to retrieve the data points and that removal of these points will not substantially change the outcome of my analysis.
Here I will take that approach and do **listwise** deletion, which excludes data points with any single missing value.
```{r}
# drop observations with NA
pgns <- penguins %>% drop_na()
dim(pgns)
```
Another alternative, which I might have considered if there were more missing data points, is try to do **imputation**, which is a process of filling the missing values with best guess.
## Descriptive Statistics
You often want to get descriptive statistics to give a numerical summary of the data. You can also do quick group comparisons by utilizing the `group_by()` and `summarise()` functions from `dplyr`.
Here are some useful functions for summarizing data: `n()`, `mean()`, `sd()`, `median()`, `quantile()`, `min()`, `max()`.
```{r}
pgns %>%
group_by(species, sex) %>% # group by species and sex
summarise(
n = n(),
bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
bill_length_sd = sd(bill_length_mm, na.rm = TRUE)
)
```
### Exercise 3
Calculate the median, 25th quantile, and 75th quantile of `body_mass_g` for each species. For quantile, you can use `quantile()` function.
```{r}
```
Mean is commonly used to characterize the central tendency of the data. Be careful, though: mean is sensitive to outliers, so if you have data with outliers that is causing a heavy skew (i.e. not symmetrically distributed), you will want to use the median, which is more robust.
```{r}
set.seed(100)
simdata <- data.frame(x = c(rgamma(200, shape = 0.7, scale = 8), rnorm(10, 80)))
ggplot(simdata) +
geom_histogram(aes(x = x)) +
geom_vline(aes(xintercept = mean(x)), color = "blue") +
geom_vline(aes(xintercept = median(x)), color = "red") +
ggtitle("Skewed Distribution with Outliers", "Red Line: Median; Blue Line: Mean")
```
How do we determine if there's a skew or extreme outliers? This is where visualization comes in handy.
## Data Visualization
Depending on the type of variable you're exploring, you want to use different type of visualization. In this step, you are looking to understand the distribution of variables and relationship between them.
### Distribution of a Continuous Variable
When it comes to looking at a distribution of a continuous variable, you can use a histogram. This plot shows the distribution by binning across the continuous scale and showing the frequency within each bin.
You should choose the width of the bin so that it best characterizes your data. Choose a width too wide, you will be masking some important trends; choose a width too small, you might not get enough samples in each bin to show meaning pattern.
```{r}
ggplot(pgns) +
geom_histogram(aes(bill_length_mm)) +
ggtitle("Histogram of Bill Length (mm)")
```
```{r}
g1 <- ggplot(pgns) +
geom_histogram(aes(bill_length_mm), binwidth = 5) +
ggtitle("Histogram of Bill Length (mm)", "binwidth = 5")
g2 <- ggplot(pgns) +
geom_histogram(aes(bill_length_mm), binwidth = 1) +
ggtitle("Histogram of Bill Length (mm)", "binwidth = 1")
g3 <- ggplot(pgns) +
geom_histogram(aes(bill_length_mm), binwidth = 0.5) +
ggtitle("Histogram of Bill Length (mm)", "binwidth = 0.5")
g4 <- ggplot(pgns) +
geom_histogram(aes(bill_length_mm), binwidth = 0.1) +
ggtitle("Histogram of Bill Length (mm)", "binwidth = 0.1")
ggarrange(g1, g2, g3, g4)
```
An alternative to histogram is a density plot. This type of plot can be thought of as smoothed histogram. The smoothness of the density can be controlled by an argument `bw=`. The higher the value is, the more the density will look smoother. By default this is chosen for you by an algorithm.
```{r}
ggplot(pgns) +
geom_density(aes(x = bill_length_mm))
```
## Exercise 4
Try different `bw=` to see how it affects the density plot.
```{r}
ggplot(pgns) +
geom_density(aes(x = bill_length_mm), bw = ) # complete code
```
### Continuous vs Categorical
You can compare the distribution across different groups by specifying `fill=`.
```{r}
g1 <- ggplot(pgns) +
geom_histogram(aes(x = bill_length_mm, fill = species)) +
ggtitle("Distribution of Bill Length (mm)", "Color by Species")
g2 <- ggplot(pgns) +
geom_density(aes(x = bill_length_mm, fill = species)) +
ggtitle("Distribution of Bill Length (mm)", "Color by Species")
ggarrange(g1, g2, nrow = 1, common.legend = TRUE)
```
Alternatively, to compare distribution of a continuous variable across groups (continuous vs categorical), you can make a box plot or violin plot.
```{r}
g1 <- ggplot(pgns) +
geom_boxplot(
aes(x = species, y = bill_length_mm),
position = position_dodge2(preserve = "single")
) +
ggtitle("Distribution of Bill Length (mm)")
g2 <- ggplot(pgns) +
geom_violin(
aes(x = species, y = bill_length_mm)
) +
ggtitle("Distribution of Bill Length (mm)")
ggarrange(g1, g2, nrow = 1, common.legend = TRUE)
```
If you don't have too many data points, you can also choose to plot individual points on top of box plots. You can achieve this with `geom_jitter()`.
```{r}
ggplot(pgns, aes(x = species, y = bill_length_mm)) +
geom_boxplot() +
geom_jitter(alpha = 0.8) +
ggtitle("Distribution of Bill Length (mm)")
```
### Exercise 5
Use `iris` data set to compare `Sepal.Length` across different `Species`.
```{r}
ggplot(iris) +
# complete code
```
## Comparing Two Continuous Variables
You can produce a scatter plot for comparing two continuous variables.
```{r}
ggplot(pgns) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm))
```
```{r}
ggplot(pgns) +
geom_point(aes(x = bill_length_mm, y = body_mass_g, color = species))
```
## Distribution of Categorical Variables
For looking at how categorical variables are distributed, you're often interested in relative counts. Bar plots are appropriate for comparing counts.
```{r}
ggplot(pgns) +
geom_bar(aes(x = species)) +
ggtitle("Distribution of Species")
```
You can use `fill=` to further break down the plots into a second grouping variable.
```{r}
g1 <- ggplot(pgns) +
geom_bar(aes(x = species, fill = sex)) +
ggtitle("Distribution of Sex by Species")
# setting position = position_dodge(1) splits the bars for each sub-group
g2 <- ggplot(pgns) +
geom_bar(aes(x = species, fill = sex), position = position_dodge(1)) +
ggtitle("Distribution of Sex by Species", "with position_dodge(1)")
ggarrange(g1, g2, nrow = 1, common.legend = TRUE)
```
## Exercise 6
Plot the relationship between `bill_depth_mm` and `flipper_length_mm`. Also, try coloring points by different grouping variables.
```{r}
```
## Bonus: Plot Descriptive Statistics
Finally, there is one more useful feature in ggplot that you can add to your toolbelt. This is `stat_summary()`.
If I want to create a plot comparing the means across different groups (for example, crossing the levels between sex and species), I might approach it the following way. Here, I use the original data set to get the mean and standard deviation for each group, and then used the resulting table to plot the means with +- 1 std using `geom_point()` and `geom_errorbar()`.
```{r}
grouped_summary <- pgns %>%
group_by(species, sex) %>%
summarise(
mean_body_mass = mean(body_mass_g),
sd_body_mass = sd(body_mass_g)
)
print(grouped_summary)
# calculate group means and standard deviation
ggplot(grouped_summary) +
geom_point(aes(x = species, y = mean_body_mass, color = sex)) +
# geom_errorbar() takes ymin and ymax aesthetic variables to draw an errorbar
geom_errorbar(aes(
x = species,
ymin = mean_body_mass - sd_body_mass,
ymax = mean_body_mass + sd_body_mass,
color = sex,
width = 0.2
))
```
`stat_summary()` takes your original data, summarizes the data using the desired function, and plots the summary - all in one step! Grouping structure is defined by `x` and other mapping aesthetics such as `color`, `fill`, `shape`, etc.
Arguments inside `stat_summary()`:
- `fun=` argument takes in a function used to summarize the data
- `geom=` defines what geometric object is used to display the data. The values for `geom=` are what you see in the `geom_*()` functions. For example, to plot a point for the transformed data, you can set `geom="point"`.
```{r}
ggplot(pgns, aes(x = species, y = body_mass_g, color = sex)) +
stat_summary(fun = mean, geom = "point") +
stat_summary(fun.data = "mean_sd", geom = "errorbar", width = 0.2)
```
```{r}
ggplot(pgns, aes(x = species, y = body_mass_g, fill = sex)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge(0.95)) +
stat_summary(fun.data = "mean_sd", geom = "errorbar", width = 0.2, position = position_dodge(0.95))
```
# Exercise Solutions
## Exercise 1
```{r}
penguins %>%
filter(!is.na(sex), island == "Biscoe")
penguins %>%
filter(!is.na(sex), island == "Biscoe") %>%
group_by(species) %>%
count()
```
## Exercise 2
```{r}
ggplot(penguins) +
# add your code here
geom_point(aes(x = bill_depth_mm, y = body_mass_g, color = sex))
```
## Exercise 3
```{r}
pgns %>%
group_by(species) %>%
summarize(
median = median(body_mass_g),
`25th Quantile` = quantile(body_mass_g, 0.25),
`75th Quantile` = quantile(body_mass_g, 0.75)
)
```
## Exercise 4
```{r}
ggplot(pgns) +
geom_density(aes(x = bill_length_mm), bw = 1)
```
## Exercise 5
```{r}
ggplot(iris) +
geom_boxplot(aes(x = Species, y = Sepal.Length))
```
## Exercise 6
```{r}
ggplot(pgns) +
geom_point(aes(x = bill_depth_mm, y = flipper_length_mm, color = species))
```