---
title: 'Boxplots with {ggplot2}'
output:
  html_document:
    number_sections: true
    toc: true
    toc_float: true
    css: !expr here::here("global/style/style.css")
    highlight: kate
---
```{r, include = FALSE, warning = FALSE, message = FALSE}
## Load packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, knitr, gapminder, here)
## Source functions
source(here("global/functions/lesson_functions.R"))
## Source autograder script quietly
#source(here("lessons/ls06_boxplots_autograder.R"))
```
## Boxplots with {ggplot2}
A side-by-side boxplot lets us compare the distribution of a numerical variable split by the values of another variable.
```{r echo = FALSE}
ggplot(gapminder,
       aes(x = reorder(continent, gdpPercap, median),
           y = gdpPercap,
           fill = continent,
           color = continent)) +
  geom_boxplot(alpha = 0.6,
               linewidth = 1) +
  scale_fill_manual(values = continent_colors) +
  scale_colour_manual(values = continent_colors) +
  geom_jitter(width = 0.25,
              alpha = 0.5,
              color = "black") +
  scale_y_log10(labels = scales::dollar) +
  labs(y = "Income per person",
       title = "GDP per capita grouped by continent",
       subtitle = "Gapminder data from 142 countries (1952-2007)") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        legend.position = "none",
        axis.title.x = element_blank())
```
### Learning Objectives
By the end of this lesson, you will be able to:
1. Plot a boxplot to visualize the distribution of continuous data using **`geom_boxplot()`**.
2. Reorder side-by-side boxplots with the **`reorder()`** function.
3. Add a layer of data points on a boxplot using **`geom_jitter()`**.
### Introduction
A boxplot is one of the simplest ways to visualize the distribution of one or more continuous (numeric) variables.
![Anatomy of a boxplot](images/boxplot_anatomy.png){alt="Anatomy of a boxplot" width="664"}
It consists of two parts:
1. **Box** --- Extends from the first to the third quartile (Q1 to Q3) with a line in the middle that represents the *median*. The range of values between Q1 and Q3 is also known as the *interquartile range (IQR)*.
2. **Whiskers** --- Lines extending from both ends of the box indicate variability outside Q1 and Q3. Each whisker reaches the most extreme observation that still lies within $Q1 - 1.5 \times IQR$ (lower whisker) or $Q3 + 1.5 \times IQR$ (upper whisker). Everything beyond these limits is represented as an *outlier* using dots or other markers.
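The quantities described above can be computed by hand. The following is a minimal sketch, assuming the default quantile method of base R's `quantile()`; the choice of Africa and the variable names (`africa_life`, `lower_fence`, and so on) are illustrative only.

```{r}
library(gapminder)

# Life expectancy values for one group (Africa), chosen purely as an illustration
africa_life <- gapminder$lifeExp[gapminder$continent == "Africa"]

# The box: first quartile, third quartile, and the distance between them
q1 <- quantile(africa_life, 0.25, names = FALSE)
q3 <- quantile(africa_life, 0.75, names = FALSE)
iqr <- q3 - q1

# Limits beyond which an observation counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The whiskers end at the most extreme observations inside the limits
inside <- africa_life[africa_life >= lower_fence & africa_life <= upper_fence]
whiskers <- range(inside)

# Observations outside the limits are drawn as individual points
outliers <- africa_life[africa_life < lower_fence | africa_life > upper_fence]

c(q1 = q1, median = median(africa_life), q3 = q3, iqr = iqr)
whiskers
outliers
```

Note that the whiskers stop at actual data values, so they are usually shorter than the full $1.5 \times IQR$ distance.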
This is a *side-by-side boxplot*. It lets us compare the distribution of a numerical variable split by the values of another variable.
![](images/box_pretty.png){width="667"}
#### Potential pitfalls
Boxplots summarize the data into five numbers, so we might miss important characteristics of the data.
If the amount of data you are working with is not too large, adding individual data points can make the graphic more insightful.
```{r echo = FALSE}
ggplot(gapminder,
       aes(x = reorder(continent, gdpPercap, median),
           y = gdpPercap,
           fill = continent,
           color = continent)) +
  geom_boxplot(alpha = 0.7,
               linewidth = 0.4) +
  scale_fill_manual(values = continent_colors) +
  scale_colour_manual(values = continent_colors) +
  geom_jitter(width = 0.35,
              alpha = 0.2,
              color = "black",
              shape = 16) +
  scale_y_log10(labels = scales::dollar) +
  labs(y = "Income per person",
       title = "GDP per capita grouped by continent",
       subtitle = "Gapminder data from 142 countries (1952-2007)") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        legend.position = "none",
        axis.title.x = element_blank())
ggsave("box_pretty_points.png", path = "images",
       width = 4.5, height = 3)
```
![](images/box_pretty_points.png){width="582"}
### Packages
```{r}
pacman::p_load(tidyverse,
gapminder,
here)
```
### The `gapminder` dataset
For this lesson, we will be visualizing global health and economic data from the **`gapminder`** data frame, which we've encountered in previous lessons.
```{r}
## View first few rows of the data
head(gapminder)
```
::: {.callout-note title='Recap'}
Gapminder is a country-year dataset with information on 142 countries, divided into 5 "continents" or world regions.
```{r}
## Data summary
summary(gapminder)
```
Data are recorded every 5 years from 1952 to 2007 (a total of 12 years).
:::
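The figures quoted in the recap can be checked directly against the data. This is a small sketch using base R only; the object names (`n_countries`, `n_continents`, `years`) are illustrative.

```{r}
library(gapminder)

# Number of distinct countries and continents in the data
n_countries <- length(unique(gapminder$country))
n_continents <- nlevels(gapminder$continent)

# The years in which data were recorded
years <- sort(unique(gapminder$year))

n_countries
n_continents
years
```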
### Basic boxplots with `geom_boxplot()`
We will use boxplots to display and compare *distributions* of variables across multiple groups.
The `gapminder` data frame gives us the life expectancy (`lifeExp`) for each country. Let's make a boxplot of life expectancy across continents.
Let's start with a base boxplot and then add more aesthetics and layers from {ggplot2}.
We will first provide the `gapminder` data frame to `ggplot()` and then specify the aesthetics with the `aes()` function. Inside `aes()`, we will specify the *x*-axis and *y*-axis variables. To make the boxplot of `lifeExp` by `continent`, we will use the `geom_boxplot()` layer.
```{r}
## Simple boxplot of lifeExp by continent
ggplot(gapminder,
       aes(x = continent,
           y = lifeExp)) +
  geom_boxplot()
```
The result is a basic boxplot of `lifeExp` for multiple continents.
Let us add colors to the basic boxplot. We can map the `continent` variable to fill color so that each box is colored according to which continent it represents.
```{r}
ggplot(gapminder,
       aes(x = continent,
           y = lifeExp,
           fill = continent)) +
  geom_boxplot()
```
::: {.callout-note title='Reminder'}
`ggplot2` lets you color a plot according to the values of a variable. Use the `fill` argument inside the `aes()` function to specify which variable is mapped to fill color.
:::
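To make the reminder concrete, the sketch below contrasts *mapping* a variable to fill (inside `aes()`) with *setting* one fixed fill color (outside `aes()`). The object names `p_mapped` and `p_fixed` and the color `"steelblue"` are illustrative choices, not part of the lesson.

```{r}
library(ggplot2)
library(gapminder)

# Mapping: the fill color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(gapminder,
                   aes(x = continent,
                       y = lifeExp,
                       fill = continent)) +
  geom_boxplot()

# Setting: every box gets the same fixed color, so it goes outside aes()
p_fixed <- ggplot(gapminder,
                  aes(x = continent,
                      y = lifeExp)) +
  geom_boxplot(fill = "steelblue")

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does the color carry information about the data.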
::: {.callout-tip title='Practice'}
Using the `gapminder` data frame create a boxplot comparing the distribution of GDP per capita (**`gdpPercap`**) for each continent.
Map the **fill color** of the boxes to `continent`, and set the **line width** to 1.
```{r eval = FALSE}
## Type and view your answer:
q1 <- "YOUR ANSWER HERE"
q1
```
Building on your code from the last question, add a **`scale_*()`** function that transforms the y-axis to a **logarithmic scale**.
```{r eval = FALSE}
## Type and view your answer:
q2 <- "YOUR ANSWER HERE"
q2
```
:::
The `continent` variable is a factor whose levels are in alphabetical order, so the boxes appear alphabetically by default. It might be more useful to order them by the mean or median life expectancy.
### Reordering with `reorder()`
We can change the order of the boxplots by using the `reorder()` function to rearrange the data being mapped on the *x*-axis.
`reorder()` treats its first argument as a categorical variable (usually a factor), and reorders its levels based on the values of a second variable (usually numeric). By default, the levels are sorted by the mean of the second variable within each level. To reorder the levels of the `continent` variable based on `lifeExp`, we will write this: `reorder(continent, lifeExp)`.
Here we will edit the `x` argument and tell `ggplot` to plot the reordered variable.
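The effect of `reorder()` can be inspected outside of a plot by comparing factor levels before and after. This is a minimal sketch; the names `by_mean_life` and `group_means` are illustrative.

```{r}
library(gapminder)

# Original level order of the continent factor
levels(gapminder$continent)

# Levels reordered by life expectancy (reorder() uses the mean by default)
by_mean_life <- reorder(gapminder$continent, gapminder$lifeExp)
levels(by_mean_life)

# Mean life expectancy per level, listed in the new level order
group_means <- tapply(gapminder$lifeExp, by_mean_life, mean)
group_means
```

The group means come out in increasing order, which is the order in which the boxes will be drawn from left to right.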
```{r}
ggplot(gapminder,
       aes(x = reorder(continent, lifeExp),
           y = lifeExp)) +
  geom_boxplot()
```
We can clearly see that there are notable differences in median life expectancy between continents. However, there is a lot of overlap between the ranges of values from each continent. For example, the median life expectancy for the continent of Africa is lower than that of Europe, but several African countries have life expectancy values higher than the majority of European countries.
#### Reordering by function
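`reorder()` accepts a summary function such as `mean` or `median` as a third argument, which determines the value used to order the levels.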
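As a sketch of reordering by a chosen function, the example below orders the continents by median life expectancy, which matches the line drawn inside each box. The names `by_median_life`, `group_medians`, and `p_median` are illustrative.

```{r}
library(ggplot2)
library(gapminder)

# Reorder the continent levels by the median of lifeExp instead of the default mean
by_median_life <- reorder(gapminder$continent, gapminder$lifeExp, FUN = median)
group_medians <- tapply(gapminder$lifeExp, by_median_life, median)
group_medians

# The same reordering used directly inside aes()
p_median <- ggplot(gapminder,
                   aes(x = reorder(continent, lifeExp, FUN = median),
                       y = lifeExp)) +
  geom_boxplot()

p_median
```

Ordering by the median tends to suit boxplots, since the median is the statistic the plot displays; for skewed variables the mean and median can give different orders.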
::: {.callout-tip title='Practice'}
- Create the boxplot showing the distribution of GDP per capita for each continent, like you did in practice question 2. This time, change the *x*-axis variable to reorder the boxes according to `gdpPercap`.
```{r eval = FALSE}
## Type and view your answer:
q3 <- "YOUR ANSWER HERE"
q3
```
:::
### Adding data points with `geom_jitter()`
Boxplots give us a very high-level summary of the distributions and do not show the actual life expectancy values for each country-year in the dataset. One way to display the distribution of individual data points is to plot an additional layer on top of the boxplot. We can do this by adding a `geom_point()` layer.
```{r}
ggplot(gapminder,
       aes(x = reorder(continent, lifeExp),
           y = lifeExp,
           fill = continent)) +
  geom_boxplot() +
  geom_point()
```
Adding `geom_point()` has plotted all the data points for each continent on a single vertical line. That's not very useful, since points with the same life expectancy value overlap and are plotted on top of each other.
One solution is to randomly "jitter" the data points horizontally. `ggplot` allows you to do that with the `geom_jitter()` function.
```{r}
ggplot(gapminder,
       aes(x = reorder(continent, lifeExp),
           y = lifeExp,
           fill = continent)) +
  geom_boxplot() +
  geom_jitter()
```
You can also control the width of the jitter with the `width` argument and specify the transparency of the data points with `alpha`.
```{r}
ggplot(gapminder,
       aes(x = reorder(continent, lifeExp),
           y = lifeExp,
           fill = continent)) +
  geom_boxplot() +
  geom_jitter(width = 0.25,
              alpha = 0.5)
```
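Because the jitter is random, the plot changes slightly each time it is drawn, and `geom_jitter()` also adds a small amount of vertical noise by default. The sketch below is one possible refinement, not part of the original lesson: it uses `position_jitter()` with `height = 0` so the plotted life expectancy values stay exact, a `seed` (the value 1 is arbitrary) so the plot is reproducible, and `outlier.shape = NA` so outliers are not drawn twice. The name `p_jitter` is illustrative.

```{r}
library(ggplot2)
library(gapminder)

p_jitter <- ggplot(gapminder,
                   aes(x = reorder(continent, lifeExp),
                       y = lifeExp,
                       fill = continent)) +
  # Hide the boxplot's own outlier points; the jittered layer already shows them
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only, with a fixed seed for reproducibility
  geom_point(position = position_jitter(width = 0.25, height = 0, seed = 1),
             alpha = 0.5)

p_jitter
```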
::: {.callout-note title='Key Point'}
Boxplots have the limitation that they summarize the data into five numbers: the 1st quartile, the median (the 2nd quartile), the 3rd quartile, and the upper and lower whiskers. By doing this, we might miss important characteristics of the data. One way to avoid this is by showing the data with points.
:::
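To see how much a boxplot compresses the data, the summary values behind each box can be tabulated next to the number of observations they stand for. This is a minimal sketch using {dplyr}; the name `box_summary` and its column names are illustrative.

```{r}
library(dplyr)
library(gapminder)

# Quartiles per continent, alongside the number of observations summarized
box_summary <- gapminder %>%
  group_by(continent) %>%
  summarise(n = n(),
            q1 = quantile(lifeExp, 0.25, names = FALSE),
            median = median(lifeExp),
            q3 = quantile(lifeExp, 0.75, names = FALSE))

box_summary
```

Each row condenses all the country-year observations for a continent into a handful of numbers, which is why features such as clusters or gaps in the data cannot be seen in the box alone.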
::: {.callout-tip title='Practice'}
- Create the boxplot showing the distribution of GDP per capita for each continent, like you did in practice question 3. Then add a layer of jittered points with `geom_jitter()`.
```{r eval = FALSE}
## Type and view your answer:
q4 <- "YOUR ANSWER HERE"
q4
```
- Adapt your answer to question 4 to make the points more transparent and change the width of the jitter to an appropriate value.
```{r eval = FALSE}
## Type and view your answer:
q5 <- "YOUR ANSWER HERE"
q5
```
:::
::: {.callout-note title='Challenge'}
- Building on the boxplot of life expectancy per continent from the previous example, use the `labs()` function to edit the text on your plot:
  - set the plot title to "Variation in life expectancy across continents (1952-2007)",
  - change the x-axis label to "Continent", and
  - change the y-axis label to "Life expectancy (years)".
```{r eval = FALSE}
## Type and view your answer:
q6 <- "YOUR ANSWER HERE"
q6
```
:::
### Wrap up
Side-by-side boxplots provide us with a way to compare the distribution of a continuous variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes.
To study the spread of a continuous variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points.
### Learning Outcomes
1. You can plot a boxplot to visualize the distribution of continuous data using **`geom_boxplot()`**.
2. You can reorder side-by-side boxplots with the **`reorder()`** function.
3. You can add a layer of individual data points on a boxplot using **`geom_jitter()`**.
### References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Ismay, Chester, and Albert Y. Kim. 2022. *A ModernDive into R and the Tidyverse*. <https://moderndive.com/>.
`r .tgc_license()`