---
title: 'Histograms with {ggplot2}'
output:
  html_document:
    number_sections: true
    toc: true
    toc_float: true
    css: !expr here::here("global/style/style.css")
    highlight: kate
---
```{r, include = FALSE, warning = FALSE, message = FALSE}
## Load packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, knitr, here, patchwork)
## Source functions
source(here("global/functions/lesson_functions.R"))
## Source autograder script quietly
source(here("lessons/ls04_histograms_autograder.R"))
```
## Histograms with {ggplot2}
```{r echo = F, include=F, eval=F}
ggplot(data = malidd,
       mapping = aes(x = muac_cm)) +
  geom_histogram(fill = "seagreen",
                 color = "white",
                 bins = 22,
                 mapping = aes(y = after_stat(density))) +
  geom_density(fill = "seagreen4",
               color = "gray30",
               alpha = 0.25,
               linewidth = 1,
               bw = 0.4) +
  labs(x = "Mid-upper arm circumference (cm)",
       y = "Probability density") +
  theme_light()
```
## Learning Objectives
By the end of this lesson, you will be able to:
1. Plot a histogram to visualize the distribution of continuous variables using **`geom_histogram()`**.
2. Adjust the number or size of bins on a histogram with the **`bins`** or **`binwidth`** arguments.
3. Shift and align bins on a histogram with the **`boundary`** argument.
4. Set bin boundaries to a sequence of values with the **`breaks`** argument.
## Introduction
A histogram is a plot that visualizes the *distribution* of a numerical variable as follows:
1. We first cut up the x-axis into a series of *bins*, where each bin represents a range of values.
2. For each bin, we count the number of observations that fall in the range corresponding to that bin.
3. Then for each bin, we draw a bar whose height marks the corresponding count.
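The three steps above can be sketched in base R, without {ggplot2}. This is only an illustration: the `heights` vector holds made-up values (not the `malidd` data), and the bin edges are an arbitrary choice.

```{r}
## Hypothetical heights in cm (illustrative values, not from malidd)
heights <- c(52, 58, 61, 63, 66, 68, 70, 71, 74, 79, 83, 91)

## Step 1: cut the x-axis into bins of width 10, starting at 50
bin_edges <- seq(50, 100, by = 10)

## Step 2: count the observations that fall in each bin
## right = FALSE makes the bins [50,60), [60,70), and so on
bin_counts <- table(cut(heights, breaks = bin_edges, right = FALSE))
bin_counts

## Step 3: draw one bar per bin, with height equal to the count
barplot(bin_counts, space = 0)
```

`geom_histogram()` carries out the same binning and counting for us, and chooses the bins automatically unless we say otherwise.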
## Packages
```{r}
pacman::p_load(tidyverse,
here)
```
## Childhood diarrheal diseases in Mali
We will visualize distributions of numerical variables in the `malidd` data frame, which we've seen in previous lessons.
```{r message = FALSE}
## Import data from CSV
malidd <- read_csv(here::here("data/clean/malidd.csv"))
```
::: {.callout-note title='Recap'}
These data were collected as part of an observational study of acute diarrhea in children aged 0-59 months. The study was conducted in Mali in early 2020. The dataset records demographic and clinical information for 150 patients.
:::
```{r render = .reactable_10_rows}
## View first few rows of the data frame
head(malidd)
```
The dataframe has 21 variables, many of which are continuous, like `height_cm`, `viral_load`, and `freqrespi`.
## Basic histograms with `geom_histogram()`
Now let's use {ggplot2} to plot the distribution of children's heights, which are recorded in the `height_cm` column of `malidd`.
The `geom_*()` function used for histograms is **`geom_histogram()`**.
```{r}
## Simple histogram showing the distribution of height_cm
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram()
```
::: {.callout-note title='Side Note'}
If we don't adjust the bins in `geom_histogram()`, we get a warning message. You can ignore this message for now; you will learn how to customize bins in the next section.
:::
In the previous histogram, it's hard to see where each bin starts and ends, since everything is one big amorphous blob. So let's add borders around the bins:
```{r}
## Set border color to "white"
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white")
```
We now have an easier time matching each bin to its range of values.
We can also vary the color of the bars by setting the `fill` argument:
```{r}
## Set fill color to "steelblue"
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue")
```
Now that we can see the bars more clearly, let's unpack the resulting histogram. Some questions we might want to answer are:
1. What are the smallest and largest values?
2. What is the "center" or "most typical" value?
3. How do the values spread out?
4. What are frequent and infrequent values?
We can see that heights range from about 50 to 105 cm. The center is around 70 cm, and most patients fall in the 60-80 cm range, with very few below 55 cm or above 90 cm. Observe that the histogram is roughly bell-shaped, which suggests that the variable is approximately *normally distributed*.
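Readings taken from a histogram by eye can be cross-checked with a few summary functions. The sketch below uses simulated heights (the `sim_heights` vector is hypothetical, not the `malidd` data) and base R functions only.

```{r}
## Simulated heights in cm (hypothetical values, not the malidd data)
set.seed(1)
sim_heights <- rnorm(150, mean = 70, sd = 10)

## 1. Smallest and largest values
height_range <- range(sim_heights)
height_range

## 2. A "typical" value
height_median <- median(sim_heights)
height_median

## 3. Spread: the interval that holds the middle 90% of values
height_middle_90 <- quantile(sim_heights, probs = c(0.05, 0.95))
height_middle_90
```

The same calls could be run on a real column such as `malidd$height_cm` to confirm what the plot shows.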
::: {.callout-tip title='Practice'}
- Plot a histogram showing the distribution of age (`age_months`) in `malidd`. Make the borders and fill of the bars "seagreen", and reduce opacity to 40%.
```{r eval=F, include=F}
## Create plot
```
```{r eval=F, include=FALSE}
## When you think you have the right answer, submit it by replacing "YOUR ANSWER HERE" with your code, and run those lines.
q1 <- "YOUR ANSWER HERE"
## Make sure that "q1" appears in your Environment tab.
```
```{r eval=F, include=FALSE}
## Check your answer by running this check function (no inputs required). The output will tell you if you answered correctly or not.
.CHECK_q1()
```
```{r eval=F, include=FALSE}
## You can ask for a hint by running this hint function (no inputs required).
.HINT_q1()
```
- Building on your code for the previous plot, modify the axis titles to "Age (months)" and "Number of children", respectively.
```{r include=F}
## Change axis titles
```
```{r eval=F, include=FALSE}
## Submit your answer:
q2 <- "YOUR ANSWER HERE"
## Ask for a hint:
.HINT_q2()
## Check your answer:
.CHECK_q2()
```
:::
## Adjusting bins in a histogram
![Histograms plotting the same variable with different bin settings.](images/hist_varied_bins.png)
After running code in previous examples, we got a histogram as well as a warning message about bins and bin width. The warning message is telling us that the histogram was constructed using `bins = 30` for 30 equally spaced bins.
```{r}
## Warning message tells us to change the default of 30 bins
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue")
```
Unless you override this default number of bins with a number you specify, R will keep giving this message.
We can change the number of bins to another value using one of these three arguments to `geom_histogram()`:
1. Set the number of bins with **`bins`**
2. Set the width of the bins with **`binwidth`**
3. Set bin boundaries with **`breaks`**
### Set the number of bins with `bins`
Using the first method, we specify how many bins to cut the x-axis into by setting `bins = INTEGER`:
```{r}
## Try different numbers of bins
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 5,
color = "white",
fill = "steelblue")
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 20,
color = "white",
fill = "steelblue")
ggplot(data = malidd ,
mapping = aes(x = height_cm)) +
geom_histogram(bins = 50,
color = "white",
fill = "steelblue")
```
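To see what `bins` changes behind the scenes, we can inspect the data that {ggplot2} computes for each bar with `layer_data()`. This is a sketch on simulated data: `sim_df` is a hypothetical stand-in for `malidd`.

```{r}
library(ggplot2)

## Simulated data frame (hypothetical; stands in for malidd)
set.seed(42)
sim_df <- data.frame(height_cm = rnorm(150, mean = 70, sd = 10))

## Build the same histogram with two different values of `bins`
p_few  <- ggplot(sim_df, aes(x = height_cm)) + geom_histogram(bins = 5)
p_many <- ggplot(sim_df, aes(x = height_cm)) + geom_histogram(bins = 50)

## layer_data() returns one row per bar, including its count
bars_few  <- layer_data(p_few)
bars_many <- layer_data(p_many)

## Number of bars drawn in each plot
nrow(bars_few)
nrow(bars_many)

## Each observation is counted once, whatever the number of bins
sum(bars_few$count)
sum(bars_many$count)
```

Changing `bins` only redistributes the same observations across more or fewer bars; the total count stays the same.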
::: {.callout-tip title='Practice'}
Make a histogram of frequency of respiration (`freqrespi`), which is measured in breaths per minute. Set the interior color to "indianred3", and border color to "lightgray".
Notice that with the default of 30 bins, there are some intervals for which no bar is plotted (i.e., there were no observations in that range).
Lower the number of bins until there are no empty intervals. You should choose the highest value of `bins` for which there are no empty spaces.
:::
### Set the width of bins with `binwidth`
Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the `binwidth` argument in `geom_histogram()`.
```{r}
## Try different bin widths
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 3)
```
Looking at the range of the variable can help us choose an appropriate bin width.
```{r}
range(malidd$height_cm)
```
```{r}
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 5)
```
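As a rough guide, dividing the span of the data by the bin width gives the approximate number of bins. The sketch below assumes a range of 50 to 105 cm (the values quoted for `height_cm` in this lesson); the exact number of bars that {ggplot2} draws may differ by one, depending on how the bins are aligned.

```{r}
## Assumed range of the variable, in cm
height_range <- c(50, 105)

## Candidate bin width, in cm
bin_width <- 5

## Approximate number of bins needed to cover the range
n_bins <- ceiling(diff(height_range) / bin_width)
n_bins
```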
We can use the `boundary` argument to shift the bins so that a bin edge falls on a value we choose, which aligns the bars with the x-axis breaks.
```{r}
## Set `boundary` equal to the low end of the variable
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
binwidth = 5,
boundary = 50)
```
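One way to check the effect of `boundary` is to look at the left edge of each bar, which `layer_data()` reports in the `xmin` column. This sketch uses simulated data (`sim_df` is hypothetical, not `malidd`).

```{r}
library(ggplot2)

## Simulated heights between 50 and 105 cm (hypothetical data)
set.seed(42)
sim_df <- data.frame(height_cm = runif(150, min = 50, max = 105))

p_aligned <- ggplot(sim_df, aes(x = height_cm)) +
  geom_histogram(binwidth = 5, boundary = 50)

## Left edge of each bar
left_edges <- layer_data(p_aligned)$xmin
left_edges

## Distance of each edge from the boundary, in units of bin width;
## whole numbers mean the edges sit on 50, 55, 60, and so on
edge_offsets <- (left_edges - 50) / 5
edge_offsets
```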
::: {.callout-tip title='Practice'}
Create the same `freqrespi` histogram from the last practice question, but this time set the bin width to a value that results in 18 bins. Then align the bars to the x-axis breaks by adjusting the bin boundaries.
:::
### Modify bin boundaries with `breaks`
Using the third method, we set the bin boundaries directly by passing a **numeric vector** to the `breaks` argument of `geom_histogram()`:
```{r}
## Supply a vector that covers the range of values in height_cm
ggplot(data = malidd,
mapping = aes(x = height_cm)) +
geom_histogram(color = "white",
fill = "steelblue",
breaks = seq(50, 105, 5))
```
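It can help to print the vector passed to `breaks` before plotting. The sketch below shows what `seq(50, 105, 5)` produces and how many bins it defines. The `uneven_breaks` vector is a hypothetical example, included only to show that the boundaries do not have to be evenly spaced.

```{r}
## Evenly spaced boundaries: 50, 55, ..., 105
bin_breaks <- seq(50, 105, 5)
bin_breaks

## n boundaries define n - 1 bins
length(bin_breaks) - 1

## Boundaries can also be unevenly spaced (hypothetical example)
uneven_breaks <- c(50, 60, 65, 70, 75, 80, 105)

## Width of each resulting bin
diff(uneven_breaks)
```

With uneven bins, bar heights based on raw counts can be misleading, since wider bins tend to collect more observations.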
::: {.callout-tip title='Practice'}
Plot the `freqrespi` histogram with bin breaks that range from the lowest value of `freqrespi` to the highest, with intervals of 4.
Next, adjust the x-axis scale breaks by adding a `scale_*()` function. Set the range to 24-60, with intervals of 8.
:::
## Summary
Histograms, unlike scatterplots and line graphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question.
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Ismay, Chester, and Albert Y. Kim. 2022. *A ModernDive into R and the Tidyverse*. <https://moderndive.com/>.
- Chang, Winston. 2013. *R Graphics Cookbook: Practical Recipes for Visualizing Data*. 1st edition. Beijing Köln: O'Reilly Media.
## Solutions
```{r}
.SOLUTION_q1()
.SOLUTION_q2()
.SOLUTION_q3()
.SOLUTION_q4()
.SOLUTION_q5()
```
`r .tgc_license()`