-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathfoundations_ls04_data_dive_ebola_sierra_leone.qmd
670 lines (426 loc) · 27.7 KB
/
foundations_ls04_data_dive_ebola_sierra_leone.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
---
title: 'Data dive: Ebola in Sierra Leone'
output:
html_document:
number_sections: true
toc: true
toc_float: true
css: !expr here::here("global/style/style.css")
highlight: kate
editor_options:
markdown:
wrap: 100
canonical: true
chunk_output_type: console
---
```{r, include = FALSE, warning = FALSE, message = FALSE}
## Load packages
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, knitr, here)
knitr::opts_chunk$set(warning = F, message = F,
class.source = "tgc-code-block", error = T)
## Source functions
source(here("global/functions/misc_functions.R"))
## We secretly import the dataset from the data folder
ebola_sierra_leone <- read_csv(here("data/ebola_sierra_leone.csv"))
```
------------------------------------------------------------------------
## Learning objectives {.unlisted .unnumbered}
1. You can use RStudio's graphic user interface to import CSV data into R.
2. You can explain the concept of reproducibility.
3. You can use the `nrow()`, `ncol()` and `dim()` functions to get the dimensions of a dataset, and the `summary()` function to get a summary of the dataset's variables.
4. You can use `vis_dat()`, `inspect_num()` and `inspect_cat()` to obtain visual summaries of a dataset.
5. You can inspect a numeric variable:
- with the summary functions `mean()` , `median()`, `max()`, `min()`, `length()` and `sum()`;
- with esquisse-generated ggplot2 code.
6. You can inspect a categorical variable:
- with the summary functions `table()` and `janitor::tabyl()`;
- with the graphical functions `barplot()` and `pie()`.
## Introduction
With your newly-acquired knowledge of functions and objects, you now have the basic building blocks required to do simple data analysis in R. So let's get started. The goal is to start working with data as quickly as possible, even before you feel ready.
Here you will analyze a dataset of confirmed and suspected cases of Ebola hemorrhagic fever in Sierra Leone in May and June of 2014 (Fang et al., 2016). The data is shown below:
```{r render = head_10_rows, echo = F}
ebola_sierra_leone
```
You will import and explore this dataset, then use R to answer the following questions about the outbreak:
- **When was the first case reported?**
- **What was the median age of those affected?**
- **Had there been more cases in men or women?**
- **What district had had the most reported cases?**
- **By the end of June 2014, was the outbreak growing or receding?**
## Script setup
First, open a new script in RStudio with `File > New File > R Script`. (If you are on RStudio, you can open up any of your previously-created projects.)
![](images/new_r_script.jpg){width="481"}
Next, save the script with `File > Save As` or press `Command`/`Control` + `S` to bring up the Save File dialog box. Save the file with the name "ebola_analysis" or something similar
::: {.callout-note title='Side Note'}
**Empty your environment at the start of the analysis**
When you start a new analysis, your R environment should usually be empty. Verify this by opening the *Environment* tab; it should say "Environment is empty". If instead, it shows some previously-loaded objects, it is recommended to restart R by going to the menu option `Session > Restart R`
:::
### Header
Add a title, name and date to the start of the script, as code comments. This is generally good practice for writing R scripts, as it helps give you and your collaborators context about your script. Your header may look like this:
```{r eval = F}
## Ebola Sierra Leone analysis
## John Sample-Name Doe
## 2024-01-01
```
### Packages
Next, use the `p_load()` function from {pacman} to load the packages you will be using. Put this under a section header called "Load packages", with four hyphens, as shown below:
```{r}
## Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
tidyverse, # meta-package
inspectdf,
plotly,
janitor,
visdat,
esquisse
)
```
::: {.callout-note title='Reminder'}
Remember that the *full signifier* of a function includes both the package name and the function name, `package::function()`. This full signifier is handy if you want to use a function before you have loaded its source package. This is the case in the code chunk above: we want use `p_load()` from {pacman} without formally loading the {pacman} package, so we type `pacman::p_load()`
We could also first load {pacman} before using the p_load function:
```{r eval = F}
library(pacman) # first load {pacman}
p_load(tidyverse) # use `p_load` from {pacman} to load other packages
```
(Also recall that the benefit of `p_load()` is that it automatically installs a package if it is not yet installed. Without `p_load()`, you have to first install the package with `install.packages()` before you can load it with `library()`.)
:::
## Importing data into R
Now that the needed packages are loaded, you should import the dataset.
::: {.callout-note title='Side Note'}
**About the Ebola dataset**
The data you will be working on contains a sample of patient information from the 2014-2016 Ebola outbreak in Sierra Leone. It comes from a research paper which analyzed the transmission dynamics of that outbreak. Key variables include the `status` of a case, whether the case was "confirmed" or "suspected"; the `date_of_onset`, when Ebola-like symptoms arose in a patient; and the `date_of_sample`, when the test sample was taken. To learn more about these data, visit the source publication here: [bit.ly/ebola-data-source](https://bit.ly/ebola-data-source){target="_blank"}. Or search the following DOI on DOI.org: 10.1073/pnas.1518587113.
:::
Go to [bit.ly/view-ebola-data](https://bit.ly/view-ebola-data){target="_blank"} to view the dataset you will be working on. Then click the download icon at the top to download it to your computer.
![](images/download_data_ebola.png){width="674"}
You can leave the dataset in your downloads folder, or move it to somewhere more respectable; the upcoming steps will work independent of where the data is stored. In the next lesson, you will learn how to organize your data analysis projects properly, and we will think about the ideal folder setup for storing data.
------------------------------------------------------------------------
::: {.callout-note title='RStudio Cloud'}
NOTE: If you are using RStudio Cloud, you need to upload your dataset to the cloud. Do this in the "Files" tab by clicking on the "Upload" button.
![](images/rstudio_cloud_upload_button.png){width="500"}
:::
------------------------------------------------------------------------
Next, on the RStudio menu, go to `File > Import Dataset > From Text (readr)`. ![](images/import_data.jpg){width="421"}
Browse through the computer's files and navigate to the downloaded dataset. Click to open it. You should see an import dialog box like this:
![](images/ebola_dat_import_dialog.jpg)
Leave all the import settings at the default values; simply click on "Import" at the bottom; this should load the dataset into R. You can tell this by looking at your environment pane, which should now feature an object called "ebola_sierra_leone" or something similar:
![](images/environment_with_ebola_data.png){width="507"}
RStudio should also have called the `View()` function on your dataset, so you should see a familiar spreadsheet view of this data:
![](images/view_ebola_data.png){width="576"}
Now take a look at your console. Do you observe that your actions in the graphical user interface actually triggered some R code to be run? Copy the line of code that includes the `read_csv()` function, leaving out the `>` symbol.
![](images/copy_import_code.jpg){width="613"}
Paste the copied code into your R script, and label this section "Load data". This may look something like the below (the file path inside quotes will differ from computer to computer.
```{r eval = F}
## Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")
```
::: {.callout-note title='Recap'}
Nice work so far!
Your R script should look similar to this:
```{r eval = F}
## Ebola Sierra Leone analysis
## John Sample-Name Doe
## 2024-01-01
## Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
tidyverse,
inspectdf,
plotly,
janitor,
visdat
)
## Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")
```
:::
## Intro to reproducibility
Now that the code for importing data is in your R script, you can easily rerun this script anytime to reimport the dataset; there will be no need to redo the manual point-and-click procedure for data import.
Try restarting R and rerunning the script now. Save your script with `Control/Command` + `s` , then *restart* R with the RStudio Menu, at `Session > Restart R`. On RStudio Cloud, the menu option looks like this:
![](images/restart_r_session.png){width="204"}
If restarting is successful, your console should print this message:
![](images/restarting_r_session.png){width="485"}
You should also see the phrase "Environment is empty" in the Environment tab, indicating that the dataset you imported is no longer stored by R---you are starting with a fresh workspace.
![](images/environment_is_empty.png){width="481"}
To re-run your script, use `Command/Control` + `a` to highlight all the code, then `Command/Control` + `Enter` to run it.
If this worked, congratulations; you have the beginnings of your first "reproducible" analysis script!
::: {.callout-note title='Vocab'}
**What does "reproducible" mean?**
When you do things with code rather than by pointing and clicking, it is easy for anyone to re-run, or *reproduce* these steps, by simply re-running your script.
While you can use RStudio's graphical user interface to point-and-click your way through the data import process, you should always copy the relevant code to your script so that your script remains a reproducible record of all your analysis steps.
Of course, your script so far is not yet *entirely* reproducible, because the file path for the dataset (the one that looks like this: "...intro-to-data-analysis-with-r/ch01_getting_started/data...") is specific to just your computer. Later on we will see how to use relative file paths, so that the code for importing data can work on anyone's computer.
:::
::: {.callout-caution title='Watch Out'}
If your environment was not empty after restarting R, it means you skipped a step in a previous lesson. Do this now:
- In the RStudio Menu, go to `Tools > Global Options` to bring up RStudio's options dialog box.
- Then go to `General > Basic`, and **uncheck** the box that says "Restore .RData into workspace at startup".
- For the option, "save your workspace to .RData on exit", set this to "Never".
![](images/dont_restore_rdata.png){width="542"}
:::
## Quick data exploration
Now let's walk through some basic steps of data exploration---taking a broad, bird's eye look at the dataset. You should put this section under a heading like "Explore data" in your script.
To view the top and bottom 6 rows of the dataset, you can use the `head()` and `tail()` functions:
```{r}
## Explore data ----
head(ebola_sierra_leone)
tail(ebola_sierra_leone)
```
To view the whole dataset, use the `View()` function.
```{r eval = F}
View(ebola_sierra_leone)
```
This will again open a familiar spreadsheet view of the data:
![](images/view_ebola_data.png){width="504"}
You can close this tab and return to your script.
------------------------------------------------------------------------
The functions `nrow()`, `ncol()` and `dim()` give you the dimensions of your dataset:
```{r}
nrow(ebola_sierra_leone) # number of rows
ncol(ebola_sierra_leone) # number of columns
dim(ebola_sierra_leone) # number of rows and columns
```
::: {.callout-note title='Reminder'}
If you're not sure what a function does, remember that you can get function help with the question mark symbol. For example, to get help on the `ncol()` function, run:
```{r}
?ncol
```
:::
------------------------------------------------------------------------
Another often-helpful function is `summary()`:
```{r}
summary(ebola_sierra_leone)
```
As you can see, for numeric columns in your dataset, `summary()` gives you the minimum value, the maximum value, the mean, median and the 1st and 3rd [quartiles](https://www.mathsisfun.com/data/quartiles.html).
For character columns it gives you just the length of the column (the number of rows), the "class" and the "mode". We will discuss what "class" and "mode" mean later.
### `vis_dat()`
The `vis_dat()` function from the {visdat} package is a wonderful way to quickly visualize the data types and the missing values in a dataset. Try this now:
```{r dpi = 350}
vis_dat(ebola_sierra_leone)
```
From this figure, you can quickly see the character, date and numeric data types, and you can note that age is missing for some cases.
### `inspect_cat()` and `inspect_num()`
Next, `inspect_cat()` and `inspect_num()` from the {inspectdf} package give you visual summaries of the distribution of variables in the dataset.
If you run `inspect_cat()` on the data object, you get a tabular summary of the [categorical](https://psyteachr.github.io/glossary/c.html?q=categor#categorical) variables in the dataset, with some information hidden in the `levels` column (later you will learn how to extract this information).
```{r}
inspect_cat(ebola_sierra_leone)
```
But the magic happens when you run `show_plot()` on the result from `inspect_cat()`:
```{r}
## store the output of `inspect_cat()` in `cat_summary`
cat_summary <- inspect_cat(ebola_sierra_leone)
## call the `show_plot()` function on that summmary.
show_plot(cat_summary)
```
You get a wonderful figure showing the distribution of all categorical and date variables!
::: {.callout-note title='Side Note'}
You could also run:
```{r}
show_plot(inspect_cat(ebola_sierra_leone))
```
:::
From this plot, you can quickly tell that most cases are in Kailahun, and that there are more cases in women than in men ("F" stands for "female").
One problem is that in this plot, the smaller categories are not labelled. So, for example, we are not sure what value is represented by the white section for "status" at the bottom right. To see labels on these smaller categories, you can turn this into an interactive plot with the `ggplotly()` function from the {plotly} package.
```{r eval = F}
cat_summary_plot <- show_plot(cat_summary)
ggplotly(cat_summary_plot)
```
Wonderful! Now you can hover over each of the bars to see the proportion of each bar section. For example you can now tell that 9% (0.090) of the cases have a suspected status:
![](images/plotly_inspect_cat.png){width="656"}
::: {.callout-note title='Reminder'}
The assignment arrow, `<-`, can be written with the RStudio shortcut **`alt`** + **`-`** (**alt** AND **minus**) on Windows or **`option`** + **`-`** (**option** AND **minus**) on macOS.
:::
------------------------------------------------------------------------
You can obtain a similar plot for the numerical (continuous) variables in the dataset with `inspect_num()`. Here, we show all three steps in one go.
```{r eval = F}
num_summary <- inspect_num(ebola_sierra_leone)
num_summary_plot <- show_plot(num_summary)
ggplotly(num_summary_plot)
```
This gives you an overview of the numerical columns, `age` and `id`. (Of course, the distribution of the `id` variable is not meaningful.)
You can tell that individuals aged 35 to 40 (mid-point 37.5) are the largest age group, making up 13.8% (0.1377...) of the cases in the dataset.
## Analyzing a single numeric variable
Now that you have a sense of what the entire dataset looks like, you can isolate and analyze single variables at a time---this is called *univariate analysis*.
Go ahead and create a new section in your script for this univariate analysis.
```{r eval = F}
## Univariate analysis, numeric variables ----
```
Let's start by analyzing the numeric `age` variable.
### Extract a column vector with `$`
To extract a single variable/column from a dataset, use the dollar sign, `$` operator:
```{r}
ebola_sierra_leone$age # extract the age column in the dataset
```
::: {.callout-note title='Vocab'}
This list of values is called a *vector* in R. A vector is a kind of data structure that has elements of one *type*. In this case, the type is "numeric". We will formally introduce you to vectors and other data structures in a future chapter. In this lesson, you can take "vector" and "variable" to be synonyms.
:::
### Basic operations on a numeric variable
To get the mean of these ages, you could run:
```{r}
mean(ebola_sierra_leone$age)
```
But it seems we have a problem. R says the mean is `NA`, which means "not applicable" or "not available". This is because there are some missing values in the vector of ages. (Did you notice this when you printed the vector?) By default, R cannot find the mean if there are missing values. To ignore these values, use the argument `na.rm` (which stands for "NA remove") setting it to `T`, or `TRUE`:
```{r}
mean(ebola_sierra_leone$age, na.rm = T)
```
Great! This need to remove the `NA`s before computing a statistic applies to many functions. The `median()` function for example, will also return `NA` by default if it is called on a vector with any `NA`s:
```{r}
median(ebola_sierra_leone$age) # does not work
```
```{r}
median(ebola_sierra_leone$age, na.rm = T) # works
```
------------------------------------------------------------------------
`mean` and `median` are just two of many R functions that can be used to inspect a numerical variable. Let's look at some others.
But first, we can assign the age vector to a new object, so you don't have to keep typing `ebola_sierra_leone$age` each time.
```{r}
age_vec <- ebola_sierra_leone$age # assign the vector to the object "age_vec"
```
Now run these functions on `age_vec` and observe their outputs:
```{r}
sd(age_vec, na.rm = T) # standard deviation
max(age_vec, na.rm = T) # maximum age
min(age_vec, na.rm = T) # minimum age
summary(age_vec) # min, max, mean, quartiles and NAs
length(age_vec) # number of elements in the vector
sum(age_vec, na.rm = T) # sum of all elements in the vector
```
Do not feel intimidated by the long list of functions! You should not have to memorize them; rather you should feel free to Google the function for whatever operation you want to carry out. You might search something like "what is the function for standard deviation in R". One of the first results should lead you to what you need.
### Visualizing a numeric variable
Now let's create a graph to visualize the age variable. The two most common graphics for inspecting the distribution of numerical variables are [histograms](https://www.mathsisfun.com/data/histograms.html) (like the output of the `inspect_num()` function you saw earlier) and [boxplots](https://www.mathsisfun.com/data/quartiles.html).
R has built-in functions for these:
```{r out.width="70%"}
hist(age_vec)
```
```{r out.width="70%"}
boxplot(age_vec)
```
Nice and easy!
Graphical functions like boxplot() and hist() are part of R's base graphics package. These functions are quick and easy to use, but they do not offer a lot of flexibility, and it is difficult to make beautiful plots with them. So most people in the R community use an extension package, {ggplot2}, for their data visualization.
In this course, we'll use ggplot indirectly; by using the {esquisse} package, which provides a user-friendly interface for creating ggplot2 plots.
The workhorse function of the {esquisse} package is `esquisser()`, and this function takes a single argument---the dataset you want to visualize. So we can run:
```{r eval = FALSE}
esquisser(ebola_sierra_leone)
```
This should bring a graphic user interface that you can use to plot different variables. To visualize the age variable, simply drag `age` from the list of variables into the x axis box:
![](images/23.30.37.jpg){width="441"}
When `age` is in the x axis box, you should automatically get a histogram of ages:
![](images/23.32.03.jpg){width="554"}
You can change the plot type by clicking on the "Histogram" button and selecting one of the other valid plot types. Try out the boxplot, violin plot and density plot and observe the outputs.
![](images/23.39.38.jpg){width="215"}
When you are done creating a plot with {esquisse}, you should copy the code that was created by clicking on the "Code" button at the bottom right then "Copy to clipboard":
![](images/23.48.18.jpg){width="403"}
Now, paste that code into your script, and make sure you can run it from there. The code should look something like this:
```{r eval = F}
ggplot(ebola_sierra_leone) +
aes(x = age) +
geom_histogram(bins = 30L, fill = "#112446") +
theme_minimal()
```
By copying the generated code into your script, you ensure that the data visualization you created is fully reproducible.
::: {.callout-note title='Pro Tip'}
{esquisse} can only create fairly simple graphics, so when you want to make highly customized or complex plots, you will need to learn how to write {ggplot} code manually. This will be the focus of a later course.
:::
You should also test out the other tabs on the bottom toolbar to see what they do: Labels & Title, Plot options, Appearance and Data.
::: {.callout-note title='Challenge'}
**Easy bivariate and multivariate plots**
In this lesson we are focusing on univariate analysis: exploring and visualizing one variable at a time. But with esquisse; it is *so* easy to make a bivariate or multivariate plot, so you can already get your feet wet with this.
Try the following plots:
- Drag `age` to the X box and `sex` to the Y box.
- Drag `age` to the X box, `sex` to the Y box, and `sex` to the fill box.
- Drag `age` to the X box and `district` to the Y box.
:::
## Analyzing a single categorical variable
Next, let's look at a categorical variable, the districts of reported cases:
```{r}
## Univariate analysis, categorical variables ----
ebola_sierra_leone$district
```
Sorry for printing that very long vector!
### Frequency tables
You can use the `table()` function to create a frequency table of a categorical variable:
```{r}
table(ebola_sierra_leone$district)
```
You can see that most cases are in Kailahun and Kenema.
------------------------------------------------------------------------
`table()` is auseful "base" function. But there is a better function for creating frequency tables, called `tabyl()`, from the {janitor} package.
To use it, you supply the name of your data frame as the first argument, then the name of variable to be tabulated:
```{r}
tabyl(ebola_sierra_leone, district)
```
As you can see, `tabyl()` gives you both the counts and the percentage proportions of each value. It also has some other attractive features you will see later.
::: {.callout-note title='Pro Tip'}
You can also easily make cross-tabulations with `tabyl()`. Simply add additional variables separated by a comma. For example, to create a cross-tabulation by district and sex, run:
```{r}
tabyl(ebola_sierra_leone, district, sex)
```
The output shows us that there were 0 women in the Bo district, 2 men in the Bo district, 91 women in the Kailahun district, and so on.
:::
### Visualizing a categorical variable
Now, let's try to visualize the `district` variable. As before, the best way to do this is with the `esquisser()` function from {esquisse}. Run this code again:
```{r eval = FALSE}
esquisser(ebola_sierra_leone)
```
Then drag the `district` variable to the X axis box:
![](images/00.33.04.jpg)
You should get a bar chart showing the count of individuals across districts. Copy the generated code and paste it into your script.
## Answering questions about the outbreak
With the functions you have just learned, you have the tools to answer the questions about the Ebola outbreak that were listed at the top. Give it a go. Attempt these questions on your own, then look at the solutions below.
- **When was the first case reported? (Hint: look at the date of sample)**
- **As at the end of June 2014, which 10-year age group had had the most cases?**
- **What was the median age of those affected?**
- **Had there been more cases in men or women?**
- **What district had had the most reported cases?**
- **By the end of June 2014, was the outbreak growing or receding?**
------------------------------------------------------------------------
**Solutions**
- **When was the first case reported?**
```{r}
min(ebola_sierra_leone$date_of_sample)
```
We don't have the date of report, but the first "date_of_sample" (when the Ebola test sample was taken from the patient) is May 23rd. We can use this as a proxy for the date of first report.
- **What was the median age of cases?**
```{r}
median(ebola_sierra_leone$age, na.rm = T)
```
The median age of cases was 35.
- **Are there more cases in men or women?**
```{r out.width="70%"}
tabyl(ebola_sierra_leone$sex)
```
As seen in the table, there were more cases in women. Specifically, 57% of cases are of women.
- **What district has had the most reported cases?**
```{r out.width="70%"}
tabyl(ebola_sierra_leone$district)
## We can also plot the following chart (generated with esquisse)
ggplot(ebola_sierra_leone) +
aes(x = district) +
geom_bar(fill = "#112446") +
theme_minimal()
```
As seen, the Kailahun district had the majority of cases.
- **By the end of June 2014, was the outbreak growing or receding?**
For this, we can use esquisse to generate a bar chart that shows a count of cases in each day. Simply drag the `date_of_onset` variable to the x axis. The output code from esquisse should resemble the below:
```{r out.width="70%"}
ggplot(ebola_sierra_leone) +
aes(x = date_of_onset) +
geom_histogram(bins = 30L, fill = "#112446") +
theme_minimal()
```
Great! But it is debatable whether the outbreak was growing or receding at the end of June 2014; a precise trend is not really clear!
## Haven't had enough?
If you would like to practice some of the methods and functions you learned on a similar dataset, try downloading the data that is stored on this page: <https://bit.ly/view-yaounde-covid-data>
That dataset is in the form of an Excel spreadsheet, so when you are importing the dataset with RStudio, you should use the "From Excel" option (File \> Import Dataset \> From Excel).
This dataset contains the results of a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from here: [go.nature.com/3R866wx](https://go.nature.com/3R866wx){target="_blank"}
## Wrapping up
Congratulations! You have now taken your first baby steps in analyzing data with R: you imported a dataset, explored its structure, performed basic univariate analysis and visualization on its numeric and categorical variables, and you were able to answer important questions about the outbreak based on this.
Of course, this was only a *sneak peek* of the data analysis process---a lot was left out. Hopefully, though, this sneak peek has gotten you a bit excited about what you can do with R. And hopefully, you can already start to apply some of these to your own datasets. The journey is only beginning! See you soon.
<!-- Only team members who contributed "substantially" to a specific lesson should be listed here -->
<!-- See https://tinyurl.com/icjme-authorship for notes on "substantial" contribution-->
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Barnier, Julien. “Introduction à R Et Au Tidyverse.” Partie 13 Diffuser et publier avec rmarkdown, May 24, 2022. <https://juba.github.io/tidyverse/13-rmarkdown.html>.
- Yihui Xie, J. J. Allaire, and Garrett Grolemund. “R Markdown: The Definitive Guide.” Home, April 11, 2022. <https://bookdown.org/yihui/rmarkdown/>.
<!-- (Chicago format. You can use https://www.citationmachine.net) -->
`r tgc_license()`