forked from laderast/data_storytelling_bdc
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-highlighting-subsets.Rmd
160 lines (116 loc) · 5.31 KB
/
05-highlighting-subsets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "05-highlighting with subsets"
output: html_notebook
---
If you've skipped ahead (and not run Part 1), run this code chunk to load the relevant data and plot:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(here)
source(here("R/init.R"))
```
## Things that will make this section easier to understand
Understanding `dplyr` and how `mutate()` and `filter()` work will help you out a lot in these slides.
If you want to learn the basics of `dplyr`, please check out R-Bootcamp Chapter 3: https://r-bootcamp.netlify.com/chapter3
## Optional: Making a Dummy Variable to color specific lines in our dataset
We can manually color a single group in our data through a two step process. For example, if we want to highlight "Breaking Bad" but not the other shows we can:
1. make a "dummy" variable called `breaking_bad` by recoding the `genre` variable to have `Y` when this variable has "Drama" in it, and `N` for the other categories.
2. manually color the lines using `scale_color_manual()` by specifying a `values()` argument.
In order to make our new variable, we will use a command called `mutate()` to search for "Breaking Bad" in the `title` variable.
### Using a dummy variable in our dataset (highlight a single show)
What if we wanted to highlight a single tv show in the data, such as `Breaking Bad`?
We can create a dummy variable in our dataset, much like `status`. In this one, we want to create a variable called `breaking_bad` where the value is `Y` (yes) where the `title` is `Breaking Bad` and `N` (no) otherwise:
We can add a new variable using what is called a `mutate()` function. It lets us calculate a new variable, based on the other values of the variable. If you've used Excel's `IF()` function, it's very similar in terms of the syntax. We first provide a condition:
```{r}
tv_shows_breaking <- tv_shows %>%
mutate(breaking_bad =
case_when(
title == "Breaking Bad" ~ "Y",
TRUE ~ "N"
))
```
## Highlighting multiple lines using `case_when`:
```
mutate(
#check whether a cell has the word "Comedy" in it
#If it does, output a "Y",
comedy = case_when(
str_detect(genres, "Comedy") ~ "Y",
#otherwise, output a "N"
TRUE = "N)
}
```
The `ifelse()` function can be read:
> If the value for `genres` contains "Comedy" THEN recode that value as "Y",
> If it doesn't contain "Comedy", THEN recode that value as "N"
Run the code below and explore the table a little bit to confirm that "Comedy" shows are coded as `Y` in our new variable, `comedy`:
```{r}
tv_shows_new <- tv_shows %>%
mutate(
#check whether a cell has the word "Comedy" in it
#If it does, output a "Y",
comedy = case_when(
str_detect(genres, "Comedy") ~ "Y",
#otherwise, output a "N"
TRUE ~ "N"
)
)
tv_shows_new
```
Now we can use `scale_color_manual()` to remap the `TRUE` and `FALSE` values.
```{r}
ggplot(tv_shows_new) +
aes(x = seasonNumber, y= av_rating, group=title,
color = comedy) +
geom_line() +
scale_color_manual(values=c("Y"="blue",
"N" = "grey")) +
scale_x_continuous(breaks=c(1:10))
```
## Optional: Highlighting a subset of the data
This is an optional section, but it shows you a specific trick in highlighting your data.
What if we wanted to highlight a specific time period in our dataset? We can actually subset the data, and plot over the original data with a different color to highlight it.
We want to specify this criteria in our `filter()` statement:
```{r}
tv_subset <- tv_shows %>%
filter(seasonNumber > 2) %>%
filter(seasonNumber < 6)
```
Now we can use this dataset to highlight this portion of the data:
```{r}
ggplot(tv_shows) +
aes(x = seasonNumber, y= av_rating, group=title) +
#color all the data - notice we pass a color argument directly in
geom_line(color = "grey") +
#color the subset - notice that pass a color argument in directly
#also notice that we specify `data` to be our `tv_subset`
geom_line(data = tv_subset, color = "blue") +
geom_vline(xintercept = 3, lty=3) +
geom_vline(xintercept = 5, lty=3) +
annotate("text", x=4, y=7, label="This is a critical\ntime period") +
labs(title = "Seasons 3 to 5 are Critical") +
theme_minimal() +
scale_x_continuous(breaks=c(1:10))
```
Now that you know how to use subsets in plotting, you can use this to highlight a group of points as well in a scatterplot as well. Here's an example:
```{r}
high_share_rated <- tv_shows %>% filter(share > 9 & av_rating > 8.5)
ggplot(tv_shows) +
aes(x = av_rating, y=share) +
geom_point(color = "grey") +
geom_point(data = high_share_rated, color = "blue")
```
Here's an example with annotations.
```{r}
high_share_rated <- tv_shows %>% filter(share > 9 & av_rating > 8.5)
ggplot(tv_shows) +
aes(x = av_rating, y=share) +
geom_point(color = "grey") +
geom_point(data = high_share_rated, color = "blue") +
labs(title = stringr::str_wrap("Only 'Breaking Bad' was high in both ratings and share")) +
geom_hline(yintercept = 9, lty=3) +
geom_vline(xintercept = 8.5, lty = 3) +
#annotate("text", x=9.1, y=15, label="Breaking\nBad") +
annotate("text", x=9.1, y=18, label= "High Share\nHigh Rating") +
annotate("text", x=6.5, y=5, label = "Low Share\nLow Rating") +
theme_minimal()
```