---
title: "Jumping essentials"
webr:
  packages: ['ggplot2', 'dplyr', 'janitor']
editor:
  mode: source
---
## The essentials
Before you jump into the water, it helps to know more about it. What is the temperature? Is it really cold or nice and warm? How high is the jump? Do you first need to jump 5 meters from a diving board, or can you already feel the water with your toes?
This first chapter covers some basic programming essentials that will make the jump easier. You can also use it as a reference for when you need to make the jump again.
### Find your info online and in documentation
R has so many functions that it is impossible to know them all by heart. Documentation and the internet are always your best friends.
`Stack Exchange` is an excellent resource. Roughly 90 to 99% of your questions about how to use R and tidyverse functions have been asked before by others. The nice thing is that, most often, the questions include a snippet of code that makes the problem reproducible. More importantly, almost all questions have been accurately answered in multiple ways.
Other resources that often come up in my search results are forums such as `Posit Community`, `Reddit`, or `GitHub` discussions and issues. These are more forum-like threads and lack the clear question-and-answer structure of Stack Exchange.
Then there are many sites that somehow scrape the internet and collect basic info. Most of the time the info is correct but too simplistic, and real issues are not tackled. These are sites like `geeksforgeeks`, `datanovia`, and `towardsdatascience`. Some have better info than others, but most of them have commercial activities and in the end want to sell you courses or get your clicks.
The good news is that **you don't have to remember any of these sites by heart. Just type your question into your favorite search engine**, ideally copy-pasting the exact error message, and you will very likely find your answer.
### R and tidyverse documentation
All functions in R and the tidyverse are well documented. Every argument is described, and the `examples` that are given are especially helpful. Packages often have additional documentation called `vignettes` that explain certain topics and the context in which to use the functions.
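A minimal sketch of how the documentation can be reached from the R console. The function `sd` and the package `grid` are only picked as illustrations; any function or installed package works the same way. How help pages are displayed may differ between a local R session and the browser.

``` {webr-r}
# open the help page of a function (both lines do the same)
?sd
help("sd")
# show only the arguments of a function and their defaults
formals(sd)
# run the examples from the bottom of a help page
example("sd")
# list the vignettes that ship with an installed package
# ("grid" is only an illustration; swap in any installed package)
vignette(package = "grid")
```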
### Style and layout
Code benefits from good readability. Just as we lay out our texts, manuscripts and Excel data files, we also need a good layout for our code.
``` {webr-r}
library(ggplot2)
# NOT VERY READABLE (but runnable )
ggplot(data=mtcars, mapping=aes (x = mpg,y = disp,
color = hp,shape = as.factor(cyl)) ) +geom_point()
```
There are multiple ways to organize your code. I try to adhere to:
- short lines (max 60 characters per line)
- indent after first line
- indent after ggplot
- each next function call aligns with the above function
- each argument aligns with the previous argument
- each ggplot layer gets its own line
- I put the x and y aesthetics for ggplot mapping on one line
Other good practices are:
- use the package name before a function, like `dplyr::mutate`
- use comments to annotate the code; anything after a `#` on a line is not executed
Here is an example of what not to do, followed by a corrected version.
``` {webr-r}
library(dplyr)
#NOT GOOD
iris %>%
as_tibble() %>% janitor::clean_names() %>%
filter(species
%in% c("setosa", "virginica")) %>%
ggplot(aes(x = sepal_length, y = sepal_width,group = petal_length, color = petal_width))+
geom_point()+ geom_line() +
theme_bw(base_size = 16)
#GOOD
iris %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  filter(species %in% c("setosa", "virginica")) %>%
  ggplot(aes(x = sepal_length, y = sepal_width,
             group = petal_length,
             color = petal_width)) +
    geom_point() +
    geom_line() +
    theme_bw(base_size = 16)
```
### Calling packages and functions
There are two ways of calling functions in R. The most straightforward one is to call the function by its name. Without any preparation, this works only for functions from base R. When you want a function from another package, you can first load the package with `library(your_favorite_package)` and then call the function with `my_favorite_function(my_argument)`. Another, preferred way is to always explicitly mention the package that the function comes from, as in `your_favorite_package::my_favorite_function(my_argument)`. It can happen that two different packages implement a function with the same name, which leads to confusion. In that case you need to be careful with `library` loading, because a function might be masked by a function with the same name from another package that was loaded later.
``` {webr-r}
#option 1
library(janitor)
clean_names(iris) %>%
  head()
#option 2 (preferred)
janitor::clean_names(iris) %>%
  head()
#which is the same as:
iris %>%
  janitor::clean_names() %>%
  head()
```
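Masking can be illustrated with base R alone. The sketch below deliberately defines its own `mean` (a made-up function, only for this demonstration) to show that the package prefix still reaches the original. The same mechanism applies between packages: for example, `dplyr::filter` masks `stats::filter` once `dplyr` is loaded.

``` {webr-r}
# base R's mean() works as expected
mean(c(1, 2, 3))
# defining a function with the same name masks the base version
mean <- function(x) "masked!"
mean(c(1, 2, 3))
# the package prefix still reaches the original function
base::mean(c(1, 2, 3))
# remove our own version so that mean() behaves normally again
rm(mean)
mean(c(1, 2, 3))
```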
## Basic R semantics
When you start using R and the tidyverse, the new language can be daunting. So here is a short primer on common semantics that are often not directly understood from the code.
I took some of these examples directly or indirectly from:
<https://uc-r.github.io/basics>
### Assignment
The most common way of assigning in R is the `<-` symbol. Although `=` works in the same way for simple assignments, R users usually reserve it for other things. I tend to use it for assigning numbers to constants, and it is used to set function arguments.
``` {webr-r}
#assignment
x <- 1
#is the same as:
x = 1
#but the <- is preferred
print(paste0("my x = ", x))
```
### Vectors and lists
A `vector` in R is a collection of items (elements) of the same type. A `list` is a collection of items that can have different types. We make a vector with `c()` and a list with `list()`. The `c` in `c()` apparently stands for `combine` [link](https://stackoverflow.com/questions/11488820/why-use-c-to-define-vector)
``` {webr-r}
#vectors
x <- c(1, 2, 3)
y <- c("aap", "noot", "mies")
x
y
```
``` {webr-r}
#lists
x <- list(1, 2, 3)
y <- list("aap", "noot", "mies", 1, c(22, 23, 25))
x
y
```
If you try to build a vector with elements of different types, R will try to adapt all of them to a single type. You can see this when you specify a vector with numbers and characters, e.g. `c(1, 2, "1", "2")`. R forces the vector to be of `character` type. While it may look handy that R does this for us, it is a dangerous feature that can let wrong inputs go unnoticed.
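A small sketch of this coercion and of how a single stray value can change a whole vector. The `readings` vector is made up for illustration.

``` {webr-r}
# numbers and characters together become a character vector
mixed <- c(1, 2, "1", "2")
mixed
class(mixed)
# the same happens silently with a single stray character value
readings <- c(10.5, 11.2, "n/a", 9.8)
class(readings)
# converting back turns the unparsable value into NA (with a warning)
as.numeric(readings)
```

Checking `class()` right after reading or building a vector is a cheap way to catch this early.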
``` {webr-r}
#other vector semantics
x <- 1:10
#is the same as
x <- c(1:10)
#is the same as
x <- c(1,2,3,4,5,6,7,8,9,10)
#you can multiply all elements of a vector at the same time
x * 3
# or:
y <- 3
x * y
# or:
x / y
# also adding y to x will add 3 to each element
x + y
# you can also extend or combine two vectors
z <- 20:25
c(x, z)
```
Lists form the basis of most other data structures. Dataframes are collections of related data with rows and columns, unique column names, and row names (or row numbers). A `data.frame` is actually a `list` of equal-length vectors. `Tibbles` are the tidyverse equivalent of dataframes, with some extra handy properties. A `list` can have named items or not.
``` {webr-r}
#as_tibble() is available after loading dplyr
library(dplyr)
#a list without named items
my_list <- list(1:10, letters[1:10], LETTERS[1:10])
#a list with named items
my_list <- list(my_numbers   = 1:10,
                my_lowercase = letters[1:10],
                my_uppercase = LETTERS[1:10])
#this almost looks like a table, it only is not in a matrix format
#turning the list into a dataframe generates a table
as.data.frame(my_list)
#which is similar to making it a tibble
as_tibble(my_list)
#when the columns are not of the same length the df or tibble
#cannot be generated
#(my_lowercase has 9 items instead of 10, so this call gives an error)
my_list_2 <- list(my_numbers   = 1:10,
                  my_lowercase = letters[1:9],
                  my_uppercase = LETTERS[1:10])
as.data.frame(my_list_2)
```
### Common semantics
R differs from other programming languages, and when you start learning it there are some rules and common practices to get used to.
### ~ (the "tilde")
``` {webr-r}
#the primary use case is to separate the left hand side
#from the right hand side in a formula
y ~ a*x + b
#the ~ is also used in the ggplot facet_wrap or facet_grid
#it can be read as "by"
# separate the ggplot "by" cyl
mtcars %>%
  select(mpg, cyl, disp) %>%
  ggplot(aes(x = mpg, y = disp)) +
    geom_point() +
    facet_wrap(~cyl)
#the tilde use in the facet_wrap is discouraged though,
#we now use vars() instead of the tilde
mtcars %>%
  select(mpg, cyl, disp) %>%
  ggplot(aes(x = mpg, y = disp)) +
    geom_point() +
    facet_wrap(vars(cyl))
```
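To make the formula use case more concrete, here is a minimal sketch with base R only. The names `my_formula` and `fit` are made up for illustration, and `lm()` is just one of many functions that accept a formula.

``` {webr-r}
# a formula is stored unevaluated: "mpg is modelled by wt"
my_formula <- mpg ~ wt
class(my_formula)
# the variables used on both sides of the tilde
all.vars(my_formula)
# many modelling functions take a formula plus a dataset
fit <- lm(my_formula, data = mtcars)
coef(fit)
```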
### + (the plus)
Apart from simple arithmetic addition, `+` is also used with the ggplot functions, where it adds layers to a ggplot.
``` {webr-r}
mtcars %>%
  select(mpg, cyl, disp) %>%
  ggplot(aes(x = mpg, y = disp)) +
    geom_point() +
    geom_line() +
    geom_boxplot() +
    labs(title = "Crazy plot")
```
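Because `+` simply adds to an existing plot, a plot can also be stored in an object and extended later. The sketch below assumes `ggplot2` is available; the object names `p` and `p_points` are made up for illustration.

``` {webr-r}
library(ggplot2)
# store the base plot in an object; it has no layers yet
p <- ggplot(mtcars, aes(x = mpg, y = disp))
# add a layer with + and store the result again
p_points <- p + geom_point()
# keep the + at the END of a line: a line that starts with +
# is treated as a new, separate expression
p_points +
  labs(title = "Points only")
```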
### %>% (the pipe)
The `%>%` is used to forward an object to another function or expression. It was first introduced in the `magrittr` package, and since R 4.1.0 base R has its own `|>` pipe. For simple pipelines the two behave the same, although they differ in some details such as the placeholder syntax. See [blogpost](http://adolfoalvarez.cl/blog/2021-09-16-plumbers-chains-and-famous-painters-the-history-of-the-pipe-operator-in-r/) for more info.
``` {webr-r}
mtcars %>%
  select(mpg, cyl, disp) %>%
  mutate(new_column = mpg*cyl) %>%
  filter(new_column > 130)
```
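As a small sketch of the base R pipe, the example below needs no packages at all. The names `piped` and `nested` are made up for illustration.

``` {webr-r}
# the native pipe needs R 4.1.0 or newer
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()
# the same calculation written as nested function calls
nested <- sum(sqrt(c(1, 4, 9)))
piped
nested
identical(piped, nested)
```

The piped version reads from top to bottom in the order in which the steps are executed, which is the main reason to prefer it over nested calls.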
### == (equal to)
The `==` is the equal to operator. It is different from `=`, which is used for assignment and for setting function arguments.
``` {webr-r}
#the equal to is validating whether the left hand side
#is the same as the right hand side and its output is TRUE or FALSE
7 == 7
#generates TRUE whereas
6 == 7
#generates FALSE
```
### aes (aesthetics in ggplot)
The `aes` function tells ggplot what to plot. `aes` defines the aesthetics of the plot, which need to be mapped to data. So a ggplot needs `data` and `mappings`.
The `gg` in `ggplot` stands for `grammar of graphics`, after the book "The Grammar of Graphics" by Leland Wilkinson, which Hadley Wickham used as the basis for the `ggplot` package in 2005.
A `ggplot` consists of:
- data
- aesthetic mappings (like x, y, shape, color etc)
- geometric objects (like points, lines etc)
- statistical transformations (like `stat_smooth`)
- scales
- coordinate systems
- themes and layouts
- faceting
``` {webr-r}
#ggplot basics with one geometric object "geom_point"
#and several aesthetics
ggplot(data = mtcars,
       mapping = aes(x = mpg,
                     y = disp,
                     color = hp,
                     shape = as.factor(cyl))) +
  geom_point()
```
### %in% (match operator)
This is handy for checking whether elements occur in a vector and for filtering on them.
``` {webr-r}
my_groups <- c("50.000", "100.000", "150.000")
"50.000" %in% my_groups #generates TRUE
#and the other way around
my_groups %in% c("50.000", "100.000")
#this is useful when filtering specific elements in a tibble
iris %>%
  filter(Species %in% c("setosa", "virginica"))
```
## Practical tips
### Running your code
Webr code in the browser can be run as a complete code block by clicking the `Run code` button right above the block, once the webr status is `Ready!`.

Another option is to select one or more lines of code and press `Cmd + Enter` or `Ctrl + Enter`. This executes only the lines that you have selected.
### Simple troubleshooting of your pipelines and ggplots
Code is rarely typed in perfectly the first time, so you will get errors and warnings. It is good practice to break your full code block or pipe down into parts and to check after which line the code stops working properly.
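One way to do this, sketched here for a small `mtcars` pipeline, is to store and inspect each intermediate result. The object names `step_1` to `step_3` are made up for illustration, and the sketch assumes `dplyr` is available.

``` {webr-r}
library(dplyr)
# step 1: run only the first part of the pipe and inspect it
step_1 <- mtcars %>%
  select(mpg, cyl, disp)
head(step_1)
# step 2: add the next line and check that the new column exists
step_2 <- step_1 %>%
  mutate(new_column = mpg * cyl)
head(step_2)
# step 3: add the filter and check how many rows remain
step_3 <- step_2 %>%
  filter(new_column > 130)
nrow(step_3)
```

In the browser you can get the same effect without extra objects by selecting the pipe up to, but not including, a trailing `%>%` or `+` and running only that selection.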
## ADVANCED EXERCISE
### Building your data visualisation step by step
Let's take the built-in R dataset `USArrests`. We want to visualize how the relative number of murders in the state of Massachusetts compares to that in the other states with the highest urban population. In the dataset, the `Murder` column holds the number of murder arrests per 100,000 residents.
``` {webr-r}
USArrests
head(USArrests)
glimpse(USArrests)
#please note that the states are listed as rownames. The glimpse does not show the rownames!
```
::: callout-tip
#### Exercise x
Make a plot that addresses the above dataviz problem.
``` {webr-r}
USArrests #%>%
#......
#......
#......
#......
#ggplot.....
#.......
#etc
```
::: {.callout-note collapse="true"}
#### HINTS
Hints:
Do the following in your coding:
* `glimpse` at the data and look at the first rows using `head()`
* use `tibble::rownames_to_column()` to make a separate column called `states`
* clean the column names using `janitor::clean_names()`
* turn the data table into a `tibble` using `as_tibble()`
* keep only the top states by filtering on the urban population (higher than 74)
* plot the data using a `geom_col`
* label the x-axis and leave the y-axis label empty
* highlight the Massachusetts bar using a separate `geom_col` layer, where you filter the original data inside the `geom_col` call with `data = . %>% filter(str_detect(states, "Mass"))`. Also give this bar a red color.
* apply a nice theme so that only the x-axis grid lines are shown and there are no lines for the y and x axis.
* also make sure that the x-axis starts at zero
* use `forcats::fct_reorder()` to sort the states on the y-axis from the highest to the lowest murder rate.
Include all these aspects step by step.
:::
:::
::: {.callout-caution collapse="true"}
#### Solution to Exercise x
``` {webr-r}
#webr::install("forcats")
USArrests %>%
  tibble::rownames_to_column(var = "states") %>%
  janitor::clean_names() %>%
  as_tibble() %>%
  filter(urban_pop > 74) %>%
  ggplot(aes(x = murder,
             y = forcats::fct_reorder(states, murder))) +
    geom_col(fill = "grey70") +
    geom_col(data = . %>%
               filter(stringr::str_detect(states, "Mass")),
             fill = "red") +
    labs(y = "",
         x = "murder arrests per 100,000 residents") +
    scale_x_continuous(expand = c(0,0)) +
    theme_minimal(base_size = 18) +
    theme(panel.grid.major.y = element_blank())
```
:::