---
title: "Introduction to R"
author: "Alan R. Pearse, Natural Resources Society"
date: "24 July 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Preamble
R is a high-level statistical programming language. It is widely used for manipulating, graphing and modelling data in a range of fields, from ecology to finance and beyond.
This workshop is an introduction to the basics of R. You will learn:
1. How to navigate around the RStudio interface, a free Integrated Development Environment (IDE) for R;
2. How to install and load packages that provide extra functionality for R;
3. How to import data into R from spreadsheets;
4. How to manipulate and summarise data in R; and
5. How to produce high-quality visualisations of data in R.
## The RStudio IDE
This section is hands-on. You will be taken through the RStudio environment by the workshop demonstrator.
## Installing and loading packages
R has a wide range of functionality built in. Because R is open-source software, members of the R community have also built and published plug-in 'packages' that extend R's functionality to do more complicated tasks. In this section, you will learn how to install and load several packages that we will use throughout the workshop.
To install a package, make sure you have an internet connection. On the QUT computers, make sure you have external internet access. The workshop demonstrator can help you with this.
Once you're sure that you have internet access, run the following commands:
```{r, eval = FALSE}
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
```
Once the installation commands have run successfully, try to avoid running them again. Re-running them may cause problems due to folder permission issues, which arise when R tries to overwrite an existing copy of a package.
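One way to avoid re-installing is to check which packages are missing before installing anything. The sketch below is only an illustration: `missing_packages` is a hypothetical helper name, not part of R or of the workshop pack, and the install line is left commented out.

```{r}
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "tidyr", "ggplot2"))
to_install  # empty if all three packages are already installed

# Uncomment to install only what is missing (needs internet access):
# if (length(to_install) > 0) install.packages(to_install)
```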
Installing a package *does not* mean you can immediately access the functionality in the package. You have to load packages at the beginning of every R session to use them. To load packages, run the following commands:
```{r}
library(dplyr)
library(tidyr)
library(ggplot2)
```
You now have access to the functionality of these packages!
## Setting up a session in R
When people use R, they usually import their data from files that exist somewhere on their computer. These have to be brought into the R environment by running specific commands. However, we can simplify the data loading process by *setting the working directory*, which is the folder where R looks, by default, to find files which you refer to in your code.
There are at least three ways of setting the working directory.
1. By running the command `setwd()` with the path to a folder on your computer.
2. By running the command `setwd()` containing another command which allows you to select the relevant folder in Windows Explorer (only on Windows OS).
3. By point-and-click methods inside RStudio.
The workshop demonstrator can help you with option (3). But for options (1) and (2), see the code blocks below:
```{r, eval = FALSE}
# setwd with path to folder written explicitly.
# You will need to change this to suit your own computer and needs.
setwd("D:/Dropbox/NRS")
# setwd with command to select a folder inside Windows explorer.
setwd(choose.dir())
```
Setting the working directory can be the difference between having to write
```{r, eval = FALSE}
# Without setwd
read.csv("D:/Dropbox/NRS/meuse.csv")
```
and
```{r,eval = FALSE}
# With setwd
read.csv("meuse.csv")
```
Try setting your working directory to the folder where you've unzipped the contents of the workshop pack.
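If you are unsure which folder R is currently using, `getwd()` reports it and `list.files()` shows what R can see there. The sketch below uses a temporary folder purely as a stand-in for your own workshop folder, and returns to the original folder at the end.

```{r}
old_wd <- getwd()                # remember where we started
setwd(tempdir())                 # stand-in for your own workshop folder
getwd()                          # confirm the working directory has changed
list.files(pattern = "\\.csv$")  # list any .csv files R can see here
setwd(old_wd)                    # go back to the original folder
```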
## Importing data from spreadsheets
Most data you will encounter lives in spreadsheets. You will have to access the spreadsheets from inside R if you want to use them in an R-based workflow.
For this workshop, we will assume that the data of interest lives in a comma-separated values file (.csv). We will use the command `read.csv` to access these. However, there are functions in R for reading Excel spreadsheets, text files, database tables, and even Google Forms. Since there is such a wide range of options, we can't deal with all of them. But .csv files are a good place to start, especially for scientific applications.
We assume you have set your working directory to the folder which contains meuse.csv. This file came with your workshop pack. To read this data into R, run the following command:
```{r}
meuse <- read.csv(
  file = "meuse.csv",
  as.is = TRUE
)
```
Note that `<-` is the assignment operator. For a simple assignment like this one, it is equivalent to `=`. It sets the value of the name on its left-hand side to the data/object/thing on its right-hand side.
If the code ran successfully, you won't see anything. To check that you got the result you expected, run a simple command on the data. Let's use `names`, a function which displays all the column names of a dataset.
```{r}
names(meuse)
```
If you're seeing something like the above, everything is fine.
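The `as.is = TRUE` argument asks `read.csv` to keep text columns as plain character strings rather than converting them to factors. A minimal sketch of this, using a tiny made-up file written to a temporary location (the values are invented for illustration and are not taken from meuse.csv):

```{r}
# Write a tiny, made-up CSV file to a temporary location.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("site,landuse,zinc",
    "1,Ah,1022",
    "2,W,257"),
  demo_path
)

demo <- read.csv(file = demo_path, as.is = TRUE)
str(demo)  # shows column names, types and the first few values
```

`str` is a handy companion to `names` because it shows the type of each column as well as its name.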
## Writing out data
You can take an object inside R and write it to a spreadsheet-like file by using
```{r, eval = FALSE}
write.csv(
  x = meuse,
  file = "a_new_csv.csv",
  row.names = FALSE
)
```
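The `row.names = FALSE` argument matters more than it might seem. By default, `write.csv` also writes the row names as an extra first column. The sketch below, using a small made-up `data.frame` and temporary files, shows the difference when each file is read back in.

```{r}
demo_out <- data.frame(site = c("a", "b"), zinc = c(1.5, 2.25))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")
write.csv(demo_out, file = with_rows)                        # default: row names are written
write.csv(demo_out, file = without_rows, row.names = FALSE)  # row names are left out

names(read.csv(with_rows))     # an extra first column holds the row names
names(read.csv(without_rows))  # only the original columns
```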
## A more formal introduction to data structures in R
R is an object-oriented language, so most of its functionality is centred around 'classes' and 'objects'. A class is an abstraction: when we talk about matrices and their properties, we mean matrices in general, not any particular matrix. An object is a specific instance of a class. For example,
```{r, echo = FALSE}
matrix(runif(4), nrow = 2)
```
is an object of class `matrix`.
Classes define what sort of operations are possible with a particular data structure. In this workshop, we will be dealing with objects of class `data.frame`.
The `meuse` object we have created is a `data.frame`. We can query the class of objects using the command `class`.
```{r}
class(meuse)
```
The `data.frame` class is very versatile. Each `data.frame` contains a specific number of rows or observations
```{r}
nrow(meuse)
```
and columns or variables
```{r}
ncol(meuse)
```
with set names for the variables
```{r}
names(meuse)
```
Each column contains a different type of information. Some might contain discrete variables and others might contain numbers. But all these data types exist within the one object. If it's hard to take in, don't worry too much. Objects of class `data.frame` are so ubiquitous that most R functions can deal with them. The functions that we will see inside this workshop can all deal with `data.frame` objects.
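To see this mix of column types for yourself, you can ask for the class of every column at once. A minimal sketch with a small made-up `data.frame` (the column names and values are invented for illustration):

```{r}
demo_types <- data.frame(
  site = c("a", "b", "c"),
  soil = c(1L, 2L, 1L),
  zinc = c(1022.5, 257.0, 640.2),
  stringsAsFactors = FALSE
)

sapply(demo_types, class)  # one class per column
str(demo_types)            # compact overview of the whole object
```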
## Viewing data
You can view an entire object inside your console by simply running a command containing the name of the object, like this:
```{r, eval = FALSE}
meuse
```
But most datasets contain too many rows to display conveniently inside the console.
A more common approach is to look at the first and last six rows of a `data.frame` object using the `head` and `tail` commands.
```{r}
head(meuse) # first six
tail(meuse) # last six
```
These kinds of outputs are much more manageable for the console.
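Both functions accept an argument `n` if six rows is not the number you want. A short sketch with a made-up 20-row `data.frame`:

```{r}
demo_long <- data.frame(id = 1:20, value = (1:20)^2)

head(demo_long, n = 3)  # first three rows
tail(demo_long, n = 2)  # last two rows
```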
If you want to view your entire dataset in a scrollable window inside RStudio, you can run
```{r, eval = FALSE}
View(meuse)
```
## Accessing data and subsets of data
There are times when we only want or need one or two columns from a `data.frame` or a few rows. In these situations, you have to know how to access specific portions of your data.
Let's start with row and column indexing. You can refer to specific rows and columns of matrices and data frames like this:
```{r, eval = FALSE}
meuse[1,] # This gets the first row
meuse[,1] # This gets the first column
meuse[1,1] # This gets the value in the first row of the first column
```
To access multiple rows or multiple columns, we extend the indices with vectors.
```{r, eval = FALSE}
meuse[c(1,2,3), ]
meuse[1:3, ] # These get the first three rows
meuse[, c(1,2,3)]
meuse[, 1:3] # These get the first three columns
```
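Two details are worth knowing about this kind of indexing. First, selecting a single column of a `data.frame` with `[, 1]` returns a plain vector unless you add `drop = FALSE`. Second, columns can be selected by name as well as by position. The sketch below uses a small made-up `data.frame`.

```{r}
demo_idx <- data.frame(x = 1:4, y = c(2.5, 3.5, 4.5, 5.5), z = letters[1:4])

class(demo_idx[, 1])                # a vector, not a data.frame
class(demo_idx[, 1, drop = FALSE])  # still a data.frame
demo_idx[2:3, c("x", "z")]          # rows 2 to 3, columns chosen by name
```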
More often, it's useful to be able to access entire columns by name. Say, for example, you want to make a histogram of the zinc concentrations around the Meuse river. Then you can write
```{r}
hist(meuse$zinc)
```
We can do calculations on entire columns conveniently using this `$` notation. For example, for a histogram of log(zinc concentrations), we write
```{r}
hist(log(meuse$zinc))
```
Another common situation is where you have to find some parts of your dataset which conform to a set of logical criteria. We can find these using the `subset` function. It is used like this:
```{r}
meuse_sub <- subset(
  meuse, # the name of the dataset
  zinc < 200 # the logical condition
)
head(meuse_sub)
```
We can combine multiple logical conditions. For example,
```{r}
meuse_sub <- subset(
  meuse,
  zinc < 200 & lead < 200
) # Must satisfy both conditions
head(meuse_sub)
meuse_sub <- subset(
  meuse,
  zinc < 200 | lead < 200
) # Can satisfy either condition
head(meuse_sub)
```
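It can help to see what a logical condition actually is: a vector with one `TRUE` or `FALSE` per row. The sketch below, using a small made-up `data.frame`, builds that vector by hand and uses it with the row indexing shown earlier. When there are no missing values, this selects the same rows as `subset`.

```{r}
demo_sub <- data.frame(zinc = c(150, 320, 90, 410), lead = c(80, 250, 60, 120))

keep <- demo_sub$zinc < 200 & demo_sub$lead < 200
keep                                        # one TRUE/FALSE per row
demo_sub[keep, ]                            # rows where the condition is TRUE
subset(demo_sub, zinc < 200 & lead < 200)   # the same rows via subset
```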
## Doing calculations with data
Often it is useful to be able to compute summary statistics of columns in your data. In situations like these, we can write
```{r}
mean(meuse$zinc)
sd(meuse$zinc)
```
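If a column contains missing values (`NA`), functions like `mean` and `sd` return `NA` unless told to ignore them with `na.rm = TRUE`. A short sketch with made-up numbers:

```{r}
demo_values <- c(10, NA, 30)

mean(demo_values)                # NA, because one value is missing
mean(demo_values, na.rm = TRUE)  # 20, ignoring the missing value
sd(demo_values, na.rm = TRUE)
```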
If we want to calculate entirely new columns of data based on existing columns of data, we can write things like
```{r}
meuse$log_zinc <- log(meuse$zinc)
```
In this case, we refer to a column in the dataset (it doesn't have to exist already) and define it based on a calculation involving a column in the dataset which already exists.
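The same pattern works for calculations involving more than one column, or a column and a summary statistic. A minimal sketch with a made-up `data.frame` (the new column names are just examples):

```{r}
demo_calc <- data.frame(zinc = c(100, 200, 400), lead = c(50, 50, 100))

demo_calc$zinc_to_lead <- demo_calc$zinc / demo_calc$lead        # uses two existing columns
demo_calc$zinc_centred <- demo_calc$zinc - mean(demo_calc$zinc)  # column minus its own mean
demo_calc
```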
## Advanced summaries of data
Being able to compute summary statistics on columns of data is important. However, this isn't necessarily useful when your dataset contains distinct groups of observations and you need to be able to calculate statistics within each of those groups.
In the `meuse` dataset, there are three types of soil (just labelled 1, 2, 3) where concentrations of various heavy metals were observed. We might want to know the mean concentration of lead, cadmium and copper in each of the soil types.
To do this, we need to use the `group_by` and `summarise` functions inside the package `dplyr`.
In the first step, we identify the column which contains the grouping variable for our dataset. In this case, it's called 'soil'.
```{r}
meuse_soil_groups <- group_by(
  meuse, # The dataset
  soil # The grouping variable
)
```
Note that there is superficially nothing different about the new `data.frame` object. The changes have occurred under the hood: `group_by` stores the grouping alongside the data, so that when we perform the summaries on cadmium, lead and copper concentrations, we get the results we want.
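If you want to confirm that the grouping has been recorded, you can inspect the grouped object. The sketch below uses a small made-up `data.frame` rather than `meuse`.

```{r}
library(dplyr)

demo_soil <- data.frame(soil = c(1, 1, 2, 3), lead = c(100, 200, 50, 80))
demo_grouped <- group_by(demo_soil, soil)

class(demo_grouped)       # includes "grouped_df" as well as "data.frame"
group_vars(demo_grouped)  # the name of the grouping column
```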
The next step involves using the `summarise` function like so
```{r}
summarise(
  meuse_soil_groups, # The grouped dataset
  Mean_Cd = mean(cadmium),
  Mean_Pb = mean(lead),
  Mean_Cu = mean(copper)
  # Format: name = calculation(column)
)
```
Note that this is quite different to
```{r}
summarise(
  meuse,
  Mean_Cd = mean(cadmium),
  Mean_Pb = mean(lead),
  Mean_Cu = mean(copper)
)
```
We can do more than just calculate means. Virtually any function is allowed to be used inside `summarise`, as long as its output is a single value. To demonstrate, if we want the mean, median, standard deviation and sum of the lead concentrations in each soil group, we would write
```{r}
summarise(
  meuse_soil_groups,
  Mean = mean(lead),
  Median = median(lead),
  StdDev = sd(lead),
  Sum = sum(lead)
)
```
Note also that the names we give to the new columns are completely arbitrary, as long as they are valid column names; i.e. they don't contain special characters or start with numbers.
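Many `dplyr` users chain the grouping and summarising steps together with the pipe operator `%>%`, which passes the result on its left into the function on its right. This is an alternative style to the one used in this workshop, sketched here on a small made-up `data.frame`. It also uses the `dplyr` function `n()` to count the rows in each group.

```{r}
library(dplyr)

demo_lead <- data.frame(soil = c(1, 1, 2, 2, 2), lead = c(100, 200, 50, 70, 90))

demo_summary <- demo_lead %>%
  group_by(soil) %>%
  summarise(
    N    = n(),       # number of rows in each group
    Mean = mean(lead) # mean lead value in each group
  )
demo_summary
```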
## Graphical summaries of data
One of the main attractions of R is the ability to create publication-quality figures with ease. In this section, you will learn about two different ways to plot data in R:
1. `base` plot
2. `ggplot2`
Base plot is the graphics system that comes pre-packaged with every installation of R. You can learn it quickly but it takes a long time to master, and for complicated tasks it is very difficult to use. It is characterised by the use of single-line commands like
```{r}
plot(log_zinc ~ dist, data = meuse)
```
The package `ggplot2` provides a different framework for plotting. It takes a bit of time to learn but, once you have learned it, you can create beautiful plots in very little time. It is characterised by the use of longer, multi-line commands that are built up layer-by-layer to produce a plot.
```{r}
ggplot(
  data = meuse,
  aes(x = dist, y = log_zinc)
) +
  geom_point()
```
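To make the layer-by-layer idea concrete, the sketch below stores a plot in an object and adds one piece at a time. The data are made up for illustration, and the straight-line fit from `geom_smooth` is just an example of a second layer, not something the rest of this workshop relies on.

```{r}
library(ggplot2)

demo_plot_data <- data.frame(
  dist     = c(0.0, 0.1, 0.2, 0.4, 0.8),
  log_zinc = c(7.0, 6.6, 6.1, 5.6, 5.0)
)

p <- ggplot(data = demo_plot_data, aes(x = dist, y = log_zinc))  # axes only, nothing drawn yet
p <- p + geom_point()                                            # layer 1: the points
p <- p + geom_smooth(method = "lm", se = FALSE)                  # layer 2: a straight-line fit
p <- p + theme_bw()                                              # change the overall look
p
```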
### Histograms
You can make histograms in base plot by using `hist`.
```{r}
hist(
  meuse$log_zinc, # A single column of data
  xlab = "log(zinc concentration)",
  # xlab for x-axis label
  main = ""
  # main for plot title
)
```
In ggplot2, the plot is opened using `ggplot` and then the histogram is drawn using `geom_histogram`.
```{r}
ggplot(
  data = meuse, # The dataset
  aes(
    x = log_zinc # Name of column to draw in histogram
  )
) +
  geom_histogram(
    col = "black", # for bar outline
    fill = "grey70" # for bar filling
  ) + # Draws a histogram
  theme_bw() + # Controls background display
  labs(
    x = "log(zinc concentration)",
    # x for x-axis label
    title = ""
    # title for plot title
  )
```
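When `geom_histogram` is called without saying how many bars to draw, it uses 30 bins and prints a message suggesting you choose a better value. The sketch below sets the number of bins explicitly, using made-up data. The value 15 is arbitrary and only for illustration.

```{r}
library(ggplot2)

set.seed(1)  # make the made-up data reproducible
demo_hist <- data.frame(log_zinc = rnorm(100, mean = 6, sd = 0.7))

h <- ggplot(data = demo_hist, aes(x = log_zinc)) +
  geom_histogram(bins = 15, col = "black", fill = "grey70") +
  theme_bw()
h

# Base plot equivalent: `breaks` is treated as a suggestion, not an exact count
hist(demo_hist$log_zinc, breaks = 15, main = "")
```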
### Boxplots
Boxplots for a single variable can be drawn using `boxplot`.
```{r}
boxplot(
  meuse$log_zinc,
  main = "",
  xlab = "",
  ylab = "log(zinc concentration)"
)
```
We can also split the boxplots by a discrete variable.
```{r}
boxplot(
  log_zinc ~ soil,
  data = meuse, # the dataset
  main = "",
  xlab = "Soil type",
  ylab = "log(zinc concentration)"
)
```
In ggplot2, we can do the same things but the single-variable boxplot requires more code to create.
```{r}
ggplot(
  data = meuse,
  aes(x = 0, y = log_zinc)
) +
  geom_boxplot() +
  theme_bw() +
  labs(
    x = "",
    y = "log(zinc concentration)",
    title = ""
  ) +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) # Removes text from x axis
```
However, ggplot2 is slightly more elegant for the boxplot split by a factor.
```{r}
ggplot(
  data = meuse,
  aes(x = as.factor(soil), y = log_zinc)
) +
  geom_boxplot() +
  theme_bw() +
  labs(
    x = "Soil type",
    y = "log(zinc concentration)",
    title = ""
  )
```
### Scatter plots
Simple scatter plots are easy to create in base plot and ggplot2. We have already seen examples of these.
However, when it comes to more advanced scatter plots, with points coloured by a third variable, and with a legend, ggplot2 is much more elegant than base plot.
```{r}
# base plot
meuse$col <- NA # must create separate column with colour names
meuse$col[meuse$soil == 1] <- "red"
meuse$col[meuse$soil == 2] <- "green"
meuse$col[meuse$soil == 3] <- "blue"
plot(
  log_zinc ~ dist,
  data = meuse,
  col = meuse$col,
  xlab = "Distance from river",
  ylab = "log(zinc concentration)",
  main = ""
)
legend(
  x = "topright",
  col = c("red", "green", "blue"),
  pch = c(1, 1, 1),
  legend = c("1", "2", "3")
) # Legend must be plotted separately
# ggplot2
ggplot(
  data = meuse,
  aes(
    x = dist,
    y = log_zinc
  )
) +
  geom_point(
    aes(
      col = as.factor(soil) # Simply map variable to colour
    )
  ) +
  theme_bw() +
  labs(
    x = "Distance from river",
    y = "log(zinc concentration)",
    col = "Soil type" # Label legend
  ) +
  theme(
    legend.position = "bottom" # legend placement
  )
```
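The colour column for the base plot can also be built in one step by using the soil code as an index into a vector of colour names. This is a sketch on made-up data, and it assumes the soil types are coded as the whole numbers 1, 2 and 3.

```{r}
demo_cols <- data.frame(soil = c(1, 3, 2, 1))

palette_by_soil <- c("red", "green", "blue")      # positions 1, 2, 3 match soil types 1, 2, 3
demo_cols$col <- palette_by_soil[demo_cols$soil]  # look up one colour per row
demo_cols
```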