# Simple Statistics in R {#stats_r}
>Note: The content of this Chapter is a very old draft which we've included here as a placeholder. Although you may still find some of the content useful, we are in the process of substantially rewriting this Chapter.
\
In addition to R’s powerful data manipulation and graphics facilities, R includes a host of procedures which you can use to analyse your data. Many of these procedures are included with the base installation of R; many more are available in packages from the CRAN website (see [Chapter 1](#packages) for more details). All of the analyses described in this Chapter can be carried out without installing additional packages.
## One and two sample tests
The two main functions for these types of tests are `t.test()`\index{t.test()} and `wilcox.test()`\index{wilcox.test()}, which perform *t* tests and Wilcoxon’s signed rank tests respectively. Both of these tests can be applied to one and two sample analyses as well as to paired data.
As an example of a one sample *t* test we will use the `trees` dataset which is included with R. To access this in-built dataset we can use the `data()`\index{data()} function. This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees (see `?trees` for more detail).
```{r ss1, echo=TRUE, collapse=TRUE}
data(trees)
str(trees)
summary(trees)
```
If we wanted to test whether the mean height of black cherry trees in this sample is equal to 70 ft (`mu = 70`), assuming these data are normally distributed, we can use a `t.test()` to do so.
```{r ss2, echo=TRUE, collapse=TRUE}
t.test(trees$Height, mu = 70)
```
The above summary has a fairly logical layout and includes the name of the test that we have asked for (`One Sample t-test`), which data have been used (`data: trees$Height`), the *t* statistic, degrees of freedom and associated *p* value (`t = 5.2, df = 30, p-value = 1e-05`). It also states the alternative hypothesis (`alternative hypothesis: true mean is not equal to 70`) which tells us this is a two-sided test (as we allow the mean to be either greater or less than 70), the 95% confidence interval for the mean (`95 percent confidence interval: 73.66 78.34`) and also an estimate of the mean (`sample estimates: mean of x : 76`). In the above example, the *p* value is very small and therefore we would reject the null hypothesis and conclude that the mean height of our sample of black cherry trees is not equal to 70 ft.
The function `t.test()` also has a number of additional arguments which can be used for one-sample tests. You can specify that a one-sided test is required by using either the `alternative = "greater"` or `alternative = "less"` argument, which tests whether the sample mean is greater than or less than the specified mean respectively. For example, to test whether our sample mean is greater than 70 ft:
```{r ss3, echo=TRUE, collapse=TRUE}
t.test(trees$Height, mu = 70, alternative = "greater")
```
You can also change the confidence level used for estimating the confidence intervals using the argument `conf.level = 0.99`. If specified in this way, 99% confidence intervals would be estimated.
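For example, repeating the one-sample test above but asking for 99% confidence intervals:
```{r ss2b, echo=TRUE, collapse=TRUE}
# same one-sample t test as before, but with 99% confidence intervals
t.test(trees$Height, mu = 70, conf.level = 0.99)
```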
Although *t* tests are fairly robust against small departures from normality, you may wish to use a rank-based method such as Wilcoxon’s signed rank test instead. In R, this is done in almost exactly the same way as the *t* test but using the `wilcox.test()` function.
```{r ss4, echo=TRUE, collapse=TRUE}
wilcox.test(trees$Height, mu = 70)
```
Don’t worry too much about the warning message, R is just letting you know that your sample contained a number of values which were the same and therefore it was not possible to calculate an exact *p* value. This is only really a problem with small sample sizes. You can also use the arguments `alternative = "greater"` and `alternative = "less"`.
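For example, to test whether tree heights tend to be greater than 70 ft:
```{r ss4b, echo=TRUE, collapse=TRUE}
# one-sided Wilcoxon signed rank test
wilcox.test(trees$Height, mu = 70, alternative = "greater")
```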
Whichever test you use, it's always a good idea to examine your data for departures from normality rather than just assuming everything is OK. Perhaps the simplest way to assess normality is the ‘quantile-quantile plot’ (Q-Q plot). This graph plots the ranked sample quantiles from your distribution against a similar number of ranked quantiles taken from a normal distribution. If your data are normally distributed then your data points will fall approximately along a straight line. Departures from normality will show up as a curve or s-shape in your data points. Judging just how much departure is acceptable comes with a little bit of practice.
To construct a Q-Q plot you need to use both the `qqnorm()`\index{qqnorm()} and `qqline()`\index{qqline()} functions. The `lty = 2` argument changes the line to a dashed line.
```{r ss5, echo=TRUE, collapse=TRUE}
qqnorm(trees$Height)
qqline(trees$Height, lty = 2)
```
If you insist on performing a specific test for normality you can use the function `shapiro.test()`\index{shapiro.test()} which performs a Shapiro–Wilk test of normality.
```{r ss6, echo=TRUE, collapse=TRUE}
shapiro.test(trees$Height)
```
In the example above, the *p* value = 0.4 which suggests that there is no evidence to reject the null hypothesis and we can therefore assume these data are normally distributed.
In addition to one-sample tests, both the `t.test()` and `wilcox.test()` functions can be used to test for differences between two samples. A two-sample *t* test is used to test the null hypothesis that the two samples come from distributions with the same mean (i.e. the means are not different). For example, a study was conducted to test whether ‘seeding’ clouds with dimethylsulphate alters the moisture content of clouds. Ten randomly selected clouds were ‘seeded’ and a further ten left ‘unseeded’. The dataset can be found in the `atmosphere.txt` data file located in the `data/` directory on the Rbook github page.
```{r ss7, echo=TRUE, collapse=TRUE}
atmos <- read.table('data/atmosphere.txt', header = TRUE)
str(atmos)
```
As with our previous data frame (`flowers`), these data are in the long format. The column `moisture` contains the moisture content measured in each cloud and the column `treatment` identifies whether the cloud was `seeded` or `unseeded`. To perform a two-sample *t* test
```{r ss8, echo=TRUE, collapse=TRUE}
t.test(atmos$moisture ~ atmos$treatment)
```
Notice the use of the formula method (`atmos$moisture ~ atmos$treatment`, which can be read as 'the moisture described by treatment') to specify the test. You can also use other methods depending on the format of the data frame. Use `?t.test` for further details. The details of the output are similar to the one-sample *t* test. The Welch’s variant of the *t* test is used by default and does not assume that the variances of the two samples are equal. If you are sure the variances in the two samples are the same, you can specify this using the `var.equal = TRUE` argument
```{r ss9, echo=TRUE, collapse=TRUE}
t.test(atmos$moisture ~ atmos$treatment, var.equal = TRUE)
```
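As an aside, if your moisture measurements happened to be stored as two separate vectors rather than in the long format, you could pass the vectors to `t.test()` directly. A sketch (not run here), assuming the `treatment` column contains the values `seeded` and `unseeded`:
```{r ss9b, echo=TRUE, eval=FALSE, collapse=TRUE}
# two-vector interface: equivalent to the formula method above
seeded   <- atmos$moisture[atmos$treatment == "seeded"]
unseeded <- atmos$moisture[atmos$treatment == "unseeded"]
t.test(seeded, unseeded)
```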
To test whether the assumption of equal variances is valid you can perform an *F*-test on the ratio of the group variances using the `var.test()`\index{var.test()} function.
```{r ss10, echo=TRUE, collapse=TRUE}
var.test(atmos$moisture ~ atmos$treatment)
```
As the *p* value is greater than 0.05, there is no evidence to suggest that the variances are unequal. Note however, that the *F*-test is sensitive to departures from normality and should not be used with data which is not normal. See the `car` package for alternatives.
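For example, Levene's test from the `car` package is less sensitive to departures from normality than the *F*-test. A sketch (not run here), assuming the `car` package has been installed:
```{r ss10b, echo=TRUE, eval=FALSE, collapse=TRUE}
# Levene's test for homogeneity of variance (more robust to non-normality)
library(car)
leveneTest(moisture ~ treatment, data = atmos)
```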
The non-parametric two-sample Wilcoxon test (also known as a Mann-Whitney U test) can be performed using the same formula method:
```{r ss11, echo=TRUE, collapse=TRUE}
wilcox.test(atmos$moisture ~ atmos$treatment)
```
You can also use the `t.test()` and `wilcox.test()` functions to test paired data. Paired data arise when there are two measurements on the same experimental unit (individual, site etc.), and the test essentially asks whether the mean difference between the paired observations is equal to zero. For example, the `pollution` dataset gives the biodiversity score of aquatic invertebrates collected using kick samples in 17 different rivers. These data are paired because two samples were taken on each river, one upstream of a paper mill and one downstream.
```{r ss12, echo=TRUE, collapse=TRUE}
pollution <- read.table('data/pollution.txt', header = TRUE)
str(pollution)
```
Note, in this case these data are in the wide format with upstream and downstream values in separate columns (see [Chapter 3](#reshape) on how to convert to long format if you want). To conduct a paired *t* test use the `paired = TRUE` argument.
```{r ss13, echo=TRUE, collapse=TRUE}
t.test(pollution$down, pollution$up, paired = TRUE)
```
The output is almost identical to that of a one-sample *t* test. It is also possible to perform a non-parametric matched-pairs Wilcoxon test in the same way
```{r ss14, echo=TRUE, collapse=TRUE}
wilcox.test(pollution$down, pollution$up, paired = TRUE)
```
The function `prop.test()`\index{prop.test()} can be used to compare two or more proportions. For example, a company wishes to test the effectiveness of an advertising campaign for a particular brand of cat food. The company commissions two polls, one before the advertising campaign and one after, with each poll asking cat owners whether they would buy this brand of cat food. The results are given in the table below
| | before | after |
|:--------------:|:-----------:|:-----------:|
| would buy | 45 | 71 |
| would not buy | 35 | 32 |
From the table above we can conclude that 56% of cat owners would buy the cat food before the campaign compared to 69% after. But, has the advertising campaign been a success?
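As a quick check, these percentages are just the ‘would buy’ counts divided by the column totals:
```{r ss14b, echo=TRUE, collapse=TRUE}
# proportion of owners who would buy, before and after the campaign
45 / (45 + 35)   # before
71 / (71 + 32)   # after
```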
The `prop.test()` function has two main arguments which are given as two vectors. The first vector contains the number of positive outcomes and the second vector the total numbers for each group. So to perform the test we first need to define these vectors
```{r ss15, echo=TRUE, collapse=TRUE}
buy <- c(45,71) # creates a vector of positive outcomes
total <- c((45 + 35), (71 + 32)) # creates a vector of total numbers
prop.test(buy, total) # perform the test
```
There is no evidence to suggest that the advertising campaign has changed cat owners’ opinions of the cat food (*p* = 0.1). Use `?prop.test` to explore additional uses of the binomial proportions test.
We could also analyse the count data in the above example as a Chi-square contingency table. The simplest method is to convert the tabulated counts into a 2 x 2 matrix using the `matrix()`\index{matrix()} function (note, there are many alternative methods of constructing a table like this).
```{r ss16, echo=TRUE, collapse=TRUE}
buyers <- matrix(c(45, 35, 71, 32), nrow = 2)
buyers
```
Notice that you enter the data column wise into the matrix and then specify the number of rows using `nrow =`.
We can also change the row names and column names from the defaults to make it look more like a table (you don’t really need to do this to perform a Chi-square test)
```{r ss17, echo=TRUE, collapse=TRUE}
colnames(buyers) <- c("before", "after")
rownames(buyers) <- c("buy", "notbuy")
buyers
```
You can then perform a Chi-square test to test whether the number of cat owners buying the cat food is independent of the advertising campaign using the `chisq.test()`\index{chisq.test()} function. In this example the only argument is our matrix of counts.
```{r ss18, echo=TRUE, collapse=TRUE}
chisq.test(buyers)
```
There is no evidence (*p* = 0.107) to suggest that we should reject the null hypothesis that the number of cat owners buying the cat food is independent of the advertising campaign. You may have spotted that for a 2x2 table, this test is exactly equivalent to the `prop.test()`. You can also use the `chisq.test()` function on raw (untabulated) data and to test for goodness of fit (see `?chisq.test` for more details).
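As a brief illustration of the goodness-of-fit usage, you supply a single vector of counts together with the expected proportions via the `p` argument. For example, to test whether the ‘after’ poll respondents split evenly between buyers and non-buyers:
```{r ss18b, echo=TRUE, collapse=TRUE}
# goodness-of-fit test: are the 'after' counts consistent with a 50:50 split?
chisq.test(c(71, 32), p = c(0.5, 0.5))
```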
## Correlation
In R, the Pearson’s product-moment correlation coefficient between two continuous variables can be estimated using the `cor()`\index{cor()} function. Using the `trees` data set again, we can determine the correlation coefficient of the association between tree `Height` and `Volume`.
```{r ss19, echo=TRUE, collapse=TRUE}
data(trees)
str(trees)
cor(trees$Height, trees$Volume)
```
or we can produce a matrix of correlation coefficients for all variables in a data frame
```{r ss20, echo=TRUE, collapse=TRUE}
cor(trees)
```
Note that the correlation coefficients are identical in each half of the matrix. Also, be aware that, although a matrix of coefficients can be useful, a little common sense should be used when applying `cor()` to data frames with numerous variables. It is not good practice to trawl through these types of matrices in the hope of finding large coefficients without having an *a priori* reason for doing so, and remember that the correlation coefficient assumes associations are linear.
If you have missing values in the variables you are trying to correlate, `cor()` will return `NA` for the affected coefficients (and many other functions in R will fail outright). You will either have to remove these observations (be very careful if you do this) or tell R what to do when an observation is missing. A useful argument you can use with the `cor()` function is `use = "complete.obs"`.
```{r ss21, echo=TRUE, collapse=TRUE}
cor(trees, use = "complete.obs")
```
The function `cor()` will return the correlation coefficient of two variables, but gives no indication whether the coefficient is significantly different from zero. To do this you need to use the function `cor.test()`\index{cor.test()}.
```{r ss22, echo=TRUE, collapse=TRUE}
cor.test(trees$Height, trees$Volume)
```
Two non-parametric alternatives to Pearson correlation are available within the `cor.test()` function: Spearman’s rank correlation and Kendall’s tau coefficient. To use either of these simply include the argument `method = "spearman"` or `method = "kendall"` depending on the test you wish to use. For example
```{r ss23, echo=TRUE, collapse=TRUE}
cor.test(trees$Height, trees$Volume, method = "spearman")
```
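The Kendall's tau version is requested in the same way (ties in these data mean R may warn that the *p* value is approximate):
```{r ss23b, echo=TRUE, collapse=TRUE}
cor.test(trees$Height, trees$Volume, method = "kendall")
```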
## Simple linear modelling {#simple_lm}
Linear models are one of the most widely used models in statistics and data science. They are often thought of as simple models but they're very flexible and able to model a wide variety of experimental and survey designs. Many of the statistical approaches you may have used previously (such as linear regression, *t*-test, ANOVA, ANCOVA etc) can be expressed as a linear model, so the good news is that you're probably already familiar with linear models (albeit indirectly). They also form the foundation of more complicated modelling approaches and are relatively easy to extend to incorporate additional complexity. In this section we'll learn how to fit some simple linear models using R and cover some of the more common applications. We won't go into any detail of the underlying linear modelling theory but rather focus on the practicalities of model fitting and R code.
The main function for fitting linear models in R is the `lm()`\index{lm()} function (short for linear model!). The `lm()` function has many arguments but the most important is the first argument which specifies the model you want to fit using a *model formula* which typically takes the general form:
\
>*response variable ~ explanatory variable(s)*
\
This model formula is simply read as
\
>*'variation in the response variable modelled as a function (~) of the explanatory variable(s)'.*
\
The response variable is also commonly known as the 'dependent variable' and the explanatory variables are sometimes referred to as 'independent variables' (or less frequently as 'predictor variables'). There is also an additional term in our model formula which represents the variation in our response variable **not** explained by our explanatory variables but you don't need to specify this when using the `lm()` function.
As mentioned above, many of the statistical 'tests' you might have previously used can be expressed as a linear model. For example, if we wanted to perform a bivariate linear regression between a response variable (`y`) and a single continuous explanatory variable (`x`) our model formula would simply be
```
y ~ x
```
On the other hand, if we wanted to use an ANOVA to test whether the group means of a response variable (`y`) were different between a three level factor (`x`) our model formula would look like
```
y ~ x
```
OK, hang on, they both look identical, what gives? In addition to the model formula, the type of linear model you fit is also determined by the type of data in your explanatory variable(s) (i.e. what class of data). If your explanatory variable is continuous then you will fit a bivariate linear regression. If your explanatory variable is a factor (i.e. categorical data) you will fit an ANOVA type model.
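If you're unsure which type of model you'll end up fitting, check the class of your explanatory variable first and convert it if necessary. A minimal sketch (not run), using a hypothetical data frame `dat` with columns `y` and `x`:
```{r class-check, echo=TRUE, eval=FALSE, collapse=TRUE}
class(dat$x)            # "numeric" -> regression; "factor" -> ANOVA-type model
dat$x <- factor(dat$x)  # convert x to a factor if it should be treated as categorical
```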
You can also increase the complexity of your linear model by including additional explanatory variables in your model formula. For example, if we wanted to fit a two-way ANOVA both of our explanatory variables `x` and `z` would need to be factors and separated by a `+` symbol
```
y ~ x + z
```
If we wanted to perform a factorial ANOVA to identify an interaction between both explanatory variables we would separate our explanatory variables with a `:` symbol whilst also including our main effects in our model formula
```
y ~ x + z + x:z
```
or by using the equivalent shortcut notation
```
y ~ x * z
```
It's important that you get comfortable with using model formulae (and we've only given the briefest of explanations above) when using the `lm()` function (and other functions) as it's remarkably easy to specify a model which is either nonsense or isn't the model you really wanted to fit. A summary table of various linear model formulae and the equivalent R code is given below.
\
|Traditional name | Model formula | R code |
|:--------------------------|:---------------------------|:---------------------------|
|Bivariate regression | Y ~ X1 (continuous) | `lm(Y ~ X1)` |
|One-way ANOVA | Y ~ X1 (categorical) | `lm(Y ~ X1)` |
|Two-way ANOVA | Y ~ X1 (cat) + X2(cat) | `lm(Y ~ X1 + X2)` |
|ANCOVA | Y ~ X1 (cat) + X2(cont) | `lm(Y ~ X1 + X2)` |
|Multiple regression | Y ~ X1 (cont) + X2(cont) | `lm(Y ~ X1 + X2)` |
|Factorial ANOVA | Y ~ X1 (cat) * X2(cat) | `lm(Y ~ X1 * X2)` or `lm(Y ~ X1 + X2 + X1:X2)`|
\
OK, time for an example. The data file `smoking.txt` summarises the results of a study investigating the possible relationship between mortality rate and smoking across 25 occupational groups in the UK. The variable `occupational.group` specifies the different occupational groups studied, the `risk.group` variable indicates the relative risk to lung disease for the various occupational groups and `smoking` is an index of the average number of cigarettes smoked each day (relative to the number smoked across all occupations). The variable `mortality` is an index of the death rate from lung cancer in each group (relative to the death rate across all occupational groups). In this data set, the response variable is `mortality` and the potential explanatory variables are `smoking` which is numeric and `risk.group` which is a three level factor. The first thing to do is import our data file using the `read.table()`\index{read.table()} function as usual and assign the data to an object called `smoke`. You can find a link to download these data [here][flow-data].
```{r ss24, echo=TRUE, eval=TRUE, collapse=TRUE}
smoke <- read.table('data/smoking.txt', header = TRUE, sep = "\t",
stringsAsFactors = TRUE)
str(smoke, vec.len = 2)
# the vec.len argument limits the number of 'first elements' displayed
```
\
Next, let's investigate the relationship between the `mortality` and `smoking` variables by plotting a scatter plot. We can use either the `ggplot2` package or base R graphics to do this. We'll use `ggplot2` this time and our old friend the `ggplot()`\index{ggplot()} function.
```{r ss25, echo=TRUE, collapse=TRUE}
library(ggplot2)
ggplot(mapping = aes(x = smoking, y = mortality), data = smoke) +
geom_point()
```
The plot does suggest that there is a positive relationship between the smoking index and mortality index.
To fit a simple linear model to these data we will use the `lm()` function and include our model formula `mortality ~ smoking` and assign the results to an object called `smoke_lm`.
```{r ss26, echo=TRUE, collapse=TRUE}
smoke_lm <- lm(mortality ~ smoking, data = smoke)
```
Notice that we have not used the `$` notation to specify the variables in our model formula; instead we've used the `data = smoke` argument. Although the `$` notation will work (i.e. `smoke$mortality ~ smoke$smoking`) it will more than likely cause you problems later on and should be avoided. In fact, we would go as far as to suggest that if any function has a `data =` argument you should **always** use it. How do you know if a function has a `data =` argument? Just look in the associated help file.
Perhaps somewhat confusingly (at least at first) it appears that nothing much has happened; you don’t automatically get the voluminous output that you normally get with other statistical packages. In fact, what R does is store the output of the analysis in what is known as an `lm` class object (which we have called `smoke_lm`) from which you are able to extract exactly what you want using other functions. If you're brave, you can examine the structure of the `smoke_lm` model object using the `str()`\index{str()} function.
```{r ss27, echo=TRUE, collapse=TRUE}
str(smoke_lm)
```
To obtain a summary of our analysis we can use the `summary()`\index{summary()} function on our `smoke_lm` model object.
```{r ss28, echo=TRUE, collapse=TRUE}
summary(smoke_lm)
```
This shows you everything you need to know about the parameter estimates (intercept and slope), their standard errors and associated *t* statistics and *p* values. The estimate for the Intercept suggests that when the relative smoking index is 0 the relative mortality rate is `-2.885`! The *p* value associated with the intercept tests the null hypothesis that the intercept is equal to zero. As the *p* value is large we fail to reject this null hypothesis. The `smoking` parameter estimate (`1.0875`) is the estimate of the slope and suggests that for every unit increase in the average number of cigarettes smoked each day the mortality risk index increases by 1.0875. The *p* value associated with the `smoking` parameter tests whether the slope of this relationship is equal to zero (i.e. no relationship). As our *p* value is small we reject this null hypothesis, so the slope is significantly different from zero and there is a significant relationship between smoking and mortality. The summary table also includes other important information such as the coefficient of determination (*R^2^*), adjusted *R^2^*, *F* statistic, associated degrees of freedom and *p* value. This information is a condensed form of an ANOVA table which you can see by using the `anova()`\index{anova()} function.
```{r ss29, echo=TRUE, collapse=TRUE}
anova(smoke_lm)
```
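If you just want the parameter estimates or their confidence intervals rather than the full summary, the `coef()` and `confint()` functions will extract them from the model object:
```{r ss29b, echo=TRUE, collapse=TRUE}
coef(smoke_lm)      # intercept and slope estimates
confint(smoke_lm)   # 95% confidence intervals for the parameters
```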
Now let's fit another linear model, but this time we will use the `risk.group` variable as our explanatory variable. Remember the `risk.group` variable is a factor and so our linear model will be equivalent to an ANOVA type analysis. We will be testing the null hypothesis that there is no difference in the mean mortality rate between the `low`, `medium` and `high` groups. We fit the model in exactly the same way as before.
```{r ss30, echo=TRUE, collapse=TRUE}
smoke_risk_lm <- lm(mortality ~ risk.group, data = smoke)
```
Again, we can produce an ANOVA table using the `anova()` function
```{r ss31, echo=TRUE, collapse=TRUE}
anova(smoke_risk_lm)
```
The results presented in the ANOVA table suggest that we can reject the null hypothesis (very small *p* value) and therefore the mean mortality rate index is different between `low`, `medium` and `high` risk groups.
As we did with our first linear model we can also produce a summary of the estimated parameters using the `summary()`\index{summary()} function.
```{r ss32, echo=TRUE, collapse=TRUE}
summary(smoke_risk_lm)
```
In the summary table the Intercept is set to the first level of `risk.group` (`high`) as this occurs first alphabetically. Therefore, the estimated mean mortality index for `high` risk individuals is `135`. The estimates for `risk.grouplow` and `risk.groupmedium` are mean differences from the intercept (`high` group). So the mortality index for the `low` group is `135 - 57.83 = 77.17` and for the `medium` group is `135 - 27.55 = 107.45`. The *t* values and *p* values in the summary table are associated with testing specific hypotheses. The *p* value associated with the intercept tests the null hypothesis that the mean mortality index for the `high` group is equal to zero. To be honest this is not a particularly meaningful hypothesis to test but we can reject it anyway as we have a very small *p* value. The *p* value for the `risk.grouplow` parameter tests the null hypothesis that the mean difference between `high` and `low` risk groups is equal to zero (i.e. there is no difference). Again we reject this null hypothesis and conclude that the means are different between these two groups. Similarly, the *p* value for `risk.groupmedium` tests the null hypothesis that the mean difference between `high` and `medium` groups is equal to zero which we also reject.
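If you would rather a different group acted as the reference level (the intercept), you can change it with the `relevel()` function and refit the model. A sketch (not run here), assuming the factor levels are named `low`, `medium` and `high` as above:
```{r ss32b, echo=TRUE, eval=FALSE, collapse=TRUE}
# refit the model with 'low' as the reference level instead of 'high'
smoke2 <- smoke
smoke2$risk.group <- relevel(smoke2$risk.group, ref = "low")
summary(lm(mortality ~ risk.group, data = smoke2))
```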
Don't worry too much if you find the output from the `summary()` function a little confusing. It takes a bit of practice and experience to be able to make sense of all the numbers. Remember though, the more complicated your model is, the more complicated your interpretation will be. And always remember, a model that you can't interpret is not worth fitting (most of the time!).
Another approach to interpreting your model output is to plot a graph of your data and then add the fitted model to this plot. Let's go back to the first linear model we fitted (`smoke_lm`). We can add the fitted line to our previous plot using the `ggplot2` package and the `geom_smooth` geom. We can easily include the standard errors by specifying the `se = TRUE` argument.
```{r ss33, echo=TRUE, collapse=TRUE, message=FALSE}
ggplot(mapping = aes(x = smoking, y = mortality), data = smoke) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
```
You can also do this with R's base graphics. Note though that the fitted line extends beyond the data which is not great practice. If you want to prevent this you can generate predicted values from the model using the `predict()`\index{predict()} function within the range of your data and then add these values to the plot using the `lines()`\index{lines()} function (a minimal sketch is given after the example below).
```{r ss34, echo=TRUE, collapse=TRUE}
plot(smoke$smoking, smoke$mortality, xlab = "smoking rate", ylab = "mortality rate")
abline(smoke_lm, col = "red")
```
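A sketch of the `predict()` and `lines()` approach mentioned above, restricting the fitted line to the observed range of the smoking index:
```{r ss34b, echo=TRUE, collapse=TRUE}
plot(smoke$smoking, smoke$mortality, xlab = "smoking rate", ylab = "mortality rate")
# predicted values within the observed range of the smoking index
new_smoking <- data.frame(smoking = seq(min(smoke$smoking), max(smoke$smoking),
                                        length.out = 100))
pred_mortality <- predict(smoke_lm, newdata = new_smoking)
lines(new_smoking$smoking, pred_mortality, col = "red")
```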
Before we sit back, relax and admire our model (or go and write that high impact paper your supervisor/boss has been harassing you about), our work is not finished. It's vitally important to check the underlying assumptions of your linear model. Two of the most important assumptions are equal variances (homogeneity of variance) and normality of residuals. To check for equal variances we can construct a graph of residuals versus fitted values. We can do this by first extracting the residuals and fitted values from our model object using the `resid()`\index{resid()} and `fitted()`\index{fitted()} functions.
```{r ss35, echo=TRUE, collapse=TRUE}
smoke_res <- resid(smoke_lm)
smoke_fit <- fitted(smoke_lm)
```
And then plot them using `ggplot` or base R graphics.
```{r ss36, echo=TRUE, collapse=TRUE}
ggplot(mapping = aes(x = smoke_fit, y = smoke_res)) +
geom_point() +
geom_hline(yintercept = 0, colour = "red", linetype = "dashed")
```
It takes a little practice to interpret these types of graph, but what you are looking for is no pattern or structure in your residuals. What you definitely don’t want to see is the scatter increasing around the zero line (red dashed line) as the fitted values get bigger (this has been described as looking like a trumpet, a wedge of cheese or even a slice of pizza), which would indicate unequal variances (heteroscedasticity).
To check for normality of residuals we can use our old friend the Q-Q plot using the residuals stored in the `smoke_res` object we created earlier.
```{r ss37, echo=TRUE, collapse=TRUE}
ggplot(mapping = aes(sample = smoke_res)) +
stat_qq() +
stat_qq_line()
```
Or the same plot with base graphics.
```{r ss38, echo=TRUE, collapse=TRUE}
qqnorm(smoke_res)
qqline(smoke_res)
```
Alternatively, you can get R to do most of the hard work by using the `plot()`\index{plot()} function on the model object `smoke_lm`. Before we do this we should tell R that we want to plot four graphs in the same plotting window in RStudio using `par(mfrow = c(2, 2))`. This command splits the plotting window into 2 rows and 2 columns.
```{r ss39, echo=TRUE, collapse=TRUE}
par(mfrow = c(2,2))
plot(smoke_lm)
```
The first two graphs (top left and top right) are the same residuals versus fitted and Q-Q plots we produced before. The third graph (bottom left) is the same as the first but plotted on a different scale (the absolute value of the square root of the standardised residuals) and again you are looking for no pattern or structure in the data points. The fourth graph (bottom right) gives you an indication of whether any of your observations are having a large influence (Cook’s distance) on your regression coefficient estimates. Leverage identifies observations which have unusually large values in their explanatory variables.
You can also produce these diagnostic plots using `ggplot` by installing the package `ggfortify`\index{ggfortify package} and using the `autoplot()`\index{autoplot()} function.
```{r ss40, echo=TRUE, collapse=TRUE, message=FALSE, warning=FALSE}
library(ggfortify)
autoplot(smoke_lm, which = 1:6, ncol = 2, label.size = 3)
```
What you do about influential data points or data points with high leverage is up to you. If you would like to examine the effect of removing one of these points on the parameter estimates you can use the `update()`\index{update()} function. Let's remove data point 2 (miners, mortality = 116 and smoking = 137) and store the results in a new object called `smoke_lm2`. Note, we do this to demonstrate the use of the `update()` function. You should think long and hard about removing any data point(s) and if you do you should **always** report this and justify your reasoning.
```{r ss41, echo=TRUE, collapse=TRUE}
smoke_lm2 <- update(smoke_lm, subset = -2)
summary(smoke_lm2)
```
There are numerous other functions which are useful for producing diagnostic plots. For example, `rstandard()`\index{rstandard()} and `rstudent()`\index{rstudent()} return the standardised and studentised residuals respectively. The function `dffits()`\index{dffits()} expresses how much an observation influences the associated fitted value and the function `dfbetas()`\index{dfbetas()} gives the change in the estimated parameters if an observation is excluded, relative to its standard error (the intercept is the solid line and the slope the dashed line in the example below). The solid bold line in the same graph represents the Cook’s distance. Examples of how to use these functions are given below.
```{r ss42, echo=TRUE, collapse=TRUE}
par(mfrow=c(2,2))
plot(dffits(smoke_lm), type = "l")
plot(rstudent(smoke_lm))
matplot(dfbetas(smoke_lm), type = "l", col = "black")
lines(sqrt(cooks.distance(smoke_lm)), lwd = 2)
```
## Other modelling approaches
As with most things R related, a complete description of the variety and flexibility of different statistical analyses you can perform is beyond the scope of this introductory text. Further information can be found in any of the excellent documents referred to in [Chapter 2](#rhelp). A table of some of the more common statistical functions is given below to get you started. \index{glm()} \index{gam()} \index{lme()} \index{lmer()} \index{gls()} \index{kruskal.test()} \index{friedman.test()}
\
| R function | Use |
|:---------------------|:-----------------------------------------------------------------------|
| `glm()` | Fit a generalised linear model with a specific error structure specified using the `family =` argument (Poisson, binomial, gamma) |
| `gam()` | Fit a generalised additive model. The R package `mgcv` must be loaded |
| `lme()` & `nlme()` | Fit linear and non-linear mixed effects models. The R package `nlme` must be loaded |
| `lmer()` | Fit linear, generalised linear and non-linear mixed effects models. The package `lme4` must be **installed** and loaded |
| `gls()` | Fit generalised least squares models. The R package `nlme` must be loaded |
| `kruskal.test()` | Performs a Kruskal-Wallis rank sum test |
| `friedman.test()` | Performs a Friedman’s test |
\
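As a flavour of how these functions are called, here is a minimal, purely illustrative sketch (not run) of fitting a Poisson GLM; `counts`, `x` and `dat` are hypothetical and not part of any dataset used in this Chapter:
```{r glm-sketch, echo=TRUE, eval=FALSE, collapse=TRUE}
# generalised linear model with a Poisson error structure (hypothetical data)
pois_glm <- glm(counts ~ x, family = poisson(link = "log"), data = dat)
summary(pois_glm)
```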
## Exercise 6
```{block2, note-text6, type='rmdtip'}
Congratulations, you've reached the end of Chapter 6! Perhaps now's a good time to practice some of what you've learned. You can find an exercise we've prepared for you (and our solutions) on the course website.
```
```{r links, child="links.md"}
```