This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
/
r4asme07 poisson.Rmd
300 lines (226 loc) · 9.84 KB
/
r4asme07 poisson.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
---
title: "7: Poisson regression in cohort studies"
subtitle: "R 4 ASME"
authors: Authors – Andrea Mazzella [(GitHub)](https://github.com/andreamazzella) & Lakmal Mudalige
output: html_notebook
---
-------------------------------------------------------------------------------
## Contents
* Poisson regression models
* simple
* adjusting for covariates
* checking for interaction
* checking for linear associations
-------------------------------------------------------------------------------
## 0. Packages and options
```{r message=FALSE, warning=FALSE}
# Load packages
library("haven")
library("magrittr")
library("survival")
library("summarytools")
library("epiDisplay")
library("tidyverse")
# Limit significant digits to 3, reduce scientific notation
options(digits = 3, scipen = 9)
```
-------------------------------------------------------------------------------
## 1. Data import and exploration
Get ondrate.dta in the same folder as this .Rmd notebook, import it, and explore it.
This dataset contains information from a cohort study on ~1500 people living in Nigeria, on the effect of onchocerciasis (river blindness, a parasitic infection) on optic nerve disease.
```{r}
ond <- read_dta("ondrate.dta") %>% mutate_if(is.labelled, as_factor)
ond
```
```{r include=FALSE}
glimpse(ond)
```
```{r}
table(ond$sex, useNA = "ifany")
```
* Outcome: `disc2` (optic nerve disease)
* Exposure: `mfpermg` (microfilariae per mg, a measure of severity of the onchocercal infection)
* Confounders: `age` (at start), `sex`
* Time:
- `pyrs` (person-years, already calculated)
- `start` (date of entry into study)
- `end` (date of exit from study)
Thankfully, this dataset is mostly labelled. We only need to do a tiny bit of data management.
```{r}
ond %<>%
mutate(id = as.factor(id),
disc2 = as.integer(disc2),
sex = as.factor(case_when(sex == 1 ~ "male", sex == 2 ~ "female")))
summary(ond)
```
-------------------------------------------------------------------------------
## 2. Calculation of incidence rates
Calculate incidence rates of Optic Nerve Disease by age, sex, and microfilarial load.
```{r}
# Create survival object
ond_surv <- ond %$% Surv(time = as.numeric(start) / 365.25,
time2 = as.numeric(end) / 365.25,
event = disc2)
# Rates by age
pyears(ond_surv ~ age, ond, scale = 1) %>% summary(n = F, rate = T, ci.r = T, scale = 1000)
# Rates by sex
pyears(ond_surv ~ sex, ond, scale = 1) %>% summary(n = F, rate = T, ci.r = T, scale = 1000)
# Rates by microfilarial load
pyears(ond_surv ~ mfpermg, ond, scale = 1) %>% summary(n = F, rate = T, ci.r = T, scale = 1000)
```
We can see that the incidence increases dramatically with age, it's about the same among females and males, and increases when microfilarial load is above 10.
-------------------------------------------------------------------------------
## 3. Poisson regression
Now perform the same three analyses but with Poisson regression.
Poisson regression in R uses the `glm()` formula – but it also needs to include `offset(log(<pyears>))` as a covariate. This is for mathematical reasons.
{epiDisplay} has a function to simplify the Poisson regression input: `idr.display()`.
```{r}
# Age
glm(disc2 ~ age + offset(log(pyrs)),
family = "poisson",
data = ond) %>% idr.display()
```
You can verify that these rate ratios are the same as those derived by classical analysis (by dividing the rates).
```{r}
# Sex
glm(disc2 ~ sex + offset(log(pyrs)),
family = "poisson",
data = ond) %>% idr.display()
# Microfilarial load
glm(disc2 ~ mfpermg + offset(log(pyrs)),
family = "poisson",
data = ond) %>% idr.display()
```
-------------------------------------------------------------------------------
## 4. Potential confounders
Is either age or sex associated with microfilarial load? Cross-tabulate them and check the percentages. (I'm using a function from summarytools, you can also use table or tabpct)
```{r}
ond %$% ctable(age, mfpermg)
ond %$% ctable(sex, mfpermg)
```
Older people tend to have higher infection markers: age is a potential confounder.
The sex distribution is balanced, so sex is unlikely to be a confounder.
-------------------------------------------------------------------------------
## 5. Poisson regression with covariates
Adjust for age, sex, and both with Poisson regression. Is there any indication of confounding?
```{r}
# Adjusting for age only
glm(disc2 ~ mfpermg + age + offset(log(pyrs)),
family = "poisson",
data = ond) %>% idr.display()
# Adjusting for sex only
glm(disc2 ~ mfpermg + sex + offset(log(pyrs)),
family = "poisson",
data = ond) %>% idr.display()
# Adjusting for both age and sex
pois_full <- glm(disc2 ~ mfpermg + age + sex + offset(log(pyrs)),
family = "poisson",
data = ond)
idr.display(pois_full)
```
The age-adjusted rate ratios for microfilarial loads are reduced: it is possible that age is a confounder. The RRs remain almost the same after adjusting for sex.
-------------------------------------------------------------------------------
## Bonus: Forest plot
Package sjPlot has a great function that creates a Forest plot of results from a regression model quite easily! `plot_model()`, that takes as its main argument a regression model. Very customisable, for more info look at [this vignette](https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_model_estimates.html).
```{r}
sjPlot::plot_model(pois_full,
group.terms = c(1, 1, 2, 2, 2, 3),
show.values = T, value.offset = 0.35,
show.p = F,
title = "Multivariate Poisson regression")
```
-------------------------------------------------------------------------------
## 6. Poisson and Interaction
As logistic regression, Poisson models also assume that there is no interaction, so you need to check this assumption with a likelihood ratio test. You do this exactly like you do with logistic regression.
Check for interaction between microfilarial load and 1. age, 2. sex.
```{r}
# Interaction between MF and age
mf_age_simple <- glm(disc2 ~ mfpermg + age + offset(log(pyrs)),
family = "poisson",
data = ond)
mf_age_intera <- glm(disc2 ~ mfpermg * age + offset(log(pyrs)),
family = "poisson",
data = ond)
lrtest(mf_age_intera, mf_age_simple)
```
No evidence of interaction (however, there are 6 degrees of freedom – test might be underpowered)
```{r}
# Interaction between MF and sex
mf_sex_simple <- glm(disc2 ~ mfpermg + sex + offset(log(pyrs)),
family = "poisson",
data = ond)
mf_sex_intera <- glm(disc2 ~ mfpermg * sex + offset(log(pyrs)),
family = "poisson",
data = ond)
lrtest(mf_sex_intera, mf_sex_simple)
```
No evidence of interaction between MF and sex.
-------------------------------------------------------------------------------
## 7. Poisson and linearity
Assess for a linear relationship with age by treating it as numerical (as opposed to categorical) in a simple Poisson model.
You just need to use `as.integer()` with variable age. Unfortunately doing this within the same code flow as `idr.display()` breaks it, so you have two options:
1. extract the RR and 95% CI manually
2. use as.integer() outside the glm() function
```{r}
# Age as numerical (option 1)
age_linear <- glm(disc2 ~ as.integer(age) + offset(log(pyrs)),
family = "poisson",
data = ond)
# Get HRs by exponentiating the log(HR)s
age_linear$coefficients %>% exp()
confint(age_linear) %>% exp()
# Option 2
# ond$age_int <- as.integer(ond$age)
# glm(disc2 ~ age_int + offset(log(pyrs)), family = "poisson", data = ond) %>% idr.display()
# Age as categorical
age_catego <- glm(disc2 ~ age + offset(log(pyrs)),
family = "poisson",
data = ond)
# LRT
lrtest(age_catego, age_linear)
```
There is no evidence against a linear increase in the log(rate) from one age group to the next, so it's okay to use age as a continuous variable.
-------------------------------------------------------------------------------
## 8. Back to Whitehall
Open whitehal.dta and use Poisson regression to examine the effect of job grade on cardiac mortality, adjusting for ageband and smoking status simultaneously.
```{r include=FALSE}
whitehall <- read_stata("whitehal.dta")
glimpse(whitehall)
```
The whitehall dataset needs massive data management (see session 6) but we'll now do the bare minimum for these Poisson models:
- age bands as a factor variable; (NB: I suspect this is done via an exhausting Lexis expansion in the Stata solutions, but age at entry will do...)
- job grade as a binary factor variable;
- number of person-years for each person;
- smoking status as a 3-level factor variable.
```{r}
# Explore age distribution
whitehall %>% ggplot(aes(agein)) + geom_histogram()
# Categorise age in 5-year bands
whitehall$ageband <- cut(whitehall$agein, seq(40, 70, 5)) %>% as.factor()
# Check it worked ok
summary(whitehall$ageband)
# Factorise job grade
whitehall$grade <- as.factor(whitehall$grade)
# Create person-years of follow up
whitehall$pyrs <- whitehall %$% as.numeric(timeout - timein)
# Group all current smokers together
whitehall$smok3 <- ifelse(whitehall$smok == 1, 1,
ifelse(whitehall$smok == 2, 2,
3)) %>% as.factor()
```
Now we can fit the models.
```{r}
# Model including job grade
with_grade <- glm(chd ~ offset(log(pyrs)) + grade + smok3 + ageband,
family = poisson(),
data = whitehall)
idr.display(with_grade)
# Model without job grade
wout_grade <- glm(chd ~ offset(log(pyrs)) + smok3 + ageband,
family = poisson(),
data = whitehall)
idr.display(wout_grade)
# LRT
lrtest(with_grade, wout_grade)
```
--------------------------------------------------------------------------------