forked from andreamazzella/r4asme
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathr4asme11 correlated.Rmd
560 lines (408 loc) · 18.7 KB
/
r4asme11 correlated.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
---
title: "11: Analysis of correlated data"
subtitle: "R 4 ASME"
author: Author – Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
output: html_notebook
---
-------------------------------------------------------------------------------
## Contents
* Analysis of correlated data with Poisson and logistic regression
* robust standard errors
* Generalised Estimating Equations
* Random Effects models
-------------------------------------------------------------------------------
## 0. Packages and options
```{r message=FALSE, warning=FALSE}
# Load packages
library("haven")
library("magrittr")
library("summarytools")
library("survival")
library("miceadds") # robust standard errors
library("lme4") # RE
library("broom") # improve regression output
library("broom.mixed") # improve regression output for lme4
library("geepack") # GEE
library("tidyverse")
# Limit significant digits to 3, reduce scientific notation
options(digits = 3, scipen = 9)
```
-------------------------------------------------------------------------------
# Part 1: Poisson regression
These data are from a pneumococcal vaccine trial performed in Papua New Guinea, assessing the vaccine efficacy in preventing clinical episodes of pneumonia among children.
Each child might have more than one record, because each record represents an episode of pneumonia (or the last period of follow-up, without pneumonia). (This means that the dataset was produced with Stata's `stset` command).
```{r include=FALSE}
papua <- read_dta("pngnew.dta")
glimpse(papua)
```
Variables that we will use:
**Outcome*: `any` (episode of clinical pneumonia during this period)
**Exposure*: `vacc` (vaccination: 1 = placebo, 2 = vaccine)
**Cluster*: `id` (child)
**Time*:
-`timein` (date of entry in this follow-up period)
-`timeout` (date of exit from this follow-up period)
-`dob` (date of birth)
**Other*
- `sex` (1 = male, 2 = female)
- `anyprev` (0: no previous episodes of pneumonia, 1: any prev. episodes)
Label values. The Stata practical will ask you to calculate person-years; let's do it now.
```{r data_management}
papua %<>% mutate(sex = factor(sex, levels = c(1, 2), labels = c("male", "female")),
vacc = factor(vacc, levels = c(1, 2), labels = c("placebo", "vaccine")),
pyrs = as.numeric(timeout - timein) / 365.25) %>%
select(-"datevac")
summary(papua)
```
## 1. Explore the data format
Identify all records for child with ID 2921 and make sure you understand what each row represents.
```{r}
papua %>% filter(id == 2921)
```
## 2. Explore the numbers
```{r}
pap_summ <- papua %>% group_by(id) %>% summarise(episodes = sum(any),
vacc = max(as.numeric(vacc)))
pap_summ %$% ctable(episodes, vacc, headings = F, prop = "no")
```
A total of 1390 children, 671 of whom are vaccinated. 467 of them did not have any episodes of pneumonia. Two children had 11 episodes of pneumonia, and they were both unvaccinated.
Count the total number of episodes in each intervention arm.
```{r}
papua %>% group_by(vacc) %>% summarise(episodes = sum(any))
```
## 3. Prepare for cohort analysis
Create a survival object to do person-time calculations. Unlike `stset`, `Surv()` doesn't seem to require any special option to include repeated observations.
```{r}
#surv_papua <- papua %$% Surv(time = as.numeric(timein) / 365.25,
# time2 = as.numeric(timeout) / 365.25,
# event = any)
surv_papua <- papua %$% Surv(time = pyrs, event = any)
summary(surv_papua)
```
## 4. Incidence rates and HR (invalid)
Calculate incidence rates in the vaccine and placebo arms, and calculate a rate ratio (without accounting for within-child clustering).
```{r}
pyears(surv_papua ~ vacc, data = papua, scale = 1) %>%
summary(n = F, rate = T, ci.r = T, scale = 100)
print("Rate ratio")
89.5/99.0
```
*Issue* I don't know how to calculate the HR automatically, or how to get a 95% CI, or a p-value.
From this (incorrect) analysis, the vaccine has a (mild) effect. (NB: from `stmh`: p = 0.02)
## 5a. Poisson ignoring clustering (invalid)
Let's demonstrate that ordinary Poisson regression, ignoring clustering, is also invalid.
Fit an ordinary Poisson regression model for the effect of vaccination.
`broom::tidy()` can be used to improve the output of a `glm()` – it's an alternative to `epiDisplay::idr.display()`.
```{r}
# Poisson model
pois_inv <- glm(any ~ vacc + offset(log(pyrs)),
family = "poisson",
data = papua)
# HR and 95% CI
tidy(pois_inv, exponentiate = T, conf.int = T)
?tidy.glm
# log(SE)
summary(pois_inv)
```
HR 0.9 (0.83, 0.98), Wald's p = 0.02
HR is the same as the one calculated with `pyears()`.
SE of log(HR) = 0.04235
## 5b. Poisson with robust standard errors
Now fit again the Poisson model, but take clustering into account by computing robust standard errors. How does this change the estimations of the vaccine efficacy?
We can use function `glm.cluster()` from package {miceadds}. It's similar to `glm()`, and it includes a cluster option (careful – variable needs to be put in quotes)
```{r}
pois_rob <- glm.cluster(papua,
any ~ vacc + offset(log(pyrs)),
cluster = "id",
family = "poisson")
# HR
coef(pois_rob) %>% exp()
# 95% CI
confint(pois_rob) %>% exp()
# SE of log(HR)
summary(pois_rob)
```
HR 0.90: same as above
95% CI: 0.80-1.02 (wider)
SE of log(HR): 0.061
p = 0.10: only weak evidence for an association now.
## 6. Poisson with random effects
Now use a random effects model to account for within-child clustering.
We use `lme4::glmer()`. The overall syntax is the same as `glm()`; you specify the clustering factor like this: `+ (1|clustering_factor)`. (Statistically, this is a "scalar random effect term": it generates 1 random effect per cluster).
The output can be much simplified with `broom.mixed::tidy()`.
```{r}
# Fit model
pois_re <- glmer(any ~ vacc + offset(log(pyrs)) + (1|id),
data = papua,
family = "poisson")
# Output
tidy(pois_re,
conf.int = TRUE,
exponentiate = TRUE,
effects = "fixed")
```
HR: 0.89 (0.78-1.01) (p = 0.06)
SE: 0.057
*Minor issue* Stata also provides a measure of clustering, theta, and a LRT on the evidence of clustering. Not sure how to extract the theta from the `glmer()` output.
(Stata - theta: 0.76, LRT: p < 0.001)
## 7. Vaccine efficacy
In Stata, this is `xtpoisson` instead of `streg, shared(cluster)`. `xtpoisson` estimates a parameter α, whilst `streg` estimates a parameter θ (but their values are the same).
What is the estimated vaccine efficacy, with an appropriate 95% CI?
```{r}
# Values from regression model
rate_ratio <- 0.886
rr_conf_low <- 0.780
rr_conf_high <- 1.006
# Calculate VE
(vaccine_efficacy <- (1 - rate_ratio) * 100)
ve_conf_low <- (1 - rr_conf_high) * 100
(if (ve_conf_low > 0) ve_conf_low else ve_conf_low <- 0)
(ve_conf_high <- (1 - rr_conf_low) * 100)
```
## 8. Age time-scale
Now set the time-scale as "age", and refit the Poisson model with random effects.
```{r}
surv_papua_age <- papua %$% Surv(time = as.numeric(timein) / 365.25,
time2 = as.numeric(timeout) / 365.25,
event = any,
origin = as.numeric(dob) / 365.25)
```
*-----*
*ISSUE* How do I use the new surv object in the `glmer()` function?
*-----* Do I need to offset for something other than pyrs?
```{r}
# Fit model
pois_re_age <- glmer(any ~ vacc + offset(log(pyrs)) + (1|id),
data = papua,
family = "poisson")
# Output
tidy(pois_re_age,
conf.int = TRUE,
exponentiate = TRUE,
effects = "fixed")
```
>Stata
```{stata}
streg i.vacc, dist(exp) frailty(gamma) shared(id) forceshared
```
## 9. Random effects Poisson model with covariates
Using the survival object with follow-up timescale, fit a random effects model to assess vaccine efficacy controlling for age at the start of each period, and sex.
Let's first calculate age at start, and divide it into categories.
```{r}
# Calculate age at start
papua %<>% mutate(age_start = as.numeric(timein - dob)/365.25)
# Summarise age at start
summary(papua$age_start)
ggplot(papua, aes(age_start)) + geom_histogram()
```
The age at start range from 0.21 to 7.44 years. In the Stata practical, they decide to categorise this in 1-year groups, except the children aged 4 years or more, who are grouped togther.
```{r}
# Categorise age at start
papua %<>% mutate(agegrp = cut(age_start,
breaks = c(0, 1, 2, 3, 4,+Inf),
labels = c("0-1yr,", "1-2yr", "2-3yr", "3-4yr", ">=4yr")))
# Check it worked
papua %>% group_by(agegrp) %>% summarise(count = n(),
min_age = min(age_start),
max_age = max(age_start))
```
We can now fit the RE model with sex and age as (categorical) covariates.
Does this change the estimates compared with Q6? If so, why?
```{r}
# Fit model
pois_re_age <- glmer(any ~ vacc + agegrp + sex + offset(log(pyrs)) + (1|id),
data = papua,
family = "poisson")
# Output
tidy(pois_re_age,
conf.int = TRUE,
exponentiate = TRUE,
effects = "fixed")
```
Effect of vaccine vs placebo:
Adjusted: HR 0.91 (0.82-1.02), p = 0.10
Crude (Q6): HR 0.89 (0.78-1.01), p = 0.06
Controlling for these baseline factors hasn't changed much the estimates – which makes sense, because this is a randomised trial. You can however still check for baseline differences if you want:
```{r}
papua %>% filter(anyprev == 0) %$% ctable(sex, vacc, prop = "c")
papua %>% filter(anyprev == 1) %$% ctable(agegrp, vacc, prop = "c")
```
-------------------------------------------------------------------------------
# Part 2: Logistic regression
These data are from a study on household contacts of tubercolosis cases. We will assess the effect of duration of cough in the index case on the odds of positive Mantoux test in the contacts. Since there is no time element, we will use logistic regression.
## 10. Import and explore data
```{r include=FALSE}
tb <- read_dta("hhtb.dta") %>% mutate_if(is.labelled, as_factor)
glimpse(tb)
```
Variables that we will use:
**Outcome*: `mantoux` (tuberculin test result: 0 = negative, 1 = positive)
**Exposure*: `cough` (duration of cough in index case: 1 = <2 months, 2 = >=2 months)
**Cluster*: `id` (household, so = index case)
**Other*
- `hiv` (HIV status of index case: 1 = negative, 2 = positive)
- `agegrp` (age of contact, in years)
There are value labels in Stata, I'm not sure why haven can't import them. Let's add them, and remove variables we don't need. I'll add an "ix" to variables that relate to the index case because I get confused otherwise.
```{r}
tb %<>% mutate(ix_cough = factor(cough, levels = c(1, 2), labels = c("<2mo", ">=2mo")),
ix_hiv = factor(hiv, levels = c(1, 2), labels = c("neg", "pos"))) %>%
select(-c("smear1", "crowding", "intimacy", "tbsite", "cavit", "cough", "hiv"))
summary(tb)
```
## 11. Explore clusters by HIV status of index case
Summarise the dataset by index case to explore the distribution of number of contacts stratified by HIV status of the index case.
```{r}
# Create summary data
tb_summ <- tb %>% group_by(id) %>%
summarise(n_contacts = n(),
index_HIV = mean(as.numeric(ix_hiv)))
# Crosstabulate
tb_summ %$% ctable(n_contacts, index_HIV, prop = "n")
```
The dataset contains a total of 70 index cases, 28 of whom were HIV negative and 42 HIV positive.
8 index cases had only 1 contact in their household; 9 had 2, etc.
## 12. Explore clusters by cough of index case
Do the same but stratifying by duration of cough in the index case.
```{r}
# Create summary data
tb_summ <- tb %>% group_by(id) %>%
summarise(n_contacts = n(),
index_cough = mean(as.numeric(ix_cough)))
# Crosstabulate
tb_summ %$% ctable(n_contacts, index_cough, prop = "n")
```
32 index cases had a cough for 2 months or less, 26 for more than that; for 12, this information had not been recorded.
## 13. Crude analysis ignoring clustering (invalid)
Examine the distribution of positive Mantoux among contacts by the duration of cough in the index case, ignoring any clustering.
What's the OR and 95% CI? χ² p-value? What would you conclude?
```{r}
tb %$% ctable(mantoux, ix_cough, prop = "c", OR = T, chisq = T)
```
Ignosing clustering, 67% of positive people had an index case who had a cough for more than 2 months.
OR 1.78 (1.06-3.01), χ² p = 0.04
With this (invalid) analysis, there is evidence of an association between being a contact of someone with cough of long duration and testing positive for TB.
## 14. Logistic regression with robust standard errors
Fit a logistic regression model that accounts for clustering by calculating robust standard errors.
What conclusions do you derive? Compare this with the above output.
```{r}
# Fit the model
logit_rob <- glm.cluster(tb,
mantoux ~ ix_cough,
cluster = "id",
family = "binomial")
# HR
coef(logit_rob) %>% exp()
# 95% CI
confint(logit_rob) %>% exp()
# SE of log(HR)
summary(logit_rob)
```
OR 1.78 (0.90-3.52), p = 0.10: OR is the same, but CI is wider, and there is less evidence of an association.
## 15. Logistic regression with GEE
Now fit the same logistic model but with Generalised Estimating Equations approach, with robust standard errors and an exchangeable correlation matrix.
Compare the results with the ones above.
We can do this with `geepack::geeglm()`. The correlation matrix type goes into the "corstr" option. The standard errors are automatically calculated as robust.
Note that this is a bit fiddly because the function requires the explanatory variables to not contain any missing data, so you need to `drop_na()` your dataset.
```{r}
# Fit GEE logistic model
? geeglm
tb_gee <- geeglm(mantoux ~ ix_cough,
data = drop_na(tb, ix_cough),
id = id,
family = "binomial",
corstr = "exchangeable")
# Output
tidy(tb_gee,
conf.int = TRUE,
exponentiate = TRUE)
```
OR 1.88 (0.95-3.71)
p = 0.07
## 16. Logistic regression with RE and quadrature check
Now do the same but with a random effects model, and do a LRT of rho = 0. Is there evidence of variation between households?
(*Technical note*: there is now a new option, nAGQ = 12. This is because, by default, glmer() uses the Laplace approximation to the log likelihood (1 point per axis), whilst Stata uses the Gauss-Hermite quadratic approximation, with 12 points per axis; this is why you need to use `quadchk` in Stata with this model, so that you can ensure that this quadrature reasonable. Changing nAGQ slightly changes your CI and p-value. For some reason, in Stata the RE model in Q9 does not use quadrature, so even if you try and use `quadchk` it doesn't let you. Is it because that model has covariates? Or because it's a Poisson model? I don't know.)
*Issue*: as earlier, I don't know how to do a LRT of rho = 0.
```{r}
# Fit model
logit_re_age <- glmer(mantoux ~ ix_cough + (1|id),
data = tb,
family = "binomial",
nAGQ = 12)
# Output
tidy(logit_re_age,
conf.int = TRUE,
exponentiate = TRUE,
effects = "fixed")
```
OR 2.19 (0.94-5.08), p = 0.07.
The OR estimate is higher than that from the GEE analysis.
rho = ? (from Stata: 0.25)
LRT on rho=0, p = ? (from Stata: 0.001)
Now check the reliability of these estimates with a quadrature check.
I haven't found a function that's the exact equivalent to `quadchk()` in Stata. What I can do is fit two more models with different `nAGQ` values, and checking the difference in the estimates yourself. If the relative differences are < 0.01, the estimates from your model are reasonably reliable.
```{r}
# Original model
logit_re_age %>% fixef()
# Model with 8 quadrature points
glmer(mantoux ~ ix_cough + (1|id),
data = tb,
family = "binomial",
nAGQ = 8) %>% fixef()
# Model with 16 quadrature points
glmer(mantoux ~ ix_cough + (1|id),
data = tb,
family = "binomial",
nAGQ = 16) %>% fixef()
```
All the relative differences are less than 0.01, suggesting that the estimates from the RE model are reasonably reliable.
## 17. Logistic regression, RE, with covariates
Fit a random effects logistic model to estimate the odds of positive Mantoux according to cough in the index case, controlling for:
* HIV status of the index case
* age of the household contact
* household clustering
How would you summarise and interpret your results?
```{r}
logit_re_multi <- glmer(mantoux ~ ix_cough + ix_hiv + agegrp + (1|id),
data = tb,
family = "binomial",
nAGQ = 12)
tidy(logit_re_multi,
conf.int = TRUE,
exponentiate = TRUE,
effects = "fixed")
```
After adjusting for index case HIV, age group, and clustering, the odds of positive Mantoux in contacts of an index case with a cough lasting more than 2 months are 1.88 times higher than in contacts of indec cases with shorter cough duration. (CI 0.80-4.41). There is not good evidence for this association (p = 0.14)
Estimates from Stata are slightly different (OR 1.85, 0.79-4.31, p = 0.15)
## 18. Logistic regression, GEE, with covariates
Do the same as Q17 but with GEE, and compare the results.
```{r}
logit_gee_covar <- geeglm(mantoux ~ ix_cough + ix_hiv + factor(agegrp),
data = drop_na(tb, ix_cough),
id = id,
family = "binomial",
corstr = "exchangeable")
# Output
tidy(logit_gee_covar,
conf.int = TRUE,
exponentiate = TRUE)
```
OR 1.64 (0.82, 3.30)
[OR 1.64 (0.81, 3.32) in Stata]
As GEEs are not based on likelihood, we can't use LRTs. Instad, we can use Wald tests of simple and composite linear hypotheses. The equivalent to Stata's `testparm` is using `anova()` to compare a model with age and a model without age.
What hypotheses are being tested? What do you conclude?
```{r}
# Model without age
logit_gee_covar_2 <- geeglm(mantoux ~ ix_cough + ix_hiv,
data = drop_na(tb, ix_cough),
id = id,
family = "binomial",
corstr = "exchangeable")
# ANOVA
anova(logit_gee_covar, logit_gee_covar_2)
```
Result: χ² = 31, p < 0.001
[Stata] χ² = 30.4, p < 0.001
H0: there is no association between age and TB, after adjusting for the index case's HIV status and cough duration.
H1: there is such an association
There is very strong evidence against no association.
-------------------------------------------------------------------------------