---
title: "2: Review of logistic regression"
subtitle: "R 4 ASME"
author: Author – Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
output: html_notebook
---
-------------------------------------------------------------------------------
## Contents
This is a summary of the four SME topics on logistic regression.
For more material, I have translated the SME practicals; you can access them [here](https://github.com/andreamazzella/R4SME).
-------------------------------------------------------------------------------
## Data management
Before using another .Rmd file or another .R script, I recommend restarting the R session. You can do this by clicking on `Session` in the menu and then selecting `Restart R`; or with the keyboard shortcut `Ctrl + Shift + F10`.
This gives you a "fresh start": it removes all objects from the environment and, most importantly, unloads the packages used in the previous .Rmd!
This is preferable to `rm(list = ls())`, which only removes the objects from the environment.
```{r message=FALSE, warning=FALSE}
# Load packages
library("haven")
library("magrittr")
library("summarytools")
library("epitools")
library("pubh")
library("rstatix")
library("tidyverse")
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)
```
```{r}
# Data import
mwanza <- read_dta("mwanza.dta")
# Data tidying
# Recode missing values
mwanza %<>% mutate(
  ud = na_if(ud, 9),
  rel = na_if(rel, 9),
  bld = na_if(bld, 9),
  npa = na_if(npa, 9),
  pa1 = na_if(pa1, 9),
  eth = na_if(eth, 9),
  inj = na_if(inj, 9),
  msta = na_if(msta, 9),
  skin = na_if(skin, 9),
  fsex = na_if(fsex, 9),
  usedc = na_if(usedc, 9)
)
# Create a vector of categorical variable names
categ <- c("comp", "case", "ed", "eth", "rel", "msta", "bld", "inj", "skin",
           "fsex", "npa", "pa1", "usedc", "ud", "ark", "srk", "ed2")
# Make them all categorical
mwanza[categ] <- lapply(mwanza[categ], as.factor)
# Create a 3-level age group variable (age2) from age1 (coded 1 to 6)
mwanza %<>%
  mutate(age2 = as.factor(
    case_when(
      age1 == "1" | age1 == "2" ~ "15-24",
      age1 == "3" | age1 == "4" ~ "25-34",
      age1 == "5" | age1 == "6" ~ "35+")))
summary(mwanza)
```
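The eleven `na_if()` calls in the chunk above all apply the same rule, so they could be collapsed with `across()`. The sketch below shows the idea on a small made-up tibble (`toy` and its values are hypothetical, not the Mwanza data); it assumes, as above, that 9 is the code for a missing value.

```{r}
library(dplyr)

# Hypothetical toy data: 9 stands for "missing", as in the mwanza recoding
toy <- tibble(id  = 1:4,
              ud  = c(1, 9, 2, 1),
              rel = c(9, 1, 1, 2))

# Apply na_if(., 9) to every listed column in one go
toy_clean <- toy %>%
  mutate(across(c(ud, rel), ~ na_if(.x, 9)))

toy_clean
```

Listing the columns explicitly, rather than recoding every column, avoids turning a genuine value of 9 (for example in an ID or a count) into `NA`.
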
# 1. Planning your model
_Without coding_, write a logistic regression model to investigate the association between:
- HIV status (outcome)
- lifetime number of sexual partners (`npa`) as a 4-level factor
Then build on this model by including schooling (`ed2`) as a binary variable.
# 2. Logistic regression
## 2a. Tabulation
Obtain a frequency table of `npa`.
(NB: possible solutions are at the end of the notebook)
```{r}
```
What is the most common number of lifetime sexual partners?
Cross-tabulate number of lifetime sexual partners with HIV status.
```{r}
```
## 2b. Unadjusted logistic regression
Fit a logistic model to estimate the magnitude of association between `npa` (as a factor) and HIV status.
```{r}
```
Is there evidence of association?
## 2c. Change baseline group
By default, the baseline level of comparison is the first level of the factor, which here is the lowest value. You might want to use the most prevalent level of `npa` as the baseline instead, so that ORs are calculated relative to that level.
In order to do this, you need to relevel the factor. This is much more verbose than Stata!
```{r}
# Relevel the factor
mwanza$npa <- factor(mwanza$npa,
                     levels = c("2", "1", "3", "4"))
# Logistic regression (unchanged)
glm(case ~ npa,
    family = "binomial",
    data = mwanza) %>%
  epiDisplay::logistic.display()
# Relevel the factor back, if you want
# mwanza$npa <- factor(mwanza$npa,
#                      levels = c("1", "2", "3", "4"))
```
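If you only need to change the baseline, base R's `relevel()` is a shorter option: it moves one level to the front and keeps the others in their original order. A minimal sketch on a made-up factor (`npa_toy` is hypothetical, not part of the dataset):

```{r}
# Hypothetical factor with the same four levels as npa
npa_toy <- factor(c("1", "2", "2", "3", "4", "2"))
levels(npa_toy)

# Move level "2" to the front, so it becomes the baseline in a model
npa_toy2 <- relevel(npa_toy, ref = "2")
levels(npa_toy2)
```

With the real data this would be `mwanza$npa <- relevel(mwanza$npa, ref = "2")`; `forcats::fct_relevel()` is a tidyverse alternative.
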
## 2d. Logistic model with confounding
Now also include `age1`, treated as a factor, in your model (keeping level 2 of `npa` as the baseline).
```{r}
```
What is your conclusion?
## 2e. Summary table
Make a table in Excel or by hand with the results of the analyses in section 2:
- crosstabulation of cases and controls according to npa
- OR (unadjusted and adjusted) with 95% CI
- p-values
# 3. More on confounding + intro to interaction
## 3a. School
Now check whether the association of HIV with `npa` and `age1` is confounded by attending school (`ed2`).
```{r}
```
## 3b. Interaction
This is how you fit a model with interaction, using `*` (equivalent to `##` in Stata), and how you run a likelihood ratio test (LRT). What do you conclude?
```{r}
# Model with interaction
logit_inter <- glm(case ~ npa * ed2 + age1,
                   family = "binomial",
                   data = mwanza)
# Model without interaction
logit_without <- glm(case ~ npa + ed2 + age1,
                     family = "binomial",
                     data = mwanza)
# Likelihood ratio test
epiDisplay::lrtest(logit_without, logit_inter)
# Note that anova() gives you the same χ² statistic and df
# anova(logit_without, logit_inter)
```
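To see what the LRT is doing, the statistic can also be computed by hand: it is the difference in residual deviance between the two nested models, compared with a χ² distribution whose degrees of freedom equal the number of extra parameters. The sketch below uses simulated data rather than `mwanza`; all object names and values are made up for illustration.

```{r}
set.seed(42)

# Simulated data: binary outcome, a 3-level exposure and a binary modifier
n_sim <- 500
sim <- data.frame(
  exposure = factor(sample(c("a", "b", "c"), n_sim, replace = TRUE)),
  modifier = factor(sample(c("no", "yes"), n_sim, replace = TRUE))
)
sim$outcome <- rbinom(n_sim, size = 1, prob = 0.3)

# Nested models: without and with the interaction term
fit_without <- glm(outcome ~ exposure + modifier, family = "binomial", data = sim)
fit_inter   <- glm(outcome ~ exposure * modifier, family = "binomial", data = sim)

# LRT statistic = difference in deviance; df = number of extra parameters
lrt_stat <- deviance(fit_without) - deviance(fit_inter)
lrt_df   <- df.residual(fit_without) - df.residual(fit_inter)
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(statistic = lrt_stat, df = lrt_df, p = lrt_p)

# anova() with test = "Chisq" reports the same statistic, df and p-value
lrt_anova <- anova(fit_without, fit_inter, test = "Chisq")
lrt_anova
```

Both models must be fitted to the same observations for the comparison to be valid.
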
# 4. Interaction with more than 2 levels
## 4a. An unexpected issue
Try fitting a model including an interaction between `npa` and `age1` and have a look at the results. What happens to the adjusted ORs?
(NB: unlike Stata, if you tried an LRT, R would return a result without any warning, even though the result would not be meaningful!)
```{r}
```
Cross-tabulate `npa` and `age1`. What's the problem and how can we solve it?
```{r}
```
## 4b. Solving the issue
In order to fix the issue of data sparsity, we can combine levels 3 and 4 of `npa`.
```{r}
# Create a new variable, relevel and label
mwanza %<>%
  mutate(partners = factor(
    case_when(npa == "1" ~ "<=1",
              npa == "2" ~ "2-4",
              npa == "3" | npa == "4" ~ ">=5"),
    levels = c("2-4", "<=1", ">=5")
  ))
# Check it worked well
mwanza %$% table(npa, partners, useNA = "ifany")
```
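The same recoding could also be done with `forcats` (part of the tidyverse), which some find easier to read than `case_when()`. This is an alternative sketch on a made-up factor, not a replacement for the chunk above; `npa_demo` and `partners_demo` are hypothetical names.

```{r}
library(forcats)

# Hypothetical factor with the same four levels as npa, plus a missing value
npa_demo <- factor(c("1", "2", "3", "4", "2", NA))

# Merge levels 3 and 4 and rename the others
partners_demo <- fct_collapse(npa_demo,
                              "<=1" = "1",
                              "2-4" = "2",
                              ">=5" = c("3", "4"))

# Put "2-4" first so that it is the baseline in a model
partners_demo <- fct_relevel(partners_demo, "2-4")

# Check the mapping, as in the chunk above
table(npa_demo, partners_demo, useNA = "ifany")
```
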
We can then use this new variable, `partners`, to fit a model with interaction and compare it to a model without interaction using an LRT.
```{r}
```
# 5. Other solutions
What other possible workarounds can you come up with for the issue identified in 4a?
-------------------------------------------------------------------------------
# Solutions
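*Solution 1*

One possible way of writing the two models is sketched here. The notation is an assumption, not taken from the SME notes: $\pi$ is the probability of being a case, $x_k$ is an indicator equal to 1 if `npa` is at level $k$, and level 1 is the baseline.

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$

Adding schooling, with $e$ an indicator for the non-baseline level of `ed2`:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \gamma e$$

In both models, $\exp(\beta_k)$ is the OR comparing level $k$ of `npa` with the baseline level; in the second model it is adjusted for schooling.
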
```{r 2a.1 solution}
mwanza %$% table(npa, useNA = "ifany")
```
```{r 2a.2 solution}
mwanza %$% ctable(npa, case, prop = "c", useNA = "no")
```
```{r 2b.1 solution}
glm(case ~ npa,
    family = "binomial",
    data = mwanza) %>%
  epiDisplay::logistic.display()
```
*Solution 2b.2*:
There is very strong evidence of association between HIV status and number of sexual partners (LRT p < 0.001).
```{r 2d solution}
glm(case ~ npa + age1,
    family = "binomial",
    data = mwanza) %>%
  epiDisplay::logistic.display()
```
*Solution 2d*
The OR estimates have slightly changed, showing the confounding effect of age. Even after accounting for age, however, there is still very strong evidence for an association between number of sexual partners and HIV status.
*Solution 2e*
+----------------+-------------+-------------+------------------------+----------------------+
| | HIV+ (col%) | HIV- (col%) | Unadjusted OR (95% CI) | Adjusted OR (95% CI) |
+----------------+-------------+-------------+------------------------+----------------------+
| 0-1 partners | 27 (15%) | 173 (31%) | 0.47 (0.29,0.75) | 0.51 (0.31,0.82) |
+----------------+-------------+-------------+------------------------+----------------------+
| 2-4 partners | 92 (50%) | 277 (50%) | 1 (baseline group) | 1 (baseline group) |
+----------------+-------------+-------------+------------------------+----------------------+
| 5-9 partners | 40 (22%) | 83 (15%) | 1.45 (0.93,2.26) | 1.3 (0.82,2.05) |
+----------------+-------------+-------------+------------------------+----------------------+
| 10+ partners | 24 (13%) | 19 (3%) | 3.8 (1.99,7.26) | 4.75 (2.42,9.35) |
+----------------+-------------+-------------+------------------------+----------------------+
| LRT p-value | < 0.001 | < 0.001 |
+--------------------------------------------+------------------------+----------------------+
| Missing values | 28 (3.7%) |
+--------------------------------------------------------------------------------------------+
```{r 3a solution}
glm(case ~ npa + age1 + ed2,
    family = "binomial",
    data = mwanza) %>%
  epiDisplay::logistic.display()
```
*Solution 3a*
Even after adjusting for schooling, there is very strong evidence for an association between number of sexual partners and HIV status.
*Solution 3b*
There is no evidence of interaction between `npa` and `ed2` (LRT p = 0.92).
```{r 4a1 solution}
# Model with interaction
logit_inter2 <- glm(case ~ npa * age1,
                    family = "binomial",
                    data = mwanza)
epiDisplay::logistic.display(logit_inter2)
```
*Solution 4a1*
The interaction OR estimates for all levels when `npa` is 4 are extremely high, and their 95% CIs range from 0 to infinity.
```{r Solution 4a2}
mwanza %$% table(age1, npa, case, useNA = "ifany")
```
*Solution 4a2*
One of the possible intersections is empty (the dataset contains no HIV-positive women in age group 1 with 10+ lifetime partners).
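The effect of an empty cell can be reproduced on a tiny made-up dataset (all counts below are invented for illustration, and the setting is simpler than the interaction model above). When one exposure group contains no cases, the maximum likelihood estimate of its log OR drifts towards minus infinity, and `glm()` returns a very large coefficient with an enormous standard error.

```{r}
# Invented counts: the exposed group has 0 cases and 10 controls
sparse <- data.frame(
  exposed = factor(rep(c("no", "yes"), times = c(20, 10))),
  outcome = c(rep(c(1, 0), each = 10),   # unexposed: 10 cases, 10 controls
              rep(0, 10))                # exposed:    0 cases, 10 controls
)
table(sparse$exposed, sparse$outcome)

# glm() may warn about fitted probabilities of 0 or 1; silenced here
fit_sparse <- suppressWarnings(
  glm(outcome ~ exposed, family = "binomial", data = sparse)
)

# Log OR and its standard error for the exposed group
sparse_est <- coef(summary(fit_sparse))["exposedyes", c("Estimate", "Std. Error")]
sparse_est
```
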
```{r Solution 4b}
# Model with interaction
logit_inter3 <- glm(case ~ partners * age1,
                    family = "binomial",
                    data = mwanza)
epiDisplay::logistic.display(logit_inter3)
# Model without interaction
logit_without3 <- glm(case ~ partners + age1,
                      family = "binomial",
                      data = mwanza)
# Likelihood ratio test
epiDisplay::lrtest(logit_inter3, logit_without3)
```
*Solution 4b*
There is no evidence of interaction.
*Solution 5*
I'm not aware of a `lincom` equivalent in R.
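That said, a linear combination of coefficients can be computed by hand from `coef()` and `vcov()`. The sketch below uses simulated data with made-up names and effect sizes; it estimates the OR comparing level "c" with level "b" of a 3-level exposure, that is, the exponential of the difference between the two coefficients, with a Wald 95% CI.

```{r}
set.seed(1)

# Simulated data: 3-level exposure with an effect on a binary outcome
n_lc <- 600
lc <- data.frame(exposure = factor(sample(c("a", "b", "c"), n_lc, replace = TRUE)))
log_odds <- -1 + 0.5 * (lc$exposure == "b") + 1.2 * (lc$exposure == "c")
lc$outcome <- rbinom(n_lc, size = 1, prob = plogis(log_odds))

fit_lc <- glm(outcome ~ exposure, family = "binomial", data = lc)

# Contrast vector, in the same order as coef(fit_lc):
# 0 * intercept - 1 * exposureb + 1 * exposurec
contrast <- c("(Intercept)" = 0, "exposureb" = -1, "exposurec" = 1)

# Point estimate and standard error of the linear combination
est <- sum(contrast * coef(fit_lc))
se  <- sqrt(drop(t(contrast) %*% vcov(fit_lc) %*% contrast))

# Back-transform to the OR scale with a Wald 95% CI
or_c_vs_b <- exp(c(OR    = est,
                   lower = est - qnorm(0.975) * se,
                   upper = est + qnorm(0.975) * se))
or_c_vs_b
```

Refitting the model with "b" as the baseline should give the same OR for "c", which is a quick way to check the contrast vector. The `glht()` function in the `multcomp` package is one packaged option for general linear combinations.
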
-------------------------------------------------------------------------------