# Final Words
```{r setup2vf, include=FALSE}
require(knitr)
require(kableExtra)
require(DiagrammeR)
require(DiagrammeRsvg)
require(rsvg)
library(magrittr)
library(svglite)
library(png)
require(nhanesA)
library(skimr)
library(jtools)
require(cobalt)
require(tableone)
require(Publish)
```
## Select variables judiciously
```{r role, echo = FALSE, out.width = "650px", fig.cap="Variable roles: A = exposure or treatment; Y = outcome; L = confounder; R = risk factor for Y; M = mediator; C = collider; E = effect of Y; I = instrument; u = unmeasured confounder; P = proxy of U; N = noise variable"}
knitr::include_graphics("images/role.png")
```
- Think about the role of each variable first
    - ideally, include confounders to reduce bias
    - consider including risk factors for the outcome for greater precision
    - instruments, colliders, mediators, effects of the outcome, and noise variables should be avoided
    - if a confounder is unmeasured, consider adding a proxy for it (with caution)
- If you do not have subject area expertise, talk to experts
- Pre-screen the candidate variables (a sketch follows the comment below), e.g., for
    - sparse binary variables
    - highly collinear variables
```{block, type='rmdcomment'}
Relying solely on a black-box ML method to identify the roles of variables in the relationship of interest can be dangerous.
```
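A minimal sketch of such pre-screening is given below. The data frame `covs` and the thresholds (5% prevalence, 0.9 correlation) are hypothetical and illustrative, not objects defined in this chapter or prescriptive cutoffs.

```{r prescreen, eval=FALSE}
# Hypothetical covariate data frame `covs`
# Flag sparse binary variables (rarest level below, say, 5%)
is.binary <- sapply(covs, function(x) length(unique(na.omit(x))) == 2)
sparse <- sapply(covs[is.binary], function(x) min(prop.table(table(x))) < 0.05)
names(covs[is.binary])[sparse]  # candidates to drop or collapse

# Flag highly collinear numeric pairs (|correlation| above, say, 0.9)
num <- sapply(covs, is.numeric)
cor.mat <- cor(covs[num], use = "pairwise.complete.obs")
which(abs(cor.mat) > 0.9 & upper.tri(cor.mat), arr.ind = TRUE)
```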
## Why SL and TMLE
### Prediction goal
```{r dag1, echo=FALSE, cache=TRUE}
g2 <- grViz("
digraph causal {
# Nodes
node [shape = circle]
y [label = 'Length of stay']
node [shape = box]
a [label = 'RHC']
b [label = 'age']
c [label = 'sex']
d [label = 'race']
e [label = 'education']
f [label = 'Disease.category']
g [label = 'Cancer']
h [label = 'Income']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
{b c d e f g h} -> y
a -> y
# Graph
graph [overlap = true, fontsize = 10]
}")
g2 %>% export_svg %>% charToRaw %>% rsvg %>% png::writePNG("images/dagpred.png")
```
```{r plot1, echo=FALSE}
knitr::include_graphics("images/dagpred.png")
```
- Assuming all covariates are measured, **parametric models** such as linear and logistic regression are very efficient, but they rely on strong assumptions. In real-world scenarios, it is often hard (if not impossible) to guess the correct specification of the right-hand side of the regression equation.
- Machine learning (ML) methods are very helpful for prediction goals. They are also helpful in **identifying complex functions** (non-linearities and non-additive terms) of the covariates (again, assuming they are measured).
- There are many ML methods, but their procedures differ substantially, and each comes with its own advantages and disadvantages. For a given real dataset, it is **hard to predict a priori which ML algorithm will perform best** for the problem at hand.
```{block, type='rmdcomment'}
The super learner is helpful in **combining strengths from various algorithms** and producing a single prediction column with **optimal statistical properties**.
```
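As a hedged illustration of fitting a super learner for this prediction goal: the data frame `dat`, its variable names, and the candidate library below are assumptions for illustration only, not objects defined in this chapter.

```{r slsketch, eval=FALSE}
library(SuperLearner)
# Hypothetical analytic data `dat` with the covariates from the DAG above
sl.fit <- SuperLearner(
  Y = dat$Length.of.stay,
  X = dat[, c("RHC", "age", "sex", "race", "education",
              "Disease.category", "Cancer", "Income")],
  family = gaussian(),
  SL.library = c("SL.glm", "SL.ranger", "SL.earth"),  # parametric + ML candidates
  cvControl = list(V = 10)                            # 10-fold cross-validation
)
sl.fit$coef  # data-adaptive weights given to each candidate learner
```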
### Causal inference
```{r dag2, echo=FALSE, cache=TRUE}
g2 <- grViz("
digraph causal {
# Nodes
node [shape = circle]
a [label = 'RHC']
y [label = 'Length of stay']
node [shape = box]
b [label = 'age']
c [label = 'sex']
d [label = 'race']
e [label = 'education']
f [label = 'Disease.category']
g [label = 'Cancer']
h [label = 'Income']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
{b c d e f g h} -> y
{b c d e f g h} -> a
a -> y
# Graph
graph [overlap = true, fontsize = 10]
}")
g2 %>% export_svg %>% charToRaw %>% rsvg %>% png::writePNG("images/dagci.png")
```
```{r plot2, echo=FALSE}
knitr::include_graphics("images/dagci.png")
```
- For causal inference goals (when we have a primary exposure of interest), machine learning methods are often misleading. This is primarily because they usually have no inherent mechanism for focusing on the **primary exposure** (RHC in this example), and instead treat the primary exposure like any other predictor.
- When using g-computation with ML methods, estimating the variance (with correct coverage) becomes a difficult problem. General-purpose procedures such as **robust SEs or bootstrap methods** are not supported by theory in this setting.
```{block, type='rmdcomment'}
This is where the TMLE method shines, thanks to its important **statistical properties (double robustness, good finite-sample properties)**.
```
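A minimal sketch of estimating the average treatment effect of RHC with the `tmle` package follows. As before, `dat` and its variable names are hypothetical; the exposure must be coded 0/1.

```{r tmlesketch, eval=FALSE}
library(tmle)
tmle.fit <- tmle(
  Y = dat$Length.of.stay,
  A = dat$RHC,  # binary primary exposure (1 = received RHC, 0 = not)
  W = dat[, c("age", "sex", "race", "education",
              "Disease.category", "Cancer", "Income")],
  family = "gaussian",
  Q.SL.library = c("SL.glm", "SL.ranger", "SL.earth"),  # outcome model learners
  g.SL.library = c("SL.glm", "SL.ranger")               # exposure model learners
)
tmle.fit$estimates$ATE  # doubly robust point estimate, variance, and 95% CI
```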
### Identifiability assumptions
However, causal inference requires satisfying identifiability assumptions (see below) before we can interpret association measures from statistical models causally. Many of these assumptions are not empirically testable. That is why it is extremely important to work with **subject area experts** to assess the plausibility of those assumptions in the given context.
```{block, type='rmdcomment'}
No ML method, no matter how fancy it is, can automatically produce estimates that can be directly interpreted as causal, unless the identifiability assumptions are properly taken into account.
```
| Assumption | Notation | Plain-language meaning |
|-|-|-|
| Conditional exchangeability | $Y(1), Y(0) \perp A \mid L$ | Treatment assignment is independent of the potential outcomes, given covariates |
| Positivity | $0 < P(A=1 \mid L) < 1$ | Every subject is eligible to receive either treatment level, given covariates |
| Consistency | $Y = Y(a) \; \forall \; A = a$ | No multiple versions of the treatment; the treatment is well defined |
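While conditional exchangeability and consistency are not empirically testable, positivity can at least be probed in the data. One common diagnostic, sketched below with the hypothetical `dat` from above, is to inspect the distribution of estimated propensity scores.

```{r positivitysketch, eval=FALSE}
# Estimate propensity scores for the exposure and look for values near 0 or 1
ps.fit <- glm(RHC ~ age + sex + race + education +
                Disease.category + Cancer + Income,
              data = dat, family = binomial())
ps <- predict(ps.fit, type = "response")
summary(ps)  # estimated probabilities very close to 0 or 1 suggest
             # possible (practical) positivity violations
```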
## Further reading
### Key articles
- TMLE Procedure:
- @luque2018targeted
- @schuler2017targeted
- Super learner:
- @rose2013mortality
- @naimi2018stacked
### Additional readings
- @rose2020intersections
- @snowden2011implementation
- @naimi2017introduction
- @austin2015moving
- @naimi2017challenges
- @balzer2021demystifying
### Workshops
We highly recommend joining SER if you are interested in Epi methods development. The following workshops and summer course are very useful.
- [SER Workshop](https://epiresearch.org/) Targeted Learning: Causal Inference Meets Machine Learning by Alan Hubbard and Mark van der Laan, 2021
- [SER Workshop](https://epiresearch.org/) Introduction to Parametric and Semi-parametric Estimators for Causal Inference by Laura B. Balzer & Jennifer Ahern, 2020
- [SER Workshop](https://epiresearch.org/) Machine Learning and Artificial Intelligence for Causal Inference and Prediction: A Primer by Ashley Naimi, 2021
- [SISCER](https://si.biostat.washington.edu/suminst/archives/SISCER2021/CR2106) Modern Statistical Learning for Observational Data by Marco Carone, David Benkeser, 2021
### Recorded webinars
The following webinars and workshops are freely accessible, and they are great for understanding the intuitions, theories, and mechanisms behind these methods!
#### Introductory materials
- [An Introduction to Targeted Maximum Likelihood Estimation of Causal Effects](https://www.youtube.com/watch?v=8Q9dfW3oOi4) by Susan Gruber (Putnam Data Sciences)
- [Practical Considerations for Specifying a Super Learner](https://www.youtube.com/watch?v=WYnjja8DKPg) by Rachael Phillips (Putnam Data Sciences)
#### More theory talks
- [Targeted Machine Learning for Causal Inference based on Real World Data](https://www.youtube.com/watch?v=PrPNP5RVcLg) by Mark van der Laan (Putnam Data Sciences)
- [An introduction to Super Learning](https://www.youtube.com/watch?v=1zT17HtvtF8) by Eric Polley (Putnam Data Sciences)
- [Cross-validated Targeted Maximum Likelihood Estimation (CV-TMLE)](https://www.youtube.com/watch?v=MDmddX267Ys) by Alan Hubbard (Putnam Data Sciences)
- [Higher order Targeted Maximum Likelihood Estimation](https://www.youtube.com/watch?v=2jumfnRQpxs) by Mark van der Laan (Online Causal Inference Seminar)
- [Targeted learning for the estimation of drug safety and effectiveness: Getting better answers by asking better questions](http://bcooltv.mcgill.ca/FDownloader.aspx?rid=e3143be2-918d-49d9-82ce-4dfea75ef1dc&DLType=VGAMP4) by Mireille Schnitzer (CNODES)
#### More applied talks
- [Applications of Targeted Maximum Likelihood Estimation](https://www.youtube.com/watch?v=foY7HoCeo88) by Laura Balzer (UCSF Epi & Biostats)
- [Applying targeted maximum likelihood estimation to pharmacoepidemiology](https://www.cnodes.ca/online-lecture/targeted-learning-estimation/) by Menglan Pang (CNODES)
#### Blog
- [Kat’s Stats](https://www.khstats.com/) by Katherine Hoffman
- [Towards Data Science](https://towardsdatascience.com/targeted-maximum-likelihood-tmle-for-causal-inference-1be88542a749) by Yao Yang
- [The Research Group of Mark van der Laan](https://vanderlaan-lab.org/post/) by Mark van der Laan