# Final Words
```{r setup2vf, include=FALSE}
require(knitr)
require(kableExtra)
require(DiagrammeR)
require(DiagrammeRsvg)
require(rsvg)
library(magrittr)
library(svglite)
library(png)
require(nhanesA)
library(skimr)
library(jtools)
require(cobalt)
require(tableone)
require(Publish)
```
## Select variables judiciously
```{r role, echo = FALSE, out.width = "650px", fig.cap="Variable roles: A = exposure or treatment; Y = outcome; L = confounder; R = risk factor for Y; M = mediator; C = collider; E = effect of Y; I = instrument; u = unmeasured confounder; P = proxy of U; N = noise variable"}
knitr::include_graphics("images/role.png")
```
- Think about the role of each variable first
    - ideally, include confounders to reduce bias
    - consider including risk factors for the outcome for greater precision
    - instruments, colliders, mediators, effects of the outcome, and noise variables should be avoided
    - if a confounder is unmeasured, consider adding a proxy for it (with caution)
- If you do not have subject area expertise, talk to experts
- Pre-screen the candidate variables (a sketch follows the comment below), e.g., for
    - sparse binary variables
    - highly collinear variables
```{block, type='rmdcomment'}
Relying solely on a black-box ML method to identify the roles of variables in the relationship of interest can be dangerous.
```
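A minimal sketch of such pre-screening is given below. The data frame `covs` and the thresholds (5% prevalence, 0.9 correlation) are hypothetical and illustrative, not objects defined in this chapter or prescriptive cutoffs.

```{r prescreen, eval=FALSE}
# Hypothetical covariate data frame `covs`
# Flag sparse binary variables (rarest level below, say, 5%)
is.binary <- sapply(covs, function(x) length(unique(na.omit(x))) == 2)
sparse <- sapply(covs[is.binary], function(x) min(prop.table(table(x))) < 0.05)
names(covs[is.binary])[sparse]  # candidates to drop or collapse

# Flag highly collinear numeric pairs (|correlation| above, say, 0.9)
num <- sapply(covs, is.numeric)
cor.mat <- cor(covs[num], use = "pairwise.complete.obs")
which(abs(cor.mat) > 0.9 & upper.tri(cor.mat), arr.ind = TRUE)
```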
## Why SL and TMLE
### Prediction goal
```{r dag1, echo=FALSE, cache=TRUE}
g2 <- grViz("
digraph causal {
# Nodes
node [shape = circle]
y [label = 'Length of stay']
node [shape = box]
a [label = 'RHC']
b [label = 'age']
c [label = 'sex']
d [label = 'race']
e [label = 'education']
f [label = 'Disease.category']
g [label = 'Cancer']
h [label = 'Income']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
{b c d e f g h} -> y
a -> y
# Graph
graph [overlap = true, fontsize = 10]
}")
g2 %>% export_svg %>% charToRaw %>% rsvg %>% png::writePNG("images/dagpred.png")
```
```{r plot1, echo=FALSE}
knitr::include_graphics("images/dagpred.png")
```
- Assuming all covariates are measured, **parametric models** such as linear and logistic regression are very efficient, but they rely on strong assumptions. In real-world scenarios, it is often hard (if not impossible) to guess the correct specification of the right-hand side of the regression equation.
- Machine learning (ML) methods are very helpful for prediction goals. They are also helpful in **identifying complex functions** (non-linearities and non-additive terms) of the covariates (again, assuming they are measured).
- There are many ML methods, but their procedures differ substantially, and each comes with its own advantages and disadvantages. For a given real dataset, it is **hard to predict a priori which ML algorithm will perform best** for the problem at hand.
```{block, type='rmdcomment'}
The super learner is helpful in **combining strengths from various algorithms** and producing a single prediction column with **optimal statistical properties**.
```
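As a hedged illustration of fitting a super learner for this prediction goal: the data frame `dat`, its variable names, and the candidate library below are assumptions for illustration only, not objects defined in this chapter.

```{r slsketch, eval=FALSE}
library(SuperLearner)
# Hypothetical analytic data `dat` with the covariates from the DAG above
sl.fit <- SuperLearner(
  Y = dat$Length.of.stay,
  X = dat[, c("RHC", "age", "sex", "race", "education",
              "Disease.category", "Cancer", "Income")],
  family = gaussian(),
  SL.library = c("SL.glm", "SL.ranger", "SL.earth"),  # parametric + ML candidates
  cvControl = list(V = 10)                            # 10-fold cross-validation
)
sl.fit$coef  # data-adaptive weights given to each candidate learner
```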
### Causal inference
```{r dag2, echo=FALSE, cache=TRUE}
g2 <- grViz("
digraph causal {
# Nodes
node [shape = circle]
a [label = 'RHC']
y [label = 'Length of stay']
node [shape = box]
b [label = 'age']
c [label = 'sex']
d [label = 'race']
e [label = 'education']
f [label = 'Disease.category']
g [label = 'Cancer']
h [label = 'Income']
# Edges
edge [color = black,
arrowhead = vee]
rankdir = LR
{b c d e f g h} -> y
{b c d e f g h} -> a
a -> y
# Graph
graph [overlap = true, fontsize = 10]
}")
g2 %>% export_svg %>% charToRaw %>% rsvg %>% png::writePNG("images/dagci.png")
```
```{r plot2, echo=FALSE}
knitr::include_graphics("images/dagci.png")
```
- For causal inference goals (when we have a primary exposure of interest), machine learning methods are often misleading. This is primarily because they usually have no inherent mechanism for focusing on the **primary exposure** (RHC in this example), and instead treat the primary exposure like any other predictor.
- When using g-computation with ML methods, estimating the variance (with correct coverage) becomes a difficult problem. General-purpose procedures such as **robust SEs or bootstrap methods** are not supported by theory in this setting.
```{block, type='rmdcomment'}
This is where the TMLE method shines, thanks to its important **statistical properties (double robustness, good finite-sample properties)**.
```
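A minimal sketch of estimating the average treatment effect of RHC with the `tmle` package follows. As before, `dat` and its variable names are hypothetical; the exposure must be coded 0/1.

```{r tmlesketch, eval=FALSE}
library(tmle)
tmle.fit <- tmle(
  Y = dat$Length.of.stay,
  A = dat$RHC,  # binary primary exposure (1 = received RHC, 0 = not)
  W = dat[, c("age", "sex", "race", "education",
              "Disease.category", "Cancer", "Income")],
  family = "gaussian",
  Q.SL.library = c("SL.glm", "SL.ranger", "SL.earth"),  # outcome model learners
  g.SL.library = c("SL.glm", "SL.ranger")               # exposure model learners
)
tmle.fit$estimates$ATE  # doubly robust point estimate, variance, and 95% CI
```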
### Identifiability assumptions
However, causal inference requires satisfying identifiability assumptions (see below) before we can interpret association measures from statistical models causally. Many of these assumptions are not empirically testable. That is why it is extremely important to work with **subject area experts** to assess the plausibility of those assumptions in the given context.
```{block, type='rmdcomment'}
No ML method, no matter how fancy it is, can automatically produce estimates that can be directly interpreted as causal, unless the identifiability assumptions are properly taken into account.
```
| Assumption | Notation | Plain-language meaning |
|-|-|-|
| Conditional exchangeability | $Y(1), Y(0) \perp A \mid L$ | Treatment assignment is independent of the potential outcomes, given covariates |
| Positivity | $0 < P(A=1 \mid L) < 1$ | Every subject is eligible to receive either treatment level, given covariates |
| Consistency | $Y = Y(a) \; \forall \; A = a$ | No multiple versions of the treatment; the treatment is well defined |
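While conditional exchangeability and consistency are not empirically testable, positivity can at least be probed in the data. One common diagnostic, sketched below with the hypothetical `dat` from above, is to inspect the distribution of estimated propensity scores.

```{r positivitysketch, eval=FALSE}
# Estimate propensity scores for the exposure and look for values near 0 or 1
ps.fit <- glm(RHC ~ age + sex + race + education +
                Disease.category + Cancer + Income,
              data = dat, family = binomial())
ps <- predict(ps.fit, type = "response")
summary(ps)  # estimated probabilities very close to 0 or 1 suggest
             # possible (practical) positivity violations
```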
## Further reading
### Key articles
- TMLE Procedure:
- @luque2018targeted
- @schuler2017targeted
- Super learner:
- @rose2013mortality
- @naimi2018stacked
### Additional readings
- @rose2020intersections
- @snowden2011implementation
- @naimi2017introduction
- @austin2015moving
- @naimi2017challenges
- @balzer2021demystifying
### Workshops
We highly recommend joining SER if you are interested in Epi methods development. The following workshops and summer course are very useful.
- [SER Workshop](https://epiresearch.org/) Targeted Learning: Causal Inference Meets Machine Learning by Alan Hubbard and Mark van der Laan, 2021
- [SER Workshop](https://epiresearch.org/) Introduction to Parametric and Semi-parametric Estimators for Causal Inference by Laura B. Balzer & Jennifer Ahern, 2020
- [SER Workshop](https://epiresearch.org/) Machine Learning and Artificial Intelligence for Causal Inference and Prediction: A Primer by Ashley Naimi, 2021
- [SISCER](https://si.biostat.washington.edu/suminst/archives/SISCER2021/CR2106) Modern Statistical Learning for Observational Data by Marco Carone, David Benkeser, 2021
### Recorded webinars
The following webinars and workshops are freely accessible, and they are great for understanding the intuitions, theories, and mechanisms behind these methods!
#### Introductory materials
- [An Introduction to Targeted Maximum Likelihood Estimation of Causal Effects](https://www.youtube.com/watch?v=8Q9dfW3oOi4) by Susan Gruber (Putnam Data Sciences)
- [Practical Considerations for Specifying a Super Learner](https://www.youtube.com/watch?v=WYnjja8DKPg) by Rachael Phillips (Putnam Data Sciences)
#### More theory talks
- [Targeted Machine Learning for Causal Inference based on Real World Data](https://www.youtube.com/watch?v=PrPNP5RVcLg) by Mark van der Laan (Putnam Data Sciences)
- [An introduction to Super Learning](https://www.youtube.com/watch?v=1zT17HtvtF8) by Eric Polley (Putnam Data Sciences)
- [Cross-validated Targeted Maximum Likelihood Estimation (CV-TMLE)](https://www.youtube.com/watch?v=MDmddX267Ys) by Alan Hubbard (Putnam Data Sciences)
- [Higher order Targeted Maximum Likelihood Estimation](https://www.youtube.com/watch?v=2jumfnRQpxs) by Mark van der Laan (Online Causal Inference Seminar)
- [Targeted learning for the estimation of drug safety and effectiveness: Getting better answers by asking better questions](http://bcooltv.mcgill.ca/FDownloader.aspx?rid=e3143be2-918d-49d9-82ce-4dfea75ef1dc&DLType=VGAMP4) by Mireille Schnitzer (CNODES)
#### More applied talks
- [Applications of Targeted Maximum Likelihood Estimation](https://www.youtube.com/watch?v=foY7HoCeo88) by Laura Balzer (UCSF Epi & Biostats)
- [Applying targeted maximum likelihood estimation to pharmacoepidemiology](https://www.cnodes.ca/online-lecture/targeted-learning-estimation/) by Menglan Pang (CNODES)
#### Blog
- [Kat’s Stats](https://www.khstats.com/) by Katherine Hoffman
- [Towards Data Science](https://towardsdatascience.com/targeted-maximum-likelihood-tmle-for-causal-inference-1be88542a749) by Yao Yang
- [The Research Group of Mark van der Laan](https://vanderlaan-lab.org/post/) by Mark van der Laan