-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
320 lines (237 loc) · 10.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
library(knitr)
knitr::opts_chunk$set(
collapse = FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# diftrans
<!-- badges: start -->
![status](https://img.shields.io/badge/status-under%20construction-yellow)
<!-- badges: end -->
## Table of Contents
* [Introduction](#introduction)
* [Installation](#installation)
* [Attribution](#attribution)
* [Example](#example)
## Introduction
The `diftrans` package applies the novel methods described in Daljord, Pouliot, Hu, and Xiao (2021) to compute the transport costs between two univariate distributions and the differences-in-transports estimator.
Appealing to optimal transport theory, the `diftrans` package builds off of the
[`transport` package](https://cran.r-project.org/web/packages/transport/index.html)
to compute the change between the distributions of a univariate variable of interest.
Extending this application, the `diftrans` package allows users to compute the *before-and-after estimator* (Daljord et al., 2021), which controls for sampling uncertainty by trivializing small changes in the distributions.
The `diftrans` package also computes the *differences-in-transports estimator*, which controls for unobservable reasons that the distributions may change by comparing the transport costs to those from another source of measurement.
The only function in the `diftrans` package is `diftrans`, which is explained below by way of a toy example.
The documentation file for `diftrans::diftrans` can be found by running the following in an R console:
```{r eval = F}
?diftrans::diftrans
```
## Installation
Install the `diftrans` from [GitHub](https://github.com/omkarakatta/diftrans) with:
``` r
# install.packages("devtools")
devtools::install_github("omkarakatta/diftrans")
```
To install the package with the vignette installed, set the `build_vignettes`
argument to `TRUE`.
Note that the installation will take more time this way.
```r
# install.packages("devtools")
devtools::install_github("omkarakatta/diftrans", build_vignettes = TRUE)
```
## Attribution
To cite the `diftrans` package, use the BibTeX entry provided by:
``` r
citation("diftrans")
```
## Example
The workhorse function in the `diftrans` package is also called `diftrans`, which serves two purposes:
* compute the transport cost between two univariate distributions, and
* compute the differences-in-transports estimator (see Daljord et al. (2021)).
### Setup
```{r echo = F, message = F, warning = F}
library(dplyr)
library(magrittr)
devtools::load_all()
mykable <- function(dt){
kable(dt,
align = 'c')
}
```
```{r eval = F}
library(dplyr)
library(magrittr)
library(diftrans)
```
We begin by computing the transport cost between two distributions of some
variable `x`. These distributions should be represented as tibbles with two columns:
* column 1 contains the full support of the distribution, and
* column 2 (labeled "count") contains the mass/counts associated with each value in the support.
Suppose that the shift in the distribution is due to some treatment.
We refer to the first and second distributions of some random variable `x` as the
pre-distribution and post-distribution for the treated group, respectively.
Below are the tibbles for the pre- and post-distributions as well as the
corresponding plots.
Both tibbles contain the full support of our variable of interest, i.e., their
`support` column is the same.
*Pre-Distribution for Treated Group*
```{r echo = F}
set.seed(1)
support_min <- 1
support_max <- 10
support <- seq(support_min, support_max, 1)
pre_treated <- data.frame(x = sample(support, 100, replace = T)) %>%
group_by(x) %>%
count() %>%
ungroup() %>%
rename(count = n)
shift <- 5
for (i in 1:shift){
pre_treated[pre_treated$x == support_max-i+1, "count"] <- 0
}
mykable(pre_treated)
```
*Post-Distribution for Treated Group*
```{r echo = F}
post_treated <- data.frame(x = support,
count = lag(pre_treated$count, shift)) %>%
tidyr::replace_na(list(count = 0))
mykable(post_treated)
```
Observe that all the mass from the pre-distribution was shifted by `r shift` units
of the support to form the post-distribution. That is, the treatment resulted in all
the mass to change. For instance, the mass that was given
given to `r pre_treated[1, "x"]` in the pre-distribution is now given
to `r pre_treated[1+shift, "x"]`.
In this toy example, we should expect that the transport cost is 100% because
all the mass was transported due to the treatment.
### Compute Transport Cost
We can compute the transport cost as follows:
```{r echo = T}
tc <- diftrans(pre_main = pre_treated, post_main = post_treated,
estimator = "tc", var = x,
bandwidth = 0)
tc
```
**The transport cost is `r scales::percent(tc$main)`** as expected.
The `estimator` argument is specified to be `"tc"`, which stands for transport cost.
(Since we only have two distributions, estimator will take on the value of `"tc"` by default.)
Since the post-distribution is a `r shift`-unit shift in the pre-distribution, any bandwidth
at least `r shift` must result in a transport cost of 0. We verify this by computing
the transport cost for a sequence of bandwidths from 0 to 10:
```{r}
tc <- diftrans(pre_main = pre_treated, post_main = post_treated,
estimator = "tc", var = x,
bandwidth = seq(0, 10))
tc
```
### Setup for Difference-in-Transports
Above, we assumed that the treatment is sole reason why any shifts between the pre- and post-distributions of `x` might occur.
Now suppose there are unobserved trends in our variable that might also explain this shift.
To elicit how much of this shift is due to the treatment
and how much of this shift is due to unobserved trends, we can compare our distributions
above with distributions of the same variable measured in some control group that
did not receive the treatment but was also subjected to the same unobserved trends.
The novel differences-in-transports estimator described in Daljord et al. (2021)
quantifies the change in distribution due to some treatment relative to the change
in distribution of some control group.
```{r echo = F}
control_shift <- 2
pre_control <- data.frame(x = support,
count = pre_treated$count)
post_control <- data.frame(x = support,
count = lag(pre_treated$count, control_shift)) %>%
tidyr::replace_na(list(count = 0))
```
First, we need to set up the distributions of our variable for our control group,
both before and after the treatment was administered to the treatment group.
For our purposes, we let the pre-distribution of our control group be the same
as that of our treated group. Then, let our post-distribution for our control group
be a `r control_shift`-unit shift of our pre-distribution. We therefore have the
following tibbles that represent our control group distributions:
*Pre-Distribution for Control Group*
```{r echo = F}
mykable(pre_control)
```
*Post-Distribution for Control Group*
```{r echo = F}
mykable(post_control)
```
### Compute Differences-in-Transports Estimator <sup id="a1">[1](#f1)</sup>
Above, we used `estimator = "tc"` and specified our pre- and post-distributions
for our treated group.
Now, we use `estimator = "dit"` (which stands for differences-in-treatments estimator)
while also specifying our pre- and post-distributions for our control group.
```{r echo = T}
dit <- diftrans(pre_main = pre_treated, post_main = post_treated,
pre_control = pre_control, post_control = post_control,
estimator = "dit", var = x,
bandwidth_seq = seq(0, 10, 1),
save_dit = TRUE)
dit$out
dit$dit
dit$optimal_bandwidth
```
Another difference between computing the transport cost and the differences-in-differences
estimator is that the differences-in-transports estimator is printed as a message.
To save this result in `dit`, we use `save_dit = TRUE`.
Thus, **the differences-in-transports estimator is `r scales::percent(dit$dit)` at an optimal bandwidth of `r dit$optimal_bandwidth`**.
#### Sampling Variability
Often times, the data will be a sample from a population,
and sample variability is enough to cause differences between distributions.
To account for this discrepancy, we can increase the bandwidth to ignore
transfers of mass between nearby values of the support.
```{r echo = F}
d_a <- 1
```
While Daljord et al. (2021) offer a disciplined way to determine what the appropriate
bandwidths are, for simplicity, we will suppose that there will not be any sampling
variation with a bandwidth greater than `r d_a`.
```{r echo = T, eval = F}
d_a <- 1
```
```{r echo = T}
dit <- diftrans(pre_main = pre_treated, post_main = post_treated,
pre_control = pre_control, post_control = post_control,
estimator = "dit", var = x,
bandwidth_seq = seq(d_a, 10, 1), # smallest bandwidth will be d_a
save_dit = TRUE)
dit$out
dit$dit
dit$optimal_bandwidth
```
Note that this restriction is not binding, so our result is unaffected.
**Accounting for sampling variability, our differences-in-transports estimate is still `r scales::percent(dit$dit)`**.
#### Conservative Differences-in-Transports
If we want to be more conservative, we can use
`conservative = TRUE`, which essentially uses twice the bandwidth for computing the
treated group's transport costs relative to the bandwidth for computing the control
group's transport costs.
```{r}
dit <- diftrans(pre_main = pre_treated, post_main = post_treated,
pre_control = pre_control, post_control = post_control,
estimator = "dit", var = x,
bandwidth_seq = seq(d_a, 10, 1),
save_dit = TRUE,
conservative = TRUE)
dit$out
dit$dit
dit$optimal_bandwidth
```
Finally, we have that the **conservative differences-in-transports estimator is `r scales::percent(dit$dit)` at an optimal bandwidth of `r dit$optimal_bandwidth`**.
Note that `estimator = "dit"` is not necessary. Since we have two sets of pre- and
post-distributions, `diftrans` is smart enough to return the differences-in-transports
estimator by default.
------
<b id="f1">1</b> There is more nuance to computing the differences-in-transports estimator than
is presented in this expository document. See Daljord et al. (2021) for a more
complete treatment on how to compute the differences-in-transports estimator. [↩](#a1)
<!-- https://stackoverflow.com/a/32119820 -->