<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Introduction To Regression</title>
<meta charset="utf-8" />
<meta name="author" content="Evan Wyse" />
<meta name="date" content="2020-03-02" />
<link rel="stylesheet" href="sta210-slides.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# Introduction To Regression
## R Open Labs Workshop Series
### Evan Wyse
### 2020-03-02
---
class: middle, center
### Download slides at [http://bit.ly/duke_lib_regression](http://bit.ly/duke_lib_regression)
---
## Agenda
- What is regression?
- Fitting a model in R
- Interpreting Output
- Model Diagnostics
- Checking Assumptions
- Interactive Exercises + Q&A
---
## Disclaimer
- Regression is a complicated and deep subject. While this talk is a solid introduction, there are some significant caveats to its use. There is a whole undergraduate course at Duke on regression (STA 210). As such, it's probably not a good idea to publish a paper based on what a statistics grad student taught you in an hour.
- These slides make significant use of the course material from STA 210, taught by Professor Maria Tackett
- You can access course materials [here](http://bit.ly/sta210-fa19) - they provide significantly more detail than is available here
---
## Simple Linear Regression
- We observe a dataset `\(\mathbf{Y}\)` composed of `\(n\)` observations, `\(Y_1...Y_n\)`, and an explanatory variable `\(X_1...X_n\)`
- Suspect that there is an (imperfect) linear relationship between `\(\mathbf{Y}\)` and `\(\mathbf{X}\)`, thus our model is `\(Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i\)`
- `\(\epsilon_i\)` is an error term - we assume that it's drawn from a normal (bell-curve) distribution with an unknown variance `\(\sigma^2\)`
- We don't know what `\(\beta_0, \beta_1\)`, or `\(\sigma^2\)` are - but we'd like to estimate them
- We'll denote our estimates of the unknown `\(\beta\)` and `\(\sigma^2\)` by `\(\hat{\beta}\)` and `\(\hat{\sigma}^2\)` respectively (a simulated example is sketched below)
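A minimal simulated sketch of this setup (hypothetical data, not the wages example used later):
```r
# Simulate from the model above with beta0 = 2, beta1 = 3, sigma = 0.5
set.seed(42)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)
coef(lm(y ~ x))  # beta0-hat and beta1-hat, close to 2 and 3
```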
---
## Regression Visualized
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
---
## Expanding To Multiple Predictors
- Dataset of `\(n\)` observations of a response variable `\(\mathbf{Y}\)`, believed to be driven by `\(p\)` explanatory variables `\(\mathbf{X}\)` plus an intercept
- Each `\(Y_i = \beta_0 + \beta_1x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\)`
- We can write this in matrix notation as `\(\mathbf{Y = X\beta} + \epsilon\)`
- This allows us to estimate the individual impact that changes to a specific variable will have on future observations while controlling for the impact of other (correlated) variables
---
### Ordinary Least Squares (OLS) Regression
- Collectively, the standard technique for regression with one or more predictors is called ordinary least squares (OLS)
- OLS finds the coefficient vector (a straight line, in the one-predictor case) that minimizes the squared vertical distance between the line and each of the data points
- We refer to this squared distance as the <font class="vocab">**sum of squared error**</font>. We want to minimize it (a sketch follows the figure below).
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
---
## Categorical Data
- Frequently, some variables are discrete categories (gender, race, education level, etc.)
- R will treat an explanatory variable as categorical if the column is stored as a `factor`, and will generate the categories automatically for you
- We can capture this with linear regression by adding `\(k-1\)` binary (taking values 1 or 0) variables to our model for a variable with `\(k\)` different levels, as sketched below
- If `\(X_j\)` is a categorical variable:
- `\(X_j = 0 \implies X_j\beta_j = 0\)`
- `\(X_j = 1 \implies X_j\beta_j = \beta_j\)`
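A minimal sketch of how R expands a `factor` into `\(k-1\)` binary columns (toy data, not the wages example):
```r
# Three levels -> intercept plus k - 1 = 2 dummy columns
educ <- factor(c("HighSchool", "Bachelor", "Graduate"))
model.matrix(~ educ)  # the baseline level is absorbed into the intercept
```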
---
## Example: Wage Data
- In the 1970s Harris Trust and Savings Bank was sued for discrimination on the basis of gender. The following dataset is a collection of wages for bank employees
#### Variables
**Explanatory**
- <font class="vocab">`Educ`: </font>Education, either 'HighSchool', 'Bachelors', or 'Graduate'
- <font class="vocab">`Exper`: </font>months of previous work experience (before hire at bank)
- <font class="vocab">`Sex`: </font>"Male" or "Female"
- <font class="vocab">`Senior`: </font>months worked at bank since hire
- <font class="vocab">`Age`: </font>age in months
**Response**
- <font class="vocab">`Bsal`: </font>annual salary at time of hire
---
## Glimpse of data
```r
glimpse(wages)
```
```
## Observations: 93
## Variables: 6
## $ Bsal <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690...
## $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Mal...
## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9...
## $ Age <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33...
## $ Exper <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13...
## $ Education <fct> Graduate, Graduate, Graduate, Bachelor, Bachelor, Graduat...
```
---
## Fitting a model
- R allows you to use formula objects to interact with your data using column names
```r
model <- lm(Bsal ~ Education + Exper + Sex + Age, data=wages)
broom::tidy(model) %>% kable(format="markdown", digits=3) # View the output
```
|term | estimate| std.error| statistic| p.value|
|:-------------------|--------:|---------:|---------:|-------:|
|(Intercept) | 4541.806| 307.768| 14.757| 0.000|
|EducationGraduate | 378.285| 131.869| 2.869| 0.005|
|EducationHighSchool | -256.727| 180.654| -1.421| 0.159|
|Exper | 0.051| 1.150| 0.045| 0.964|
|SexMale | 746.467| 141.848| 5.262| 0.000|
|Age | 1.109| 0.786| 1.411| 0.162|
- Note that R has automatically converted the 'Sex' and 'Education' variables to categorical variables and added categories as necessary
- The 'missing' category is captured by the intercept
---
## Additional Syntax in R
- Can also use `Bsal ~ .` to regress a column named `Bsal` against everything else in the data frame
- Can use the `summary` function to obtain an easy-to-read output
```r
model2 <- lm(Bsal ~ ., data=wages)  # regresses Bsal on every other column, including Senior
summary(model)  # summary of the earlier model, which excluded Senior
```
```
##
## Call:
## lm(formula = Bsal ~ Education + Exper + Sex + Age, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1050.48 -389.96 -24.56 321.94 2021.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4541.80562 307.76782 14.757 < 2e-16
## EducationGraduate 378.28523 131.86917 2.869 0.00517
## EducationHighSchool -256.72742 180.65427 -1.421 0.15886
## Exper 0.05148 1.15002 0.045 0.96440
## SexMale 746.46733 141.84809 5.262 1.01e-06
## Age 1.10933 0.78617 1.411 0.16179
##
## Residual standard error: 554.3 on 87 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.3899
## F-statistic: 12.76 on 5 and 87 DF, p-value: 2.646e-09
```
---
## Interpreting the output
- **estimate**: the estimated value of the `\(\beta\)` coefficient for that explanatory variable.
- For most coefficients, the way to interpret this is "*for every 1 unit increase in `\(X\)`, we expect a `\(\beta\)` unit increase in `\(Y\)`, holding the other variables constant*" (a worked example follows below)
- For the **intercept**: the interpretation is "*the expected (average) value for `\(Y\)` if all the `\(X\)` variables are `\(0\)`*". If we have categorical variables, the baseline category is included here.
- **std.error**: The standard error estimate for the coefficient
- **statistic**: The t-statistic - the estimate divided by its standard error
- **p.value**: The p-value implied by the t-statistic
- The interpretation of the p-value for a particular coefficient `\(\hat{\beta_j}\)` is "*the probability of calculating a `\(\hat{\beta_j}\)` this extreme or more extreme, **assuming the null hypothesis is true** (in this case, the null hypothesis is `\(\beta_j=0\)`)*"
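- For example, in the wages model above, the `SexMale` estimate of 746.467 means that, holding education, experience, and age constant, men were paid about $746 more in annual starting salary than women - and its tiny p-value says a difference this large is very unlikely under the null hypothesis `\(\beta_j = 0\)`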
---
## Prediction
```r
x_star <- data.frame(Age=329, Education='HighSchool', Exper=14.0, Sex="Male")
predict(model, x_star, interval='prediction', level=0.95)
```
```
## fit lwr upr
## 1 5397.236 4218.215 6576.257
```
- Code above shows how to obtain an estimate ('fit') as well as the lower and upper bounds of the 95% prediction interval
- Types of uncertainty estimates for predictions:
- **Confidence interval** (interval='confidence') captures the uncertainty inherent in estimating `\(\beta\)` - this is our best guess for the average value of `\(Y\)` at `\(X\)`
- **Prediction interval** (interval='prediction') captures the uncertainty in estimating `\(\hat{\beta}\)`, **plus** the uncertainty from the error inherent in `\(Y\)` (a comparison is sketched below)
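For comparison, a sketch of the confidence interval at the same point (same `model` and `x_star` as above) - it is narrower, since it excludes the error term's variability:
```r
predict(model, x_star, interval='confidence', level=0.95)
```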
---
## Checking Assumptions of Linear Regression
- OLS only gives trustworthy estimates and p-values if four assumptions are satisfied
- **Linearity**: `\(Y\)` cannot depend on `\(\mathbf{X}\)` in a nonlinear way
- **Normality**: The error must be normally distributed, and centered at `\(0\)`. Note: `\(\mathbf{X}\)` can be distributed however you want - it's **just the error `\(\epsilon\)`** that needs to be normally distributed
- **Constant Error**: The amount of error can't change as the predicted value changes
- **Independence**: Each individual `\(Y_i\)` can't depend on any of the other `\(Y_i\)`'s except via their individual `\(X\)` values
- If these assumptions don't hold, the estimates `\(\hat{\beta}, \hat{\sigma}^2\)` (and the p-values) are not guaranteed to be accurate
---
### Assumption 1: Linearity
- **How to check**: Plot the predicted value `\(\hat{Y}\)` against the residuals (code sketched below)
- Values should be centered around `\(0\)` at every value of `\(\hat{Y}\)`
- You can fix this by transforming `\(Y\)` or `\(X\)` to make the relationship linear - but remember that your predictions, confidence intervals, etc., will then all be in the transformed space
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
---
### Assumption 1: Linearity
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
- DON'T worry if the data is bunched in some areas left-to-right
- DO worry if the data appears to be bunched above/below the line
---
### Assumption 1: Linearity
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
- DON'T worry if the data is bunched in some areas left-to-right
- DO worry if the data appears to be bunched above/below the line
---
### Assumption 2: Normality
- `\(\epsilon\)` must be distributed **normally** - i.e. from a bell curve
- **How to check**: Make a histogram and QQ-plot of the residuals, and check whether they appear normally distributed (code sketched below)
- You should observe a roughly bell-shaped curve. Anything else indicates that the normality assumption is violated
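Both checks are one-liners in base R, assuming the `model` fit earlier:
```r
hist(resid(model), breaks = 20)  # look for a rough bell shape
qqnorm(resid(model))             # points should fall near the reference line
qqline(resid(model))
```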
---
### Assumption 2: Normality
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
- DON'T worry if the histogram shows a somewhat spiky pattern - this happens a lot due to inherent randomness when your sample size is small
- DO worry if you see multiple modes emerge in the histogram - an 'M' shape is almost certainly evidence of a problem
---
## Assumption 3: Constant Error
- The error variance `\(\sigma^2\)` can't change as `\(X\)` changes
- **How to check**: Plot the predicted value `\(\hat{Y}\)` against residuals. The spread above/below zero shouldn't change.
<img src="intro_regression_slides_files/regression.png" width="80%" style="display: block; margin: auto;" />
---
### Assumption 3: Constant Error
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
---
### Assumption 3: Constant Error
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
- Note how the residuals get larger as the predicted value increases. This is bad.
---
## Assumption 4: Independence
- Each `\(Y_i\)` can't depend on any other `\(Y_j\)`, beyond what's captured in `\(X\)`
- Common issues with this assumption are:
- **Serial effect**: If data are collected over time, there is a chance of autocorrelation in the dataset (a quick check is sketched below)
- **Cluster effect**: If `\(Y\)` depends on some grouping variable that's not included in your model, observations within a group will be correlated
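A quick numeric check for the serial effect, assuming a fitted model `fit` whose rows are ordered in time (hypothetical - the wages data isn't a time series):
```r
acf(resid(fit))  # bars outside the dashed bands suggest autocorrelation
```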
---
### Example Residuals: Cluster Effect
```r
ggplot(data=pew_data, mapping = aes(x=percapitaincome,y=residuals,color=State)) +
geom_point() +
geom_hline(yintercept=0,color="red") +
labs(title="Residuals vs. Per Capita Income",
x="Per Capita Income ($)")
```
<img src="intro_regression_slides_files/residuals_cluster_effect.jpg" width="80%" style="display: block; margin: auto auto auto 0;" />
---
### Example Residuals: Serial Effect
```r
ggplot(data=pew_data, aes(x=Year,y=residuals,color=State)) + geom_point() +
geom_hline(yintercept=0,color="red")+
labs("Residuals vs. Year") +
scale_x_continuous(breaks=seq(2000,2009,1))
```
<img src="intro_regression_slides_files/residuals_serial_effect.jpg" width="80%" style="display: block; margin: auto auto auto 0;" />
---
### Common Scenarios That Violate Assumptions
- **I'm predicting one or more time series**: Most time series suffer from some amount of *autocorrelation*, which violates the independence assumption. A common fix is to calculate the growth rate between each time step and run your regression on that, though this isn't guaranteed to remove the autocorrelation (a sketch follows this list)
- **I'm predicting an index value, like app ratings**: Because indexes are typically bounded, the normality assumption breaks down as we get closer to the bounds. Try dividing your data into discrete categories and using *multinomial regression*
- **I'm predicting the number of times something happens**: Similarly, as `\(Y\)` approaches `\(0\)`, the assumption of normality breaks down. This isn't a huge problem if your observations aren't close to zero. Otherwise, consider *Poisson regression* for a more appropriate model
- **I'm predicting a binary variable, with a yes/no response**: This will violate the normality assumption - use *logistic regression* instead
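A minimal sketch of the growth-rate fix from the first bullet, assuming a numeric vector `y` ordered in time (hypothetical data):
```r
growth <- diff(y) / head(y, -1)  # period-over-period growth rate
# or diff(log(y)) for log growth; regress `growth`, not `y`, on your predictors
```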
---
## Cautions
- Avoid extrapolation:
- Relationships can change at different portions of the data
- Almost all continuous functions are locally linear - but a nonlinear trend might emerge as you extend beyond the scope of your data
- Regression shows only correlation, not causation
- Proving causality requires a carefully designed experiment or carefully accounting for confounding variables in an observational study
- Be careful of providing explanatory variables that are too correlated with each other (a quick diagnostic is sketched below)
- You can use model selection techniques to help understand which variables you should retain
- Model selection is an iterative process
- Don't be afraid to change your model based on the outcome of initial regressions
- BUT, monkeying around with your model in pursuit of better p-values is not scientific. If you're deciding between a lot of different models, engage a statistician to help you capture the uncertainty associated with this process.
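One common diagnostic for overly correlated predictors (not covered in these slides) is the variance inflation factor, assuming the `car` package is installed:
```r
car::vif(model)  # values well above ~5-10 suggest problematic collinearity
```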
---
### Important Topics We Didn't Cover:
- **Interaction terms**: What to do when some of your variables might produce an additional response when viewed together
- **Model selection**: How to know which variables to include in your model
- **Outlier detection**: Use of Cook's Distance, robust regression, and other techniques for handling outliers
- **Logistic Regression**: When your observed variable is a binary (yes/no) response
- **Multinomial Regression**: Similar to logistic regression, when your response is one of several discrete categories
- **Penalized regression**: Wide class of techniques used to obtain more stable estimates of `\(\beta\)` at the expense of introducing some bias
- **Poisson regression**: Used to model count-based data
- **Bayesian approaches to regression**: How to use priors to gain estimates of the distribution of `\(\hat{\beta}, \hat{\sigma^2}\)`
---
## Try it out
- Download and save the .Rmd from [here](https://raw.githubusercontent.com/wyseguy7/intro_regression/master/intro_regression.Rmd) so we can step through exercises together
---
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/remark/0.14.0/remark.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false,
"slideNumberFormat": "%current%"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_HTMLorMML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>