<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Introduction To Regression</title>
<meta charset="utf-8" />
<meta name="author" content="Evan Wyse" />
<meta name="date" content="2020-03-02" />
<link rel="stylesheet" href="sta210-slides.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# Introduction To Regression
## R Open Labs Workshop Series
### Evan Wyse
### 2020-03-02
---
class: middle, center
### Download slides at [http://bit.ly/duke_lib_regression](http://bit.ly/duke_lib_regression)
---
## Agenda
- What is regression?
- Fitting a model in R
- Interpreting Output
- Model Diagnostics
- Checking Assumptions
- Interactive Exercises + Q&A
---
## Disclaimer
- Regression is a complicated and deep subject. While this talk is a solid introduction, there are some significant caveats to its use. There is a whole undergraduate course at Duke on regression (STA 210). As such, it's probably not a good idea to publish a paper based on what a statistics grad student taught you in an hour.
- These slides make significant use of the course material from STA 210, taught by Professor Maria Tackett
- You can access course materials [here](http://bit.ly/sta210-fa19) - they provide significantly more detail than is available here
---
## Simple Linear Regression
- We observe a dataset `\(\mathbf{Y}\)` composed of `\(n\)` observations, `\(Y_1...Y_n\)`, and an explanatory variable `\(X_1...X_n\)`
- Suspect that there is an (imperfect) linear relationship between `\(\mathbf{Y}\)` and `\(\mathbf{X}\)`, thus our model is `\(Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i\)`
- `\(\epsilon_i\)` is an error term - we assume that it's drawn from a normal (bell-curve) distribution with an unknown variance `\(\sigma^2\)`
- We don't know what `\(\beta_0, \beta_1\)`, or `\(\sigma^2\)` are - but we'd like to estimate them
- We'll denote our estimates of the unknown `\(\beta\)` and `\(\sigma^2\)` by `\(\hat{\beta}\)` and `\(\hat{\sigma}^2\)` respectively (a simulated example is sketched below)
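A minimal simulated sketch of this setup (hypothetical data, not the wages example used later):
```r
# Simulate from the model above with beta0 = 2, beta1 = 3, sigma = 0.5
set.seed(42)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)
coef(lm(y ~ x))  # beta0-hat and beta1-hat, close to 2 and 3
```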
---
## Regression Visualized
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
---
## Expanding To Multiple Predictors
- Dataset of `\(n\)` observations of a response variable `\(\mathbf{Y}\)`, believed to be driven by `\(p\)` explanatory variables `\(\mathbf{X}\)` plus an intercept
- Each `\(Y_i = \beta_0 + \beta_1x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\)`
- We can write this in matrix notation as `\(\mathbf{Y = X\beta} + \epsilon\)`
- This allows us to estimate the individual impact that changes to a specific variable will have on future observations while controlling for the impact of other (correlated) variables
---
### Ordinary Least Squares (OLS) Regression
- Collectively, the standard technique for regression with one or more predictors is called ordinary least squares (OLS)
- OLS finds the coefficient vector (a straight line, in the one-predictor case) that minimizes the squared vertical distance between the line and each of the data points
- We refer to this squared distance as the <font class="vocab">**sum of squared error**</font>. We want to minimize it (a sketch follows the figure below).
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
---
## Categorical Data
- Frequently, some variables are discrete categories (gender, race, education level, etc.)
- R will treat an explanatory variable as categorical if the column is stored as a `factor`, and will generate the categories automatically for you
- We can capture this with linear regression by adding `\(k-1\)` binary (taking values 1 or 0) variables to our model for a variable with `\(k\)` different levels, as sketched below
- If `\(X_j\)` is a categorical variable:
- `\(X_j = 0 \implies X_j\beta_j = 0\)`
- `\(X_j = 1 \implies X_j\beta_j = \beta_j\)`
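A minimal sketch of how R expands a `factor` into `\(k-1\)` binary columns (toy data, not the wages example):
```r
# Three levels -> intercept plus k - 1 = 2 dummy columns
educ <- factor(c("HighSchool", "Bachelor", "Graduate"))
model.matrix(~ educ)  # the baseline level is absorbed into the intercept
```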
---
## Example: Wage Data
- In the 1970s Harris Trust and Savings Bank was sued for discrimination on the basis of gender. The following dataset is a collection of wages for bank employees
#### Variables
**Explanatory**
- <font class="vocab">`Educ`: </font>Education, either 'HighSchool', 'Bachelors', or 'Graduate'
- <font class="vocab">`Exper`: </font>months of previous work experience (before hire at bank)
- <font class="vocab">`Sex`: </font>"Male" or "Female"
- <font class="vocab">`Senior`: </font>months worked at bank since hire
- <font class="vocab">`Age`: </font>age in months
**Response**
- <font class="vocab">`Bsal`: </font>annual salary at time of hire
---
## Glimpse of data
```r
glimpse(wages)
```
```
## Observations: 93
## Variables: 6
## $ Bsal <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690...
## $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Mal...
## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9...
## $ Age <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33...
## $ Exper <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13...
## $ Education <fct> Graduate, Graduate, Graduate, Bachelor, Bachelor, Graduat...
```
---
## Fitting a model
- R allows you to use formula objects to interact with your data using column names
```r
model <- lm(Bsal ~ Education + Exper + Sex + Age, data=wages)
broom::tidy(model) %>% kable(format="markdown", digits=3) # View the output
```
|term | estimate| std.error| statistic| p.value|
|:-------------------|--------:|---------:|---------:|-------:|
|(Intercept) | 4541.806| 307.768| 14.757| 0.000|
|EducationGraduate | 378.285| 131.869| 2.869| 0.005|
|EducationHighSchool | -256.727| 180.654| -1.421| 0.159|
|Exper | 0.051| 1.150| 0.045| 0.964|
|SexMale | 746.467| 141.848| 5.262| 0.000|
|Age | 1.109| 0.786| 1.411| 0.162|
- Note that R has automatically converted the 'Sex' and 'Education' variables to categorical variables and added categories as necessary
- The 'missing' category is captured by the intercept
---
## Additional Syntax in R
- Can also use `Bsal ~ .` to regress a column named `Bsal` against everything else in the data frame
- Can use the `summary` function to obtain an easy-to-read output
```r
model2 <- lm(Bsal ~ ., data=wages)  # regresses Bsal on every other column, including Senior
summary(model)  # summary of the earlier model, which excluded Senior
```
```
##
## Call:
## lm(formula = Bsal ~ Education + Exper + Sex + Age, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1050.48 -389.96 -24.56 321.94 2021.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4541.80562 307.76782 14.757 < 2e-16
## EducationGraduate 378.28523 131.86917 2.869 0.00517
## EducationHighSchool -256.72742 180.65427 -1.421 0.15886
## Exper 0.05148 1.15002 0.045 0.96440
## SexMale 746.46733 141.84809 5.262 1.01e-06
## Age 1.10933 0.78617 1.411 0.16179
##
## Residual standard error: 554.3 on 87 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.3899
## F-statistic: 12.76 on 5 and 87 DF, p-value: 2.646e-09
```
---
## Interpreting the output
- **estimate**: the estimated value of the `\(\beta\)` coefficient for that explanatory variable.
- For most coefficients, the way to interpret this is "*for every 1 unit increase in `\(X\)`, we expect a `\(\beta\)` unit increase in `\(Y\)`, holding the other variables constant*" (a worked example follows below)
- For the **intercept**: the interpretation is "*the expected (average) value for `\(Y\)` if all the `\(X\)` variables are `\(0\)`*". If we have categorical variables, the baseline category is included here.
- **std.error**: The standard error estimate for the coefficient
- **statistic**: The t-statistic - the estimate divided by its standard error
- **p.value**: The p-value implied by the t-statistic
- The interpretation of the p-value for a particular coefficient `\(\hat{\beta_j}\)` is "*the probability of calculating a `\(\hat{\beta_j}\)` this extreme or more extreme, **assuming the null hypothesis is true** (in this case, the null hypothesis is `\(\beta_j=0\)`)*"
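- For example, in the wages model above, the `SexMale` estimate of 746.467 means that, holding education, experience, and age constant, men were paid about $746 more in annual starting salary than women - and its tiny p-value says a difference this large is very unlikely under the null hypothesis `\(\beta_j = 0\)`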
---
## Prediction
```r
x_star <- data.frame(Age=329, Education='HighSchool', Exper=14.0, Sex="Male")
predict(model, x_star, interval='prediction', level=0.95)
```
```
## fit lwr upr
## 1 5397.236 4218.215 6576.257
```
- Code above shows how to obtain an estimate ('fit') as well as the lower and upper bounds of the 95% prediction interval
- Types of uncertainty estimates for predictions:
- **Confidence interval** (interval='confidence') captures the uncertainty inherent in estimating `\(\beta\)` - this is our best guess for the average value of `\(Y\)` at `\(X\)`
- **Prediction interval** (interval='prediction') captures the uncertainty in estimating `\(\hat{\beta}\)`, **plus** the uncertainty from the error inherent in `\(Y\)` (a comparison is sketched below)
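For comparison, a sketch of the confidence interval at the same point (same `model` and `x_star` as above) - it is narrower, since it excludes the error term's variability:
```r
predict(model, x_star, interval='confidence', level=0.95)
```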
---
## Checking Assumptions of Linear Regression
- OLS only gives trustworthy estimates and p-values if four assumptions are satisfied
- **Linearity**: `\(Y\)` cannot depend on `\(\mathbf{X}\)` in a nonlinear way
- **Normality**: The error must be normally distributed, and centered at `\(0\)`. Note: `\(\mathbf{X}\)` can be distributed however you want - it's **just the error `\(\epsilon\)`** that needs to be normally distributed
- **Constant Error**: The amount of error can't change as the predicted value changes
- **Independence**: Each individual `\(Y_i\)` can't depend on any of the other `\(Y_i\)`'s except via their individual `\(X\)` values
- If these assumptions don't hold, the estimates `\(\hat{\beta}, \hat{\sigma}^2\)` (and the p-values) are not guaranteed to be accurate
---
### Assumption 1: Linearity
- **How to check**: Plot the predicted value `\(\hat{Y}\)` against the residuals (code sketched below)
- Values should be centered around `\(0\)` at every value of `\(\hat{Y}\)`
- You can fix this by transforming `\(Y\)` or `\(X\)` to make the relationship linear - but remember that your predictions, confidence intervals, etc., will then all be in the transformed space
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
---
### Assumption 1: Linearity
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
- DON'T worry if the data is bunched in some areas left-to-right
- DO worry if the data appears to be bunched above/below the line
---
### Assumption 1: Linearity
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
- DON'T worry if the data is bunched in some areas left-to-right
- DO worry if the data appears to be bunched above/below the line
---
### Assumption 2: Normality
- `\(\epsilon\)` must be distributed **normally** - i.e. from a bell curve
- **How to check**: Make a histogram and QQ-plot of the residuals, and check whether they appear normally distributed (code sketched below)
- You should observe a roughly bell-shaped curve. Anything else indicates that the normality assumption is violated
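Both checks are one-liners in base R, assuming the `model` fit earlier:
```r
hist(resid(model), breaks = 20)  # look for a rough bell shape
qqnorm(resid(model))             # points should fall near the reference line
qqline(resid(model))
```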
---
### Assumption 2: Normality
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
- DON'T worry if the histogram shows a somewhat spiky pattern - this happens a lot due to inherent randomness when your sample size is small
- DO worry if you see multiple modes emerge in the histogram - an 'M' shape is almost certainly evidence of a problem
---
## Assumption 3: Constant Error
- The error variance `\(\sigma^2\)` can't change as `\(X\)` changes
- **How to check**: Plot the predicted value `\(\hat{Y}\)` against residuals. The spread above/below zero shouldn't change.
<img src="intro_regression_slides_files/regression.png" width="80%" style="display: block; margin: auto;" />
---
### Assumption 3: Constant Error
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
---
### Assumption 3: Constant Error
<img src="intro_regression_slides_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
- Note how the residuals get larger as the predicted value increases. This is bad.
---
## Assumption 4: Independence
- Each `\(Y_i\)` can't depend on any other `\(Y_j\)`, beyond what's captured in `\(X\)`
- Common issues with this assumption are:
- **Serial effect**: If data are collected over time, there is a chance of autocorrelation in the dataset (a quick check is sketched below)
- **Cluster effect**: If `\(Y\)` depends on some grouping variable that's not included in your model, observations within a group will be correlated
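A quick numeric check for the serial effect, assuming a fitted model `fit` whose rows are ordered in time (hypothetical - the wages data isn't a time series):
```r
acf(resid(fit))  # bars outside the dashed bands suggest autocorrelation
```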
---
### Example Residuals: Cluster Effect
```r
ggplot(data=pew_data, mapping = aes(x=percapitaincome,y=residuals,color=State)) +
geom_point() +
geom_hline(yintercept=0,color="red") +
labs(title="Residuals vs. Per Capita Income",
x="Per Capita Income ($)")
```
<img src="intro_regression_slides_files/residuals_cluster_effect.jpg" width="80%" style="display: block; margin: auto auto auto 0;" />
---
### Example Residuals: Serial Effect
```r
ggplot(data=pew_data, aes(x=Year,y=residuals,color=State)) + geom_point() +
geom_hline(yintercept=0,color="red")+
labs("Residuals vs. Year") +
scale_x_continuous(breaks=seq(2000,2009,1))
```
<img src="intro_regression_slides_files/residuals_serial_effect.jpg" width="80%" style="display: block; margin: auto auto auto 0;" />
---
### Common Scenarios That Violate Assumptions
- **I'm predicting one or more time series**: Most time series suffer from some amount of *autocorrelation*, which violates the independence assumption. A common fix is to calculate the growth rate between each time step and run your regression on that, though this isn't guaranteed to remove the autocorrelation (a sketch follows this list)
- **I'm predicting an index value, like app ratings**: Because indexes are typically bounded, the normality assumption breaks down as we get closer to the bounds. Try dividing your data into discrete categories and using *multinomial regression*
- **I'm predicting the number of times something happens**: Similarly, as `\(Y\)` approaches `\(0\)`, the assumption of normality breaks down. This isn't a huge problem if your observations aren't close to zero. Otherwise, consider *Poisson regression* for a more appropriate model
- **I'm predicting a binary variable, with a yes/no response**: This will violate the normality assumption - use *logistic regression* instead
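A minimal sketch of the growth-rate fix from the first bullet, assuming a numeric vector `y` ordered in time (hypothetical data):
```r
growth <- diff(y) / head(y, -1)  # period-over-period growth rate
# or diff(log(y)) for log growth; regress `growth`, not `y`, on your predictors
```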
---
## Cautions
- Avoid extrapolation:
- Relationships can change at different portions of the data
- Almost all continuous functions are locally linear - but a nonlinear trend might emerge as you extend beyond the scope of your data
- Regression shows only correlation, not causation
- Proving causality requires a carefully designed experiment or carefully accounting for confounding variables in an observational study
- Be careful of providing explanatory variables that are too correlated with each other (a quick diagnostic is sketched below)
- You can use model selection techniques to help understand which variables you should retain
- Model selection is an iterative process
- Don't be afraid to change your model based on the outcome of initial regressions
- BUT, monkeying around with your model in pursuit of better p-values is not scientific. If you're deciding between a lot of different models, engage a statistician to help you capture the uncertainty associated with this process.
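One common diagnostic for overly correlated predictors (not covered in these slides) is the variance inflation factor, assuming the `car` package is installed:
```r
car::vif(model)  # values well above ~5-10 suggest problematic collinearity
```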
---
### Important Topics We Didn't Cover:
- **Interaction terms**: What to do when some of your variables might produce an additional response when viewed together
- **Model selection**: How to know which variables to include in your model
- **Outlier detection**: Use of Cook's Distance, robust regression, and other techniques for handling outliers
- **Logistic Regression**: When your observed variable is a binary (yes/no) response
- **Multinomial Regression**: Similar to logistic regression, when your response is one of several discrete categories
- **Penalized regression**: Wide class of techniques used to obtain more stable estimates of `\(\beta\)` at the expense of introducing some bias
- **Poisson regression**: Used to model count-based data
- **Bayesian approaches to regression**: How to use priors to gain estimates of the distribution of `\(\hat{\beta}, \hat{\sigma^2}\)`
---
## Try it out
- Download and save the .Rmd from [here](https://raw.githubusercontent.com/wyseguy7/intro_regression/master/intro_regression.Rmd) so we can step through exercises together
---
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/remark/0.14.0/remark.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false,
"slideNumberFormat": "%current%"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_HTMLorMML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>