# OLS and lasso
## Load packages
```{r load_packages}
library(glmnet)
library(ggplot2)
```
## Load data
Load the `train_x_reg`, `train_y_reg`, `test_x_reg`, and `test_y_reg` variables we defined in 02-preprocessing.Rmd for the OLS and lasso *regression* tasks.
```{r}
# Objects: train_x_reg, train_y_reg, test_x_reg, test_y_reg
load("data/preprocessed.RData")
```
## Overview
* LASSO = shrinks the beta coefficients of predictors unrelated to Y to exactly zero, removing them from the model
* RIDGE = shrinks the beta coefficients of predictors unrelated to Y NEAR ZERO but does not remove them
* ELASTICNET = a weighted combination of LASSO and RIDGE
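All three are special cases of the penalized least squares objective that glmnet minimizes, where $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge:

$$\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]$$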
Below is a refresher on ordinary least squares (OLS) linear regression, which here predicts age using the other variables as predictors.
```{r}
# Fit the regression model; lm() will automatically add a temporary intercept column
fit = lm(train_y_reg ~ ., data = train_x_reg)
# View the output
summary(fit)
# Predict outcome for the test data
predicted = predict(fit, test_x_reg)
# Calculate mean-squared error
(mse_reg = mean((test_y_reg - predicted)^2))
# Root mean-squared error: we are off by ~7.54 years when predicting age
sqrt(mse_reg)
```
Review "Challenge 0" in the Challenges folder for a useful review of how OLS regression works and [see the yhat blog](http://blog.yhat.com/posts/r-lm-summary.html) for help interpreting its output.
Linear regression is a useful introduction to machine learning, but in your research you might be faced with warning messages after `predict()` about the [rank of your matrix](https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it).
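As a toy illustration (hypothetical data, not this workshop's dataset), a perfectly collinear column makes the design matrix rank-deficient: `lm()` silently drops the redundant predictor and `predict()` warns about it.

```{r}
# Hypothetical example: x2 is an exact multiple of x1, so the design
# matrix is rank-deficient.
toy = data.frame(x1 = rnorm(20))
toy$x2 = toy$x1 * 2
toy$y = toy$x1 + rnorm(20)
toy_fit = lm(y ~ x1 + x2, data = toy)
coef(toy_fit)                    # x2 is NA: lm() dropped it
predict(toy_fit, newdata = toy)  # warns that the fit is rank-deficient
```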
The lasso is useful for removing features that are unassociated with the outcome from the model. Because glmnet expects a matrix of predictors, use `as.matrix()` to convert the data frame to a matrix.
Be sure to [read the glmnet vignette](https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf).
## Fit model
```{r}
# Cross-validated lasso: alpha = 1 selects the lasso penalty;
# family = "gaussian" fits a continuous outcome.
lasso = cv.glmnet(as.matrix(train_x_reg), train_y_reg, family = "gaussian", alpha = 1)
```
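Note that `cv.glmnet()` assigns observations to cross-validation folds at random, so results vary slightly between runs. A minimal sketch of making them reproducible (`nfolds = 10` is already the default; the seed value is arbitrary):

```{r}
# Optional: fix the random fold assignment so the cross-validation
# results are reproducible across runs.
set.seed(1)
lasso = cv.glmnet(as.matrix(train_x_reg), train_y_reg,
                  family = "gaussian", alpha = 1, nfolds = 10)
```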
## Investigate Results
Visualize cross-validated mean-squared error as a function of log(lambda).
```{r}
plot(lasso)
# Help interpreting this plot: https://stats.stackexchange.com/questions/404795/interpretation-of-cross-validation-plot-for-lasso-regression
# Generate our own version, but plot lambda (not on log scale) vs. RMSE.
qplot(lasso$lambda, sqrt(lasso$cvm)) + theme_minimal()
```
> NOTE: when log(lambda) equals 0, lambda equals 1. In this graph, the far right side is over-penalized: the model emphasizes keeping the beta coefficients small. As log(lambda) becomes increasingly negative, lambda gets correspondingly closer to zero and we approach the OLS solution.
```{r}
# And here is a plot of log(lambda) vs lambda.
qplot(log(lasso$lambda), lasso$lambda) + theme_minimal()
```
Plot how each coefficient changes across lambda values (the coefficient paths):
```{r}
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
```
Show the lambda that results in the minimum estimated mean-squared error (MSE):
```{r}
lasso$lambda.min
```
Show the higher lambda whose performance is within [one standard error](https://stats.stackexchange.com/questions/80268/empirical-justification-for-the-one-standard-error-rule-when-using-cross-validat) of the minimum:
```{r}
lasso$lambda.1se
# Log scale versions:
log(c("log_min" = lasso$lambda.min, "log_1se" = lasso$lambda.1se))
```
Look at the coefficients at `lambda.1se`:
```{r}
(coef_1se = coef(lasso, s = "lambda.1se"))
```
Look at the coefficients for `lambda.min`:
```{r}
(coef_min = coef(lasso, s = "lambda.min"))
# Compare side-by-side
cbind(as.matrix(coef_1se), as.matrix(coef_min))
```
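A quick sketch of extracting which predictors each penalty level retains, i.e., the nonzero coefficients (the intercept will always appear in these lists):

```{r}
# Convert the sparse coefficient vectors to dense matrices, then list
# the predictors with nonzero coefficients.
coef_1se_m = as.matrix(coef_1se)
coef_min_m = as.matrix(coef_min)
rownames(coef_1se_m)[coef_1se_m != 0]  # kept at lambda.1se
rownames(coef_min_m)[coef_min_m != 0]  # kept at lambda.min
```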
Predict on the test set:
```{r}
predictions = predict(lasso, newx = as.matrix(test_x_reg),
s = lasso$lambda.1se)
# How far off were we, based on absolute error?
rounded_errors = round(abs(test_y_reg - predictions))
table(rounded_errors)
# Group the absolute error into 4 bins.
grouped_errors = round(abs(test_y_reg - predictions) / 5)
grouped_errors[grouped_errors > 2] = 3
table(grouped_errors)
# 4 categories of accuracy
how_close = factor(grouped_errors, labels = c("very close", "close", "meh", "far"))
table(rounded_errors, how_close)
# Scatter plot of actual vs. predicted age. qplot() already draws the
# points, so set size and alpha there instead of adding a second layer.
qplot(test_y_reg, as.vector(predictions), color = how_close,
      size = I(5), alpha = I(0.7)) +
  theme_minimal()
```
Calculate MSE and RMSE:
```{r}
# Calculate mean-squared error.
mean((predictions - test_y_reg)^2)
# Calculate root mean-squared error.
sqrt(mean((predictions - test_y_reg)^2))
```
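For comparison, `mse_reg` from the OLS model at the start of this lesson is still in the workspace, so we can put the two test-set errors side by side:

```{r}
# Compare test-set RMSE: OLS vs. the cross-validated lasso at lambda.1se.
c(ols = sqrt(mse_reg),
  lasso = sqrt(mean((predictions - test_y_reg)^2)))
```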
## Challenge 1
Open Challenge 1 in the "Challenges" folder.