
2 - Transformations of predictor variables - Linearity problems

2.1 - Working with transformed samples

Sometimes we can focus our attention on the x's: the linear model is termed linear not because the regression function is a plane, but because it is linear in the parameters.

Rather than working with the sample

$$(x_{1}, y_{1}),...,(x_{n}, y_{n})$$

we consider the transformed sample

$$(\tilde{x}_{1}, y_{1}),...,(\tilde{x}_{n}, y_{n})$$

Practically, this adds a new column to the design matrix but does not change the fact that the model is still linear: $Y = X\beta +\varepsilon$.

Furthermore, the interpretation does not change that much.

For example consider these linear models and the possible transformed samples:

  1. $Y = \beta_{0} + \beta_{1}x^{2} + \varepsilon$
    1. where we can work with $\tilde{x}_{i} = x_{i}^{2}$
  2. $Y = \beta_{0} + \beta_{1}\log(x) + \varepsilon$
    1. where we can work with $\tilde{x}_{i} = \log(x_{i})$
  3. $Y = \beta_{0} + \beta_{1}(x^{3}-\log(|x|)+ 2^{x}) + \varepsilon$
    1. where we can work with $\tilde{x}_{i} = x_{i}^{3}-\log(|x_{i}|)+ 2^{x_{i}}$

This means we keep the model linear as far as the estimates of the beta parameters are concerned, while the x's can take any form (quadratic, polynomial, etc.), because they only affect the design matrix.

Sometimes these transformations can help with violations of the model assumptions, and other times they can simply be used to fit a more flexible model!
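To make this concrete, here is a minimal sketch with simulated data (all names are made up for the example): the predictor enters through log(x), yet lm() still fits a model that is linear in the betas.

set.seed(1)
x <- runif(50, 1, 10)                 # simulated predictor
y <- 2 + 3 * log(x) + rnorm(50, sd = 0.5)

x_tilde <- log(x)                     # the transformed sample (x_tilde_i, y_i)
fit <- lm(y ~ x_tilde)                # equivalent to lm(y ~ log(x))
coef(fit)                             # estimates of beta0 and beta1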

2.1.1 - Car Dataset Example

Let's use the car dataset to model mpg as a function of hp.

We first attempt a SLR, but we see a rather obvious pattern in the fitted versus residuals plot, which includes increasing variance.

[Figure: fitted versus residuals plot for the SLR fit]

We attempt a log transform of the response (rough approximation)

[Figure: fitted versus residuals plot after log-transforming the response]

After performing the log transform of the response, we still see some of the same issues in the fitted versus residuals plot. We then try log-transforming the predictor as well.

[Figure: fitted versus residuals plot after log-transforming both response and predictor]

Here, our fitted versus residuals plot looks good.
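A minimal sketch of this progression, assuming the built-in mtcars dataset stands in for the car dataset used above:

fit_slr  <- lm(mpg ~ hp, data = mtcars)            # plain SLR
fit_logy <- lm(log(mpg) ~ hp, data = mtcars)       # log-transform the response
fit_both <- lm(log(mpg) ~ log(hp), data = mtcars)  # log-transform both

# fitted versus residuals plot for the final model
plot(fitted(fit_both), resid(fit_both), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = "red")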

2.1.2 - Tips

  1. If you apply a nonlinear transformation, namely $f()$, and fit the linear model $Y = \beta_{0} + \beta_{1}f(x) + \varepsilon$, then there is no point in also fitting the model resulting from the negated transformation $-f()$.
    1. The model with $-f()$ is exactly the same as the one with $f()$, but with the sign of $\beta_{1}$ flipped (see the sketch after this list)!
  2. As a rule of thumb, compare the data pattern with a reference figure of the common transformations, choose the most similar curve, and finally apply the corresponding function with positive sign.
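A minimal sketch of the first tip, with simulated data (names made up for the example):

set.seed(1)
x <- runif(40, 1, 10)
y <- 1 + 2 * log(x) + rnorm(40)

fit_pos <- lm(y ~ I(log(x)))                 # fit with f()
fit_neg <- lm(y ~ I(-log(x)))                # fit with -f()

coef(fit_pos)                                # slope estimate, say b
coef(fit_neg)                                # same fit, slope estimate is -b
all.equal(fitted(fit_pos), fitted(fit_neg))  # TRUE: identical fitted values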

2.2 - Polynomials

A common “transformation” of a predictor variable is the polynomial transformation. Polynomials are very useful as they allow for more flexible models, but do not change the units of the variables.

Consider the non-linear transformation, namely $f$: $$Y = f(x) + \varepsilon$$

We make no global assumptions about the function $f$, but assume that locally it can be well approximated by a member of a simple class of parametric functions, e.g. a constant or a straight line.

This is related to Taylor's theorem, which says that any sufficiently smooth function can be approximated by a polynomial.

Taylor's theorem: Suppose $f$ is a real function on $[a,b]$, $f^{(K-1)}$ is continuous on $[a,b]$, and $f^{(K)}(x)$ is bounded for $x\in(a,b)$. Then for any distinct points $x_{0} < x_{1}$ in $[a,b]$ there exists a point $\tilde{x}$ with $x_{0} < \tilde{x} < x_{1}$ such that

$$f(x_{1})= f(x_{0}) + \sum_{k=1}^{K-1} \frac{f^{(k)}(x_{0})}{k!}(x_{1}-x_{0})^{k} + \frac{f^{(K)}(\tilde{x})}{K!}(x_{1}-x_{0})^{K}$$

Notice: if we view $f(x_{0})+\sum_{k=1}^{K-1} \frac{f^{(k)}(x_{0})}{k!}(x_{1}-x_{0})^{k}$ as a function of $x_{1}$, it is a polynomial in the family of polynomials

$$\mathcal{P}_{K+1} = \left\{ f(x) = a_{0} + a_{1}x + ... + a_{K}x^{K},\ (a_{0},...,a_{K})' \in \mathbb{R}^{K+1} \right\}$$

Using polynomials to approximate the predictors can be useful when we are not in a high-dimensional space. The only problem comes with extrapolation, where the estimates become more variable.

We could fit a polynomial of an arbitrary order $K$,

$$Y_{i} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}x_{i}^{2}+...+ \beta_{K}x_{i}^{K} + \varepsilon_{i}$$

and we can think of the polynomial model as a Taylor series expansion of the unknown function.
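For intuition, a minimal sketch approximating a known smooth function with Taylor polynomials of increasing order (a toy example, not from the original notes):

x  <- seq(-1, 1, length.out = 100)
p1 <- 1 + x               # Taylor polynomial of exp(x) around 0, order 1
p2 <- 1 + x + x^2 / 2     # order 2

max(abs(exp(x) - p1))     # worst-case error with the linear approximation
max(abs(exp(x) - p2))     # smaller: the error shrinks as the order grows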

2.2.1 - Example Polynomials

Suppose you work for an automobile manufacturer which makes a large luxury sedan. You would like to know how the car performs from a fuel efficiency standpoint when it is driven at various speeds.

Instead of testing the car at every conceivable speed (which would be impossible), you create an experiment where the car is driven at speeds of interest in increments of 5 miles per hour (a response surface design).

Our goal, then, is to fit a model to these data in order to predict fuel efficiency when driving at certain speeds.

[Figure: scatterplot of mpg versus mph]

We see a pattern, but it is not a linear pattern! Let's say we are very stubborn and wish to fit an SLR to these data.

econ <- read.csv("data/fuel-econ.csv")
fit1 <- lm(mpg ~ mph, data = econ)
summary(fit1)

Call:
lm(formula = mpg ~ mph, data = econ)

Residuals:
Min     1Q    Median 3Q     Max
-8.337 -4.895 -1.007 4.914 9.191

Coefficients:
              Estimate  Std. Error  t value Pr(>|t|)
(Intercept)   22.74637  2.49877     9.103   1.45e-09 ***
mph           0.03908   0.05312     0.736   0.469
...
Residual standard error: 5.666 on 26 degrees of freedom
Multiple R-squared: 0.02039, Adjusted R-squared: -0.01729
F-statistic: 0.5411 on 1 and 26 DF, p-value: 0.4686

[Figure: SLR fit overlaid on the mpg versus mph scatterplot]

Pretty clearly we can do better.

  • Yes, fuel efficiency does increase as speed increases, but only up to a certain point.
  • We need to include the quadratic trend of the data.

We will now add polynomial terms until we obtain a suitable fit.

To add the second order term we need to use the I() function in the model specification.

I() is the identity function, which tells R “leave this alone”. It basically adds to the model matrix a new column with the transformation specified inside the brackets.
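A quick check of why I() matters (using the econ data from above): in R's formula syntax the ^ operator means factor crossing, so mph^2 without I() collapses back to plain mph.

colnames(model.matrix(mpg ~ mph^2, data = econ))          # "(Intercept)" "mph"
colnames(model.matrix(mpg ~ mph + I(mph^2), data = econ)) # adds "I(mph^2)"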

fit2 <- lm(mpg ~ mph + I(mph^2), data = econ)
summary(fit2)

Call:
lm(formula = mpg ~ mph + I(mph^2), data = econ)

Residuals:
Min     1Q      Median  3Q      Max
-2.8411 -0.9694 0.0017  1.0181  3.3900

Coefficients:
            Estimate    Std. Error  t value   Pr(>|t|)
(Intercept) 2.4444505   1.4241091   1.716     0.0984 .
mph         1.2716937   0.0757321   16.792    3.99e-15 ***
I(mph^2)    -0.0145014  0.0008719   -16.633   4.97e-15 ***
---
...
Residual standard error: 1.663 on 25 degrees of freedom
Multiple R-squared: 0.9188, Adjusted R-squared: 0.9123
F-statistic: 141.5 on 2 and 25 DF, p-value: 2.338e-14

The model becomes highly significant, and the residuals are nearly centered around 0!

[Figure: quadratic fit and its fitted versus residuals plot]

While this model clearly fits much better, and the second order term is significant, we still see a pattern in the fitted versus residuals plot which suggests higher order terms will help.

Also, we would expect the curve to flatten as speed increases or decreases, not turn sharply downward as we see here. If we extrapolate, the fitted curve keeps going down, since it is driven by the quadratic term in x.

Two degrees of freedom and a quadratic polynomial are not enough!

  • This is a case where there is a real (unknown) function dictated by the physics of engines; we are merely approximating it.
  • A polynomial of third degree would not be useful, since we are focusing on polynomials of even degree: we want to change how the curve behaves at the extremes, not change its direction (see the sketch below).
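For illustration only, the third-degree fit we are skipping would be specified like this:

fit3 <- lm(mpg ~ mph + I(mph^2) + I(mph^3), data = econ)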

Let's try a polynomial of fourth degree.

fit4 <- lm(mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4), data = econ)

# Summary output omitted on purpose

[Figure: fourth-degree polynomial fit and its fitted versus residuals plot]

The fourth order term is significant with the other terms in the model.

Also, we are starting to see the behavior we expected at low and high speeds.

However, there still seems to be a bit of a pattern in the residuals, so we will again try higher order terms.

We will add the fifth and sixth together, since adding the fifth will be similar to adding the third.

fit6 <- lm(mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4) + I(mph^5) +
            I(mph^6), data = econ)

# Summary output omitted on purpose

[Figure: sixth-degree polynomial fit and its fitted versus residuals plot]

Again, the sixth order term is significant with the other terms in the model, and here we see less pattern in the residuals plot.

2.2.2 - Comparing Models

Let's now test which of the previous two models we prefer. We will test:

$$H_{0}: \beta_{5} = \beta_{6} = 0$$

anova(fit4, fit6)

Analysis of Variance Table

Model 1: mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4)
Model 2: mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4) + I(mph^5) + I(mph^6)

    Res.Df  RSS     Df  Sum of Sq   F       Pr(>F)
1   23      19.922
2   21      15.739  2   4.1828      2.7905  0.0842 .

This test does not reject the null hypothesis at a level of significance of $\alpha = 0.05$, however the p-value is still rather small, and the fitted versus residuals plot is much better for the model with the sixth order term. This makes the sixth order model a good choice.

We could repeat this process one more time with fit8 (you know how to proceed by now).
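A sketch of fit8, following the same pattern as fit4 and fit6 (its definition is not shown in the original output):

fit8 <- lm(mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4) + I(mph^5) +
            I(mph^6) + I(mph^7) + I(mph^8), data = econ)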

anova(fit6, fit8)

Analysis of Variance Table
Model 1: mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4) + I(mph^5) + I(mph^6)
Model 2: mpg ~ mph + I(mph^2) + I(mph^3) + I(mph^4) + I(mph^5) + I(mph^6) +
I(mph^7) + I(mph^8)

    Res.Df  RSS     Df  Sum of Sq   F       Pr(>F)
1   21      15.739
2   19      15.506  2   0.2324      0.1424  0.8682

The eighth order term is not significant with the other terms in the model and the F-test does not reject.

2.2.3 - Make it quicker

There is a quicker way to specify a model with many higher order terms. The method produces the same fitted values ...

fit6_alt <- lm(mpg ~ poly(mph, 6), data = econ)

all.equal(fitted(fit6), fitted(fit6_alt))
[1] TRUE

... but the estimated coefficients are different, because poly() uses orthogonal polynomials. Orthogonal polynomials are used to avoid numerical problems: the columns of a raw polynomial design matrix are highly correlated, which makes the matrix operations ill-conditioned.
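A quick check of this, using the econ data from above:

cor(econ$mph, econ$mph^2)    # raw polynomial columns are nearly collinear

X_ort <- poly(econ$mph, 6)   # columns produced by poly() are orthonormal ...
round(crossprod(X_ort), 10)  # ... so X'X is (numerically) the identity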

coef(fit6)
(Intercept)     mph           I(mph^2)
-4.206224e+00   4.203382e+00  -3.521452e-01
I(mph^3)        I(mph^4)
1.579340e-02    -3.472665e-04
I(mph^5)        I(mph^6)
3.585201e-06    -1.401995e-08

coef(fit6_alt)
(Intercept)     poly(mph, 6)1   poly(mph, 6)2   
24.40714286     4.16769628      -27.66685755 
poly(mph, 6)3   poly(mph, 6)4
 0.13446747     7.01671480
poly(mph, 6)5   poly(mph, 6)6
0.09288754      -2.04307796

To use poly() to obtain the same results as using I() repeatedly, we would need to set raw = TRUE

fit6_alt2 <- lm(mpg ~ poly(mph, 6, raw = TRUE), data = econ)

coef(fit6_alt2)
(Intercept)     poly(mph, 6, raw = TRUE)1   poly(mph, 6, raw = TRUE)2
-4.206224e+00   4.203382e+00                -3.521452e-01
poly(mph, 6, raw = TRUE)3   poly(mph, 6, raw = TRUE)4 
1.579340e-02                -3.472665e-04
poly(mph, 6, raw = TRUE)5   poly(mph, 6, raw = TRUE)6
3.585201e-06                -1.401995e-08

Ta-daaa! The model is still a linear one: linear in the parameters, polynomial in the predictor.


6 - Case Study: Melting Arctic

Background

Melting Arctic sea ice is monitored and used as an indicator for the impacts of climate change.

The summer ice in the Arctic Ocean reflects sunlight. As the ice melts, the much darker sea water absorbs sunlight. This feedback mechanism is understood as an important driver of climate change throughout geologic history.

Other effects: changes in ocean currents and atmospheric weather patterns as well as the possibility of releasing further greenhouse gases by accelerating the melting of Arctic permafrost on land and on the East Siberian Arctic Shelf.

September is the month when the ice stops melting each summer and reaches its minimum extent.

The data we will analyze is a time series of September Arctic sea ice extent from 1979 until 2012.

[Figure: September Arctic sea ice extent, 1979-2012]

1 - Simplest Possible Model

Let's model the relationship Extent-Time as:

$$Extent = f(Time)$$

Let's assume $f(\cdot)$ is linear, thus:

$$Extent = \beta_{0} + \beta_{1} \cdot Time$$

Scientific question:

  • ”Is the September Arctic sea ice extent decreasing with time?”
  • Mathematically speaking, the equivalent question is: is $\beta_{1} < 0$?

As always, the model is $Y_{i} = \beta_{0} + \beta_{1}x_{i} + \varepsilon_{i}$ for $i=1,...,n$,

where $x_{i} = t_{i}$, which means $m(t_{i}) = \mathbb{E}[Y_{i}|t_{i}] = \beta_{0} + \beta_{1}t_{i}$

fit <- lm(extent ~ year)

plot(year, extent, ylab = "1,000,000 km",
    main = "Evolution of sea ice extent", pch = 20, cex.lab = 1.05)
abline(fit, lwd = 2, col = "red")

[Figure: SLR fit overlaid on the sea ice extent series]

The fit is not too bad, except for a few points...

Let's check the residuals

[Figure: residuals versus time for the SLR fit]

The plot of residuals versus time reinforces this point. The residuals from the linear regression tend to be:
  • Negative in early and late years
  • Positive in the middle years

Look back at the data: the decrease is larger in the later years.

[Figure: September Arctic sea ice extent, 1979-2012, shown again]

In fact, we notice a steeper slope as we proceed through time.

[Figure: sea ice extent data, with the steeper decline in later years highlighted]

Intuitively, we can move to the supposition that $\beta_{1}$ is changing with respect to time, i.e. $\beta_{1}(t)$. We consider $\beta_{1}$ a function of time, whose magnitude increases as time passes.

We then build a simple model that allows the slope to change at a constant rate:

$$m(t_{i}) = \mathbb{E}[Y_{i}|t_{i}] = \beta_{0} + \beta_{1}t_{i} + \beta_{2}t_{i}^{2} = \beta_{0} + \beta_{1} \left[ 1 + \frac{\beta_{2}}{\beta_{1}}\cdot t_{i} \right] \cdot t_{i} = \beta_{0} + \beta_{1}(t_{i})\cdot t_{i}$$

A model with a quadratic term corresponds to a model in which the slope is allowed to change as a function of the predictor.

Now we are really facing a quadratic model, where we allow the slope to change!

In a linear model the first derivative is constant; with quadratic terms the first derivative varies. Similar concepts apply to higher order polynomials.
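A minimal sketch of this quadratic fit, assuming the year and extent vectors used above:

fit_quad <- lm(extent ~ year + I(year^2))
summary(fit_quad)

# The factorization above gives the changing slope beta1(t) = beta1 + beta2 * t
b <- coef(fit_quad)
beta1_t <- function(t) b[2] + b[3] * t
beta1_t(c(1979, 2012))  # the slope term at the start and end of the series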

Obviously, this is not the ultimate model for studying the extent of the ice; we are merely approximating it with the roughly 30 observations we are given.

