---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.11.5
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
# Hypothesis testing with the general linear model
## General linear model reprise
This page starts at the same place as [introduction to the general linear model](https://matthew-brett.github.io/teaching/glm_intro.html).
```{python}
# Import numerical and plotting libraries
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
# Only show 6 decimals when printing
np.set_printoptions(precision=6)
```
In that page, we had questionnaire measures of psychopathy from 12 students:
```{python}
psychopathy = np.array([11.416, 4.514, 12.204, 14.835,
                        8.416, 6.563, 17.343, 13.02,
                        15.19, 11.902, 22.721, 22.324])
```
We also had skin-conductance scores from the palms of each of the same 12
students, to get a measure of how sweaty they are:
```{python}
clammy = np.array([0.389, 0.2, 0.241, 0.463,
                   4.585, 1.097, 1.642, 4.972,
                   7.957, 5.585, 5.527, 6.964])
```
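To get a feel for the data, we can plot the psychopathy scores against the
clammy scores:
```{python}
# Scatter plot of psychopathy scores against skin-conductance scores.
plt.scatter(clammy, psychopathy)
plt.xlabel('clammy (skin conductance)')
plt.ylabel('psychopathy score')
plt.title('Psychopathy as a function of clamminess')
```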
We believe that the `clammy` score has some straight-line relationship to
the `psychopathy` scores. $n$ is the number of elements in `psychopathy`
and `clammy`: $n = 12$. Call the 12 values for `psychopathy` $\vec{y} =
[y_1, y_2, .... , y_n]$. The 12 values for `clammy` are $\vec{x} = [x_1,
x_2, ... , x_n]$. Our straight line model is:
$$
\newcommand{\yvec}{\vec{y}}
\newcommand{\xvec}{\vec{x}}
\newcommand{\evec}{\vec{\varepsilon}}
\newcommand{\Xmat}{\boldsymbol X}
\newcommand{\bvec}{\vec{\beta}}
\newcommand{\bhat}{\hat{\bvec}}
\newcommand{\yhat}{\hat{\yvec}}
\newcommand{\ehat}{\hat{\evec}}
\newcommand{\cvec}{\vec{c}}
\newcommand{\rank}{\textrm{rank}}
y_i = c + b x_i + e_i
$$
where $c$ is the intercept, $b$ is the slope, and $e_i$ is the remainder of
$y_i$ after subtracting $c + b x_i$.
We then defined a new vector $\evec = [e_1, e_2, ... e_n]$ for remaining
error, and rewrote the same formula in vector notation:
$$
\yvec = c + b \xvec + \evec
$$
We defined a new $n=12$ element vector $\vec{1}$ containing all ones, and
used this to build a two-column *design matrix* $\Xmat$, with first column
$\vec{1}$ and second column $\vec{x}$. This allowed us to rewrite the vector
formulation as a matrix multiplication and addition:
$$
\yvec = \Xmat \bvec + \evec
$$
where $\bvec$ is:
$$
\left[
\begin{array}{c}
c \\
b \\
\end{array}
\right]
$$
<!-- note:
We will often use vectors, such as $\vec{x}$, in matrix operations, such
as $\boldsymbol X \vec{x}$, where $\boldsymbol X$ is a matrix. When we do
this, we assume the default that for any vector $\vec{v}$, $\vec{v}$ is a
column vector, and therefore that $\vec{v}^T$ is a row vector. -->
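To make the matrix formulation concrete, we can build $\Xmat$ from the column
of ones and `clammy`, and check that $\Xmat \bvec$ gives the same values as
$c + b x_i$ for some example values of $c$ and $b$.  The values below are
arbitrary, chosen only for illustration:
```{python}
# Two-column design matrix: a column of ones, then the clammy scores.
X_demo = np.column_stack([np.ones(len(clammy)), clammy])
# Arbitrary illustrative values for the intercept and slope.
c_guess, b_guess = 10, 0.9
# The matrix product X @ [c, b] gives the same fitted values as the
# element-by-element formula c + b * x.
fitted_matrix = X_demo @ np.array([c_guess, b_guess])
fitted_elementwise = c_guess + b_guess * clammy
np.allclose(fitted_matrix, fitted_elementwise)
```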
Using the matrix formulation of the general linear model, we found the least
squares *estimate* for $\bvec$ is:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
The formula above applies when $\Xmat^T \Xmat$ is invertible. Generalizing to
the case where $\Xmat^T \Xmat$ is not invertible, the least squares estimate
is:
$$
\bhat = \Xmat^+ \yvec
$$
where $\Xmat^+$ is the [Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse) of $\Xmat$.
The `^` on $\bhat$ reminds us that this is an *estimate* of $\bvec$. We
derived this $\bhat$ estimate from our sample, hoping that it will be a
reasonable estimate for the $\bvec$ that applies to the whole population.
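We can check numerically that the two formulae give the same estimate for our
data; the `_demo` names below are just for this check:
```{python}
# Design matrix: column of ones and the clammy scores.
X_demo = np.column_stack([np.ones(len(clammy)), clammy])
# Estimate via the (X^T X)^{-1} X^T y formula.
B_via_inv = npl.inv(X_demo.T @ X_demo) @ X_demo.T @ psychopathy
# Estimate via the Moore-Penrose pseudoinverse.
B_via_pinv = npl.pinv(X_demo) @ psychopathy
B_via_inv, B_via_pinv
```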
## The residual error
$\bhat$ gives us a corresponding estimate of $\evec$:
$$
\ehat = \yvec - \Xmat \bhat
$$
The least squares criterion that we used to derive $\bhat$ specifies that
$\bhat$ is the vector giving us the smallest sum of squares of $\ehat$. We can
write that criterion for $\bhat$ like this:
$$
\bhat = \textrm{argmin}_{\bvec} \sum_{i=1}^n e_i^2
$$
Read this as “$\bhat$ is the value of the vector $\bvec$ that gives the
minimum value for the sum of the squared residual errors”.
From now on, we will abbreviate $\sum_{i=1}^n e_i^2$ as $\sum e_i^2$, assuming
the sum is over all elements, indexed $1 .. n$.
Remembering the definition of the dot product, we can also write $\sum e_i^2$
as the dot product of $\ehat$ with itself:
$$
\sum e_i^2 \equiv \ehat \cdot \ehat
$$
Read $\equiv$ as “equivalent to”. We can also express $\sum e_i^2$ as the
matrix multiplication of $\ehat$ as a row vector with $\ehat$ as a column
vector. Because we assume that vectors are column vectors in matrix
operations, we can write that formulation as:
$$
\sum e_i^2 \equiv \ehat^T \ehat
$$
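As a small numerical check, these ways of writing the sum of squared residuals
all give the same value; the `_demo` names are again just for illustration:
```{python}
# Residuals from the least-squares fit of psychopathy on clammy.
X_demo = np.column_stack([np.ones(len(clammy)), clammy])
e_demo = psychopathy - X_demo @ npl.pinv(X_demo) @ psychopathy
# Sum of squares, dot product, and explicit row-vector times
# column-vector all give the same number.
row_times_col = e_demo[None, :] @ e_demo[:, None]
np.sum(e_demo ** 2), e_demo.dot(e_demo), row_times_col[0, 0]
```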
## Unbiased estimate of population variance
We will soon need an unbiased estimate of the population variance. The
population variance is $\frac{1}{N} \sum e_i^2$ where the population has $N$
elements, and $e_1, e_2, ... e_N$ are the remaining errors for all $N$
observations in the population.
However, we do not have all $N$ observations in the population; we only have
an $n$-sized *sample* from the population. In our particular case, $n=12$.
We could use the sample variance as this estimate: $\frac{1}{n} \sum e_i^2$.
Unfortunately, for [reasons](https://en.wikipedia.org/wiki/Bessel%27s_correction) we don’t have space to go
into, this is a *biased estimate of the population variance*.
To get an unbiased estimate of the variance, we need to allow for the number
of independent columns in the design $\Xmat$. The number of independent
columns in the design is given by the [matrix rank](http://matthew-brett.github.io/teaching/matrix_rank.html) of $\Xmat$. Specifically,
if $\rank(\Xmat)$ is the matrix rank of $\Xmat$, an unbiased estimate of
population variance is given by:
$$
\hat\sigma^2 = \frac{1}{n - \rank(\Xmat)} \sum e_i^2
$$
For example, we saw in the [worked example of GLM](https://github.com/bic-berkeley/psych-214-fall-2014/mean_test_example.html) that, when we have a
single regressor and $\rank(\Xmat) = 1$, we divide the sum of squares of the
residuals by $n - 1$, where $n$ is the number of rows in the design. This
$n - 1$ divisor is [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction).
We will also use these terms below:
* $\rank(\Xmat)$: *degrees of freedom of the design*;
* $n - \rank(\Xmat)$: *degrees of freedom of the error*.
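For the simplest case of a design with a single column of ones, we can check
this estimate against Numpy's Bessel-corrected variance, via the `ddof=1`
(delta degrees of freedom) argument to `np.var`:
```{python}
# Design with a single column of ones: the rank is 1.
X_ones = np.ones((len(psychopathy), 1))
e_ones = psychopathy - X_ones @ npl.pinv(X_ones) @ psychopathy
# Unbiased variance estimate: divide by n - rank(X) = n - 1.
df = len(psychopathy) - npl.matrix_rank(X_ones)
sigma2_unbiased = e_ones.dot(e_ones) / df
# Same as Numpy's variance with Bessel's correction.
sigma2_unbiased, np.var(psychopathy, ddof=1)
```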
## Hypothesis testing
We used contrast vectors to form particular linear combinations of the
parameter estimates in $\bhat$. For example, we used the contrast vector
$\cvec = [0, 1]$ to select the estimate for $b$ – the slope of the line:
$$
\hat{b} = [0, 1] \bhat
$$
### t tests using contrast vectors
The formula for a t statistic test on any linear combination of the parameters
in $\bhat$ is:
$$
\newcommand{\cvec}{\vec{c}}
t = \frac{\cvec^T \bhat}
{\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}
$$
where $\hat{\sigma}^2$ is our unbiased estimate of the population variance.
Here is the t statistic calculation in Python:
```{python}
# Data vector
y = psychopathy
# Covariate vector
x = clammy
# Contrast vector as column vector
c = np.array([[0, 1]]).T
n = len(y)
# Design matrix
X = np.ones((n, 2))
X[:, 1] = x
# X.T X is invertible
iXtX = npl.inv(X.T.dot(X))
# Least-squares estimate of B
B = iXtX.dot(X.T).dot(y)
e = y - X.dot(B)
# Degrees of freedom of design
rank_x = npl.matrix_rank(X)
# The two columns are not collinear, so the rank is 2
rank_x
```
```{python}
# Unbiased estimate of population variance
df_error = n - rank_x
s2_hat = e.dot(e) / df_error
t = c.T.dot(B) / np.sqrt(s2_hat * c.T.dot(iXtX).dot(c))
t
```
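As an optional cross-check, if Scipy is available, `scipy.stats.linregress`
fits the same intercept-and-slope model; its slope estimate divided by the
slope's standard error should match the t statistic we calculated by hand:
```{python}
# Optional cross-check (assumes scipy is installed).
from scipy.stats import linregress

result = linregress(clammy, psychopathy)
# t statistic for the slope is the slope divided by its standard error.
result.slope / result.stderr
```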
## F tests
F tests are another way to test hypotheses about linear models. They are
particularly useful for testing whether there is a significant reduction in
the residual error when adding one or more regressors.
The simplest and generally most useful way of thinking of the F test is as a
test comparing two models: a *full model* and a *reduced model*. The full
model contains the regressors that we want to test. We will use $\Xmat_f$ for
the full model. The reduced model is a model that does not contain the
regressors we want to test, but does contain all the other regressors in the
full model. We will use $\Xmat_r$ for the reduced model.
In our case, $\Xmat_f$ is the model containing the `clammy` regressor, as
well as the column of ones that models the intercept.
$\Xmat_r$ is our original model, which contains only the column of ones.
If the full model is a better fit to the data than the reduced model, then
adding the new regressor(s) will cause a convincing drop in the size of
residuals.
The F statistic reflects the drop in the sum of squared residuals that
results from adding the new regressors.
Now we define $SSR(\Xmat_r)$ and $SSR(\Xmat_f)$. These are the Sums of
Squares of the Residuals of the reduced and full models, respectively.
$$
\bhat_r = \Xmat_r^+ \yvec \\
\hat\evec_r = \yvec - \Xmat_r \bhat_r \\
SSR(\Xmat_r) = \hat\evec_r^T \hat\evec_r \\
\bhat_f = \Xmat_f^+ \yvec \\
\hat\evec_f = \yvec - \Xmat_f \bhat_f \\
SSR(\Xmat_f) = \hat\evec_f^T \hat\evec_f
$$
$ESS = SSR(\Xmat_r) - SSR(\Xmat_f)$ is the Extra Sum of Squares explained by
the full model compared to the reduced model. The top half of the
ratio that forms the F statistic is $ESS / \nu_1$, where $\nu_1$ is the number
of extra independent regressors (columns) in $\Xmat_f$ compared to $\Xmat_r$.
Specifically:
$$
\nu_1 = \rank(\Xmat_f) - \rank(\Xmat_r)
$$
The bottom half of the F statistic is the estimated variance $\hat{\sigma}^2$
from the full model. This can also be written as $SSR(\Xmat_f) / \nu_2$ where
$\nu_2$ is the *degrees of freedom of the error*:
$$
\begin{eqnarray}
F_{\nu_1, \nu_2} & = &
\frac{
(\hat\evec_r^T \hat\evec_r - \hat\evec_f^T \hat\evec_f)
/ \nu_{1} }
{\hat\evec_f^T \hat\evec_f / \nu_{2}} \\
& = &
\frac{
(\textrm{SSR}(\Xmat_r) - \textrm{SSR}(\Xmat_f)) / \nu_1}
{\textrm{SSR}(\Xmat_f) / \nu_2}
\end{eqnarray}
$$
Here is the F-statistic calculation in Python:
```{python}
# We already have X, e, rank_x, for the full model, from
# the t calculation
X_f, e_f, rank_f = X, e, rank_x
# Now calculate the same for the reduced model
X_r = np.ones((n, 1))
iXtX_r = npl.inv(X_r.T.dot(X_r))
B_r = iXtX_r.dot(X_r.T).dot(y)
e_r = y - X_r.dot(B_r)
rank_r = npl.matrix_rank(X_r) # One column, rank 1
rank_r
```
```{python}
# Calculate the F statistic
SSR_f = e_f.dot(e_f)
SSR_r = e_r.dot(e_r)
nu_1 = rank_f - rank_r
F = ((SSR_r - SSR_f) / nu_1) / (SSR_f / (n - rank_f))
F
```
For reasons that we have not explained here, when the full model adds only a
single column, the F statistic is the square of the t statistic testing that
same column:
```{python}
t ** 2
```
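We can confirm that the two values agree, up to floating point error:
```{python}
# The squared t statistic should equal the F statistic.
np.allclose(t ** 2, F)
```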