-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathsingularity.qmd
154 lines (112 loc) · 5 KB
/
singularity.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
title: "Convergence, singularity and all that"
author: "Douglas Bates"
engine: julia
julia:
exeflags: ["--project", "--threads=auto"]
---
# What does it mean for a converged model to be singular?
Add the packages to be used
```{julia}
#| output: false
#| code-fold: true
using CairoMakie
using DataFrames
using LinearAlgebra
using MixedModels
using MixedModelsMakie
const progress=false
```
Fit a model for reaction time in the sleepstudy example, preserving information on the estimation progress (the `thin=1` optional argument)
```{julia}
m01 = let f = @formula reaction ~ 1 + days + (1 + days|subj)
fit(MixedModel, f, MixedModels.dataset(:sleepstudy); thin=1, progress)
end
print(m01)
```
The covariance matrix for each subject's random effects is evaluated from its "matrix square root", called the Cholesky factor.
```{julia}
λ = only(m01.λ)
```
The *transpose* of $\lambda$, written $\lambda'$, is an upper triangular matrix generated by "flipping" $\lambda$ about the main diagonal.
```{julia}
λ'
```
The product $\lambda * \lambda'$ will be symmetric.
The covariance matrix of the random effects, $\Sigma$, is this symmetric matrix scaled by $\sigma^2$
```{julia}
Σ = m01.σ^2 * λ * λ'
```
The estimated variances of the random effects, which are the diagonal elements of $\Sigma$, correspond to the values shown in the table.
To evaluate the covariance, isolate the correlation
```{julia}
# m01.σρs extracts the `σρs` property of the model.
# This property is a NamedTuple where the names
# correspond to grouping factors - in this case, `subj`.
# So `m01.σρs.subj.ρ` is the estimated correlation(s) for
# this grouping factor. Because there is only one such correlation
# we can extract it with `only()`, which also verifies that
# there is exactly one.
ρ = only(m01.σρs.subj.ρ)
```
and multiply by the standard deviations
```{julia}
ρ * sqrt(first(Σ) * last(Σ))
```
# The factor is generated from a parameter vector
In practice we optimize the log-likelihood with respect to a parameter vector called $\theta$ that generates $\lambda$.
```{julia}
m01.θ
```
The elements of this parameter vector are subject to constraints.
In particular, two of the three elements have a lower bound of zero.
```{julia}
m01.lowerbd
```
That is, the first and third elements of $\theta$, corresponding to diagonal elements of $\lambda$, must be non-negative, whereas the second component is unconstrained (has a lower bound of $-\infty$).
# Progress of the iterations
The `optsum.fitlog` property of the model is a vector of tuples where each tuple contains the value of the $\theta$ vector and the value of the objective at that $\theta$.
The `fitlog` always contains the first and the last evaluation.
When the `thin` named argument is set, this property has a row for every thin'th evaluation.
```{julia}
m01.optsum.fitlog
```
There were 57 evaluations of the objective before convergence was declared, according to rather stringent convergence criteria.
We can see that the last 10 evaluations only produced changes in the fourth decimal place of the objective or even smaller.
That is, effective convergence occurred after about 40 or 45 evaluations.
# A model that converges to a degenerate distribution
When the iterations converge to a value of $\theta$ that is on the boundary, which means that one of the diagonal elements of $\lambda$ is exactly zero, we say that $\lambda$ is *singular*, in the sense that its inverse does not exist.
Another, more evocative characterization, is that the distribution of the random effects is a *degenerate distribution*, in the sense that the values of the random effects are constrained to a plane (technically, a linear subspace) in the space of all possible random effects.
This is shown in section 3.5.3 of [Embrace Uncertainty](https://www.embraceuncertaintybook.com), especially [Figure 3.13](https://www.embraceuncertaintybook.com/longitudinal.html#fig-bxm03plane).
# Appendix
These are some appendices, mostly so that I can keep track of them
## Evaluating the random effects correlation from $\theta$
There is a short-cut for evaluating the correlation which is to "normalize" the second row of $\lambda$, in the sense that the row is scaled so that it has unit length.
```{julia}
normed = normalize!(λ[2, :])
```
providing the correlation as
```{julia}
first(normed)
```
## Optimizing with a fixed correlation
To profile the correlation we need optimize the objective while fixing a value of the correlation.
The way we will do this is to determine $\theta_2$ as a function of the fixed $\rho$ and $\theta_3$.
We need to solve
$$
\rho = \frac{\theta_2}{\sqrt{\theta_2^2 + \theta_3^2}}
$$
for $\theta_2$ as a function of $\rho$ and $\theta_3$.
Notice that $\theta_2$ and $\rho$ have the same sign.
Thus it is sufficient to determine the absolute value of $\theta_2$ then transfer the sign from $\rho$.
$$
\theta_2^2=\rho^2(\theta_2^2 + \theta_3^2)
$$
which implies
$$
\theta_2^2 = \frac{\rho^2}{1-\rho^2}\theta_3^2, \quad \theta_3\ge 0
$$
and thus
$$
\theta_2=\frac{\rho}{\sqrt{1-\rho^2}}\theta_3
$$