
Questions #2

Closed
mauro3 opened this issue Nov 30, 2016 · 7 comments

Comments

mauro3 (Collaborator) commented Nov 30, 2016

Sorry to abuse the issue-tracker for questions: You mention "conversion between regularization parameter λ and degrees of freedom". How would I do that? Is there a way to get the knots where the individual splines join? (or is that the wrong way to look at it entirely?)

nignatiadis (Owner) commented

Hi! I am not sure I completely understand your question, but I will take a stab at it. I think the confusion might arise because the term "degrees of freedom" is "flawed", or at the very least often not formally defined.

It's easiest to motivate this using standard linear regression: Y = X\beta + \varepsilon, where X is your n×p predictor matrix, \beta (p×1) is your coefficient vector, \varepsilon is the noise, and Y is the observed response.

At least intuitively, the degrees of freedom in this model are p. We would like a more general definition, though, one that applies to a larger class of models.

Note that if X has full rank, then the least squares estimate is \hat{\beta} = (X^T X)^{-1} X^T Y and your fitted (predicted) Ys are:

\hat{Y} = X\hat{\beta} = X (X^T X)^{-1} X^T Y

So if we call H = X (X^T X)^{-1} X^T (the so-called hat matrix), then \hat{Y} = HY.

It now turns out that one can show Trace(H) = p. Hence we take Trace(H) as the definition of degrees of freedom (there are further generalizations, but this one is enough for smoothing splines).
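
(A quick numerical check of that identity, as a minimal Julia sketch; the sizes n = 50 and p = 4 are arbitrary choices for illustration:)

using LinearAlgebra, Random

Random.seed!(1)
n, p = 50, 4
X = randn(n, p)            # arbitrary full-rank design matrix
H = X * ((X' * X) \ X')    # hat matrix H = X (X^T X)^{-1} X^T
tr(H)                      # ≈ 4.0, i.e. p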

To see that this is more generally applicable, consider the following: Let λ > 0 and define (I being the identity matrix):

S = X (X^T X + λI)^{-1} X^T

and let \hat{Y} = S Y

By adding this diagonal to X^T X, we "damp down" our predictions, i.e. we regularize ("ridge"/Tikhonov regularization). Intuitively, this model should have fewer degrees of freedom than p (and in fact, the larger λ, the smaller the degrees of freedom).

Indeed, with the definition df = Trace(S), it turns out that for λ = 0, df = p, and as λ goes to infinity, df goes to 0, so at least this definition makes some sense.
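
(A small sketch of this, with the same kind of random design matrix as above; the λ values are arbitrary:)

using LinearAlgebra, Random

Random.seed!(1)
X = randn(50, 4)                           # n = 50, p = 4
df(λ) = tr(X * ((X' * X + λ * I) \ X'))    # df = Trace(S) for the ridge smoother
df(0.0)    # = 4.0, i.e. p: no regularization
df(10.0)   # strictly between 0 and 4
df(1e6)    # ≈ 0: very heavy regularization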

Now, what is a smoothing spline? In contrast to a regression spline, here we place knots at every sample point x_i, resulting in a design matrix X of dimension n × n! Now imagine using standard linear regression of Y onto X; if X has full rank it is invertible, so:

H = X(X^T X)^{-1} X^T = I

\hat{Y} = I Y = Y

Degrees of freedom are equal to n.

This is silly, since we have just interpolated the data. But now if you do the ridge trick, i.e. add some positive definite matrix V (which in this case is not diagonal) to X^T X, i.e. take

S_λ = X( X^T X + λV)^{-1} X^T

and fitted values \hat{Y} = S_λ Y

then as λ grows, your degrees of freedom shrink. It turns out that penalizing the curvature of your fitted function (with regularization parameter λ) corresponds to exactly this type of shrinkage for a suitable matrix V.

Hence, given λ, you get your smoother matrix S_λ and thus the degrees of freedom df = Trace(S_λ). That mapping is one-to-one, so given the degrees of freedom you can figure out what λ was.
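
(To make that inversion concrete, here is a minimal sketch that recovers λ from a target df by bisection. It uses the ridge smoother from above as a stand-in for S_λ, and lambda_for_df is a hypothetical helper name; the smoothing-spline case works the same way, since df(λ) is monotone in λ:)

using LinearAlgebra, Random

Random.seed!(1)
X = randn(50, 4)
df(λ) = tr(X * ((X' * X + λ * I) \ X'))

# df(λ) is strictly decreasing in λ, so bisection finds the λ
# that yields a desired number of degrees of freedom.
function lambda_for_df(target; lo = 0.0, hi = 1e8, tol = 1e-10)
    while hi - lo > tol * (1 + hi)
        mid = (lo + hi) / 2
        df(mid) > target ? (lo = mid) : (hi = mid)
    end
    (lo + hi) / 2
end

λ = lambda_for_df(2.0)
df(λ)   # ≈ 2.0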

Sometimes people prefer to think of it one way or the other (and degrees of freedom, even though flawed, at least provide one way of measuring the complexity of different models).

Does this make sense? The Elements of Statistical Learning is a great book to learn about these things.

mauro3 (Collaborator) commented Dec 1, 2016

Thanks for your detailed answer! I guess it mostly boils down to my ignorance of statistics, which will stay that way due to time constraints... What I'm after is understanding the lambda parameter and how to set it. Degrees of freedom are something I, a deterministic modeler, have a grasp on (although not in the statistical sense).

So how do I calculate Trace(S_λ) from spl?

Maybe another way to ask my question would be: what is the smallest wavelength present in the smoothing spline? (I would expect the smallest wavelength to be somewhere around the smallest distance between the knots used.) For example, if I take a noisy sine curve sin(x)+0.1*randn(...) and would like to reproduce it, trial and error gives me lambda = 0.01 .. 1.0. But that lambda depends on the frequency of the sine wave, and I need a different lambda when fitting sin(x/10) + randn(...). Yet I expect the DOFs to be the same for both frequencies. So how do I choose lambda to keep the DOFs similar?

mauro3 (Collaborator) commented Dec 1, 2016

Trial and error gives me the transformation lambda = 0.001 * wave_length^3. Could this be right?
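
(For what it's worth, a cubic scaling is exactly what the standard smoothing-spline objective suggests. The fit minimizes

\sum_i (y_i - f(x_i))^2 + λ \int f''(x)^2 dx.

Stretching the x-axis by a factor c, i.e. fitting g(u) = f(u/c), leaves the residual sum unchanged but rescales the penalty: g''(u) = c^{-2} f''(u/c), so \int g''(u)^2 du = c^{-3} \int f''(x)^2 dx. To obtain an equally smooth fit you must therefore multiply λ by c^3, i.e. λ scales with the cube of the length scale, here the wavelength.)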

mauro3 closed this as completed May 29, 2017

randy3k commented Jun 22, 2017

@mauro3 Did you figure out how to extract the smoothing matrix S_λ?

mauro3 (Collaborator) commented Jun 23, 2017

This is what I use and it seems to work, but it was just trial and error:
https://github.com/mauro3/VAWTools.jl/blob/2ce38557af2fce8056af79880192209d6636dfb2/src/VAWTools.jl#L1446

randy3k commented Jun 23, 2017

Thanks for the link, but I didn't mean the trace; I meant the actual smoothing matrix...

randy3k commented Jun 23, 2017

Found it, though it is not very efficient (it computes R \ Q' and an explicit inv), and it doesn't handle ties.

using SmoothingSplines
using LinearAlgebra

function smoothingmatrix(spl::SmoothingSpline)
    h = diff(spl.Xdesign)   # spacings between the (sorted) design points
    λ = spl.λ
    Q = SmoothingSplines.ReinschQ(h)
    R = SmoothingSplines.ReinschR(h)
    n = length(h) + 1
    # Reinsch form of the smoother: S_λ = (I + λ Q R⁻¹ Qᵀ)⁻¹
    inv(Matrix{Float64}(I, n, n) + λ * (Q * (R \ Q')))
end
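
(A possible way to use this, assuming SmoothingSplines.jl's fit/predict API; the λ = 0.05 below is an arbitrary choice, and ties in X would still be a problem:)

using SmoothingSplines, LinearAlgebra

X = collect(range(0.0, 2π, length = 100))   # sorted, tie-free design points
Y = sin.(X) .+ 0.1 .* randn(100)
spl = fit(SmoothingSpline, X, Y, 0.05)

S = smoothingmatrix(spl)
tr(S)                                   # effective degrees of freedom for this λ
maximum(abs.(S * Y .- predict(spl)))    # should be ≈ 0: S reproduces the fit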
