standardized residuals return 'inf' #16

TenanATC · 2024-08-09T17:30:19Z

It is unclear what is going on but when using some of the attached sample data with a very basic linear model (as well as more complex models not listed here), the standardized residuals are returning 'inf'. As best I can determine, they are all observations where the raw residuals are greater than 0.5191506. Is this a software bug or some sort of expected behavior I've been unable to diagnose?

Here is the basic model:
g1_p1_adj <- gamlss(pace1 ~ sex + age + adj_tot_time, data = sample_dat, family = 'NO')

sum(is.infinity(resid(g1_p1_adj)))

This obviously makes for errors in both the worm plots and the plot.gamlss() method.

*Edit: note that this issue does not occur in the developmental gamlss2 but there is a warning about " non finite quantiles from probabilities, set to NA!" ... so I am guessing there is some sort of 'fix' already developed? but I'd still be interested to know the cause of the 'non finite quantiles'

zeileis · 2024-08-13T21:59:57Z

The reason is that the residuals are huge - about 8.4 to 10.5 - and are computed as quantile residuals. Due to numerical instabilities in the computation, these collapse to infinity here.

As your model is essentially a standard linear regression model, the quantile residuals from gamlss are very similar to the studentized residuals for lm (up to small differences in the kind of standardization):

m <- lm(pace1 ~ sex + age + adj_tot_time, data = sample_dat)
head(rstudent(m))
##           1           2           3           4           5           6 
## -0.55845162 -0.15507614  0.26540229 -0.73697464 -0.05537434  0.18201324 
head(residuals(g1_p1_adj))
##           1           2           3           4           5           6 
## -0.55848057 -0.15506416  0.26543035 -0.73701879 -0.05537774  0.18203188

So we can get a grasp of what the infinite residuals should be:

i <- which(!is.finite(residuals(g1_p1_adj)))
rstudent(m)[i]
##      1753      1807      2208      7926      8602      8912      9572 
## 10.470657  8.377379  9.146657  9.849901  9.875793 10.441811  8.599326

And the way that gamlss computes these residuals is essentially:

pnorm(9)
## [1] 1
qnorm(pnorm(9))
## [1] Inf

And this explains the warning in gamlss2. It detects that under the fitted model it is extremely unlikely to observe such extreme residuals and hence sets them to NA rather than Inf.

In this particular case we don't need any randomization in the quantile residuals and thus it would be possible to switch to the log-probability scale:

qnorm(pnorm(9, log.p = TRUE), log.p = TRUE)
## [1] 9

But I'm not sure how easy it would be to switch the computations inside the residuals() method to this. First, I'm not sure whether log.p is available for all distributions. Second, if randomization is needed I'm not sure that this can be done easily on the log-probability scale.

But maybe either Mikis @mstasinopoulos or Niki @freezenik have an idea to avoid this.

TenanATC · 2024-08-14T13:19:29Z

This all makes sense to me, and I appreciate the explanation. From a software perspective, it would probably be good to have either an analytical 'solution' (log.p, potentially) or an informative failure and Warning.

zeileis · 2024-08-14T22:30:17Z

In proresiduals() from topmodels, which had exactly the same problem, I've solved the issue now by employing the log.p workaround in case there are infinite quantile residuals (for continuous distributions). My code looks essentially like this:

  if (all(is_continuous(pd))) {
    res <- cdf(pd, y, elementwise = TRUE)
    if (type == "quantile") {
      res <- qnorm(res)
      ## try to catch infinite quantile residuals due to CDF = 0 or = 1
      if (length(ix <- which(!is.finite(res))) > 0L) {
        res[ix] <- qnorm(cdf(pd[ix], y[ix], elementwise = TRUE, log.p = TRUE), log.p = TRUE)
      }
    }
    ## [...]
    return(res)
  }

The reason that I'm only doing this in case of infinite residuals (rather than by default) is that I'm not sure that all cdf() methods support log.p = TRUE.

Mikis @mstasinopoulos and Niki @freezenik, if gamlss.dist always provides the log.p argument, you could also always rely on this.

mstasinopoulos · 2024-08-15T07:57:08Z

Hi Achim I am almost sure that all cdf’s in `gamlss.dist` have the log.p argument so we should use the p.log trick to calculated residuals Thanks Mikis

…

On 13 Aug 2024, at 23:00, Achim Zeileis ***@***.***> wrote: qnorm(pnorm(9, log.p = TRUE), log.p = TRUE)

freezenik · 2024-08-15T08:43:15Z

Thank you! I just added Achim's @zeileis code snipped and it seems to work.

zeileis · 2024-08-15T22:36:29Z

Thank you both for the follow-up! I had a quick look at both gamlss and gamlss2 but was confused how all the different computations exactly play together. Hence it would be great if you could both have another look at this.

First, I think that in the rqres() function in gamlss the line https://github.com/gamlss-dev/gamlss/blob/main/R/rqres.R#L19 could be changed to
```
rqres <- qnorm(cdf(q = y, ..., log.p = TRUE), log.p = TRUE)
```
if indeed all p... functions have a log.p argument.
Then this rqres() computation seems to be inherited in various places - but I didn't check whether changes are needed in other places as well.
Even gamlss2 seems to inherit this computation in https://github.com/gamlss-dev/gamlss2/blob/main/R/families.R#L526 which seems somewhat quick & dirty to me, given that gamlss2 is only a Suggests dependency.
It's good to have Niki's new fix in https://github.com/gamlss-dev/gamlss2/blob/main/R/residuals.R#L119-L121 as an additional fallback. But this is only applicable like this in the continuous case without randomization.

freezenik · 2024-08-16T07:01:46Z

Yes, you are right, this is a bit puzzling! The reason why I use rqres() in residuals.gamlss2() is because how some families are implemented in gamlss.dist, e.g., the implementation of the binomial denominator. With the "new" gamlss2 family setup, you don't need it. At the moment, I am not sure what the best solution is?!

mstasinopoulos · 2024-08-16T08:23:42Z

Dear Niki and Achim After our emails last yesterday I had a second thought about log.p in gamlss.dist. I am almost sure that its implementation for the `p ` function is probably correct I feel that its implementation in the `q ` function needs checking. Mind you this will not effect

rqres <- qnorm(cdf(q = y, ..., log.p = TRUE), log.p = TRUE)

Since `qnorm` is OK We need to think what is appropriate for `randomised` quantile residuals Those day I call quantile residuals the `z-scores` but I am not sure if we should adopt the name or not. The rqres() function is used also in other functions of `gamlss` related packages but it is used through the body() function i.e. body(rqres) <- eval(quote(body(rqres)), envir = getNamespace("gamlss”)) Which mean that we have to change `rqres()` only once Here is some example of the `qnorm(cdf(q = y, ..., log.p = TRUE), log.p = TRUE)` paradigm qnorm(pTF(q = 1000, log.p = TRUE), log.p = TRUE) [1] 10.61788

qnorm(pBCT(q = 1000, log.p = TRUE), log.p = TRUE)

[1] 5.154898

qnorm(pBCPE(q = 1000, log.p = TRUE), log.p = TRUE)

[1] Inf

qnorm(pBCCG(q = 1000, log.p = TRUE), log.p = TRUE)

[1] Inf

qnorm(pPO(q = 1000, log.p = TRUE), log.p = TRUE)

[1] Inf

qnorm(pSI(q = 1000, log.p = TRUE), log.p = TRUE)

[1] 7.679371 It looks that occasionally will get `Inf`. Is that OK for graphs? what happens to graphs if encounter `Inf`? NA usually is OK because are omitted. All this discussion about residuals remind me something else I would like for `gamlss2` which I think it will be very useful resid( , newdata) I think it will be very useful to have the residuals for out of bag data. Thanks Mikis

…

On 15 Aug 2024, at 23:36, Achim Zeileis ***@***.***> wrote: Thank you both for the follow-up! I had a quick look at both gamlss and gamlss2 but was confused how all the different computations exactly play together. Hence it would be great if you could both have another look at this. First, I think that in the rqres() function in gamlss the line https://github.com/gamlss-dev/gamlss/blob/main/R/rqres.R#L19 could be changed to rqres <- qnorm(cdf(q = y, ..., log.p = TRUE), log.p = TRUE) if indeed all p... functions have a log.p argument. Then this rqres() computation seems to be inherited in various places - but I didn't check whether changes are needed in other places as well. Even gamlss2 seems to inherit this computation in https://github.com/gamlss-dev/gamlss2/blob/main/R/families.R#L526 which seems somewhat quick & dirty to me, given that gamlss2 is only a Suggests dependency. It's good to have Niki's new fix in https://github.com/gamlss-dev/gamlss2/blob/main/R/residuals.R#L119-L121 as an additional fallback. But this is only applicable like this in the continuous case without randomization. — Reply to this email directly, view it on GitHub <#16 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE7QN7SARUPYHEMWGS27UQLZRUUQHAVCNFSM6AAAAABMI3NP5OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJSGQYDKOJTGQ>. You are receiving this because you were mentioned.

zeileis · 2024-08-16T23:31:27Z

All good points, thanks. Some further comments:

Some of your examples give Inf because internally log.p just computes log(p) rather than something numerically more stable. In these cases the log.p option does not help...but it does not hurt either.
For Poisson an observation of 1000 at mu = 1 is so extreme that not even log.p helps.
Thus, still catching Inf and issuing a warning might be a good idea.
Out-of-sample residuals are also a good idea. We've implemented that in topmodels, see proresiduals.

zeileis · 2024-08-16T23:37:26Z

Niki @freezenik let's talk in person about what the best steps are for gamlss2 when I'm back in Innsbruck (in the last week of August)!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

standardized residuals return 'inf' #16

standardized residuals return 'inf' #16

TenanATC commented Aug 9, 2024 •

edited

Loading

zeileis commented Aug 13, 2024

TenanATC commented Aug 14, 2024

zeileis commented Aug 14, 2024

mstasinopoulos commented Aug 15, 2024 via email

freezenik commented Aug 15, 2024

zeileis commented Aug 15, 2024

freezenik commented Aug 16, 2024

mstasinopoulos commented Aug 16, 2024 via email

zeileis commented Aug 16, 2024

zeileis commented Aug 16, 2024

standardized residuals return 'inf' #16

standardized residuals return 'inf' #16

Comments

TenanATC commented Aug 9, 2024 • edited Loading

zeileis commented Aug 13, 2024

TenanATC commented Aug 14, 2024

zeileis commented Aug 14, 2024

mstasinopoulos commented Aug 15, 2024 via email

freezenik commented Aug 15, 2024

zeileis commented Aug 15, 2024

freezenik commented Aug 16, 2024

mstasinopoulos commented Aug 16, 2024 via email

zeileis commented Aug 16, 2024

zeileis commented Aug 16, 2024

TenanATC commented Aug 9, 2024 •

edited

Loading