standardized residuals return 'inf' #16

Open
TenanATC opened this issue Aug 9, 2024 · 10 comments

@TenanATC

TenanATC commented Aug 9, 2024

sample_dat.csv

I can't tell what is going on here: with the attached sample data and a very basic linear model (as well as more complex models not listed here), the standardized residuals come back as 'Inf' for some observations. As best I can determine, these are exactly the observations whose raw residuals exceed 0.5191506. Is this a software bug, or some sort of expected behavior that I've been unable to diagnose?

Here is the basic model:
g1_p1_adj <- gamlss(pace1 ~ sex + age + adj_tot_time, data = sample_dat, family = 'NO')

sum(is.infinite(resid(g1_p1_adj)))

This causes errors in both the worm plots and the plot.gamlss() method.

*Edit: note that this issue does not occur in the development version, gamlss2, but there is a warning about "non finite quantiles from probabilities, set to NA!", so I am guessing some sort of fix has already been developed? I'd still be interested to know the cause of the non-finite quantiles.

@zeileis
Member

zeileis commented Aug 13, 2024

The reason is that the affected residuals are huge (about 8.4 to 10.5) and are computed as quantile residuals. In double-precision arithmetic the corresponding probabilities round to exactly 1, so the quantile transformation collapses to infinity here.

As your model is essentially a standard linear regression model, the quantile residuals from gamlss are very similar to the studentized residuals for lm (up to small differences in the kind of standardization):

m <- lm(pace1 ~ sex + age + adj_tot_time, data = sample_dat)
head(rstudent(m))
##           1           2           3           4           5           6 
## -0.55845162 -0.15507614  0.26540229 -0.73697464 -0.05537434  0.18201324 
head(residuals(g1_p1_adj))
##           1           2           3           4           5           6 
## -0.55848057 -0.15506416  0.26543035 -0.73701879 -0.05537774  0.18203188 

So we can get a grasp of what the infinite residuals should be:

i <- which(!is.finite(residuals(g1_p1_adj)))
rstudent(m)[i]
##      1753      1807      2208      7926      8602      8912      9572 
## 10.470657  8.377379  9.146657  9.849901  9.875793 10.441811  8.599326 

And the way that gamlss computes these residuals is essentially:

pnorm(9)
## [1] 1
qnorm(pnorm(9))
## [1] Inf

And this explains the warning in gamlss2. It detects that under the fitted model it is extremely unlikely to observe such extreme residuals and hence sets them to NA rather than Inf.
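To make the rounding explicit, here is a small sketch in base R. No gamlss functions are involved, and the variable names `upper` and `half_gap` are only for illustration. It compares the upper-tail probability at 9 with the spacing of doubles just below 1.

```r
# Upper-tail probability at z = 9, computed directly so it keeps its precision
upper <- pnorm(9, lower.tail = FALSE)

# Doubles just below 1 are spaced .Machine$double.neg.eps apart,
# so a value closer to 1 than half that spacing rounds to exactly 1
half_gap <- .Machine$double.neg.eps / 2

upper < half_gap   # TRUE: 1 - upper is not representable as a double
pnorm(9) == 1      # TRUE: the lower-tail CDF has saturated at 1
qnorm(pnorm(9))    # Inf, because qnorm(1) is Inf
```

The information about how far out in the tail the observation sits is lost as soon as the CDF is stored as a lower-tail probability on the ordinary scale.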

In this particular case we don't need any randomization in the quantile residuals and thus it would be possible to switch to the log-probability scale:

qnorm(pnorm(9, log.p = TRUE), log.p = TRUE)
## [1] 9

But I'm not sure how easy it would be to switch the computations inside the residuals() method to this. First, I'm not sure whether log.p is available for all distributions. Second, if randomization is needed I'm not sure that this can be done easily on the log-probability scale.
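As a rough check of the log-probability idea on the values from this thread, the sketch below pushes the seven studentized residuals reported above through both routes. It uses only base R `pnorm()`/`qnorm()` and assumes a standard normal reference; the vector `z` is copied from the `rstudent()` output, not produced by gamlss.

```r
# Studentized residuals of the seven affected observations (from rstudent(m)[i])
z <- c(10.470657, 8.377379, 9.146657, 9.849901, 9.875793, 10.441811, 8.599326)

# Ordinary probability scale: the CDF rounds to 1, so every value becomes Inf
naive <- qnorm(pnorm(z))

# Log-probability scale: the tail information survives the round trip
logp <- qnorm(pnorm(z, log.p = TRUE), log.p = TRUE)

all(is.infinite(naive))
all.equal(logp, z, tolerance = 1e-6)
```

For the upper tail only, `-qnorm(pnorm(z, lower.tail = FALSE))` is another base-R route that avoids the saturation, but it does not address the randomization question raised above either.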

But maybe either Mikis @mstasinopoulos or Niki @freezenik have an idea to avoid this.

@TenanATC
Author

This all makes sense to me, and I appreciate the explanation. From a software perspective, it would probably be good to have either an analytical 'solution' (log.p, potentially) or an informative failure and Warning.

@zeileis
Member

zeileis commented Aug 14, 2024

In proresiduals() from topmodels, which had exactly the same problem, I've solved the issue now by employing the log.p workaround in case there are infinite quantile residuals (for continuous distributions). My code looks essentially like this:

  if (all(is_continuous(pd))) {
    res <- cdf(pd, y, elementwise = TRUE)
    if (type == "quantile") {
      res <- qnorm(res)
      ## try to catch infinite quantile residuals due to CDF = 0 or = 1
      if (length(ix <- which(!is.finite(res))) > 0L) {
        res[ix] <- qnorm(cdf(pd[ix], y[ix], elementwise = TRUE, log.p = TRUE), log.p = TRUE)
      }
    }
    ## [...]
    return(res)
  }

The reason that I'm only doing this in case of infinite residuals (rather than by default) is that I'm not sure that all cdf() methods support log.p = TRUE.
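The same catch-and-recompute pattern can be sketched without topmodels, using only base R and a normal distribution. The function name `qres_normal` and its arguments are hypothetical and chosen for illustration; this is not the gamlss, gamlss2, or topmodels implementation.

```r
# Quantile residuals for a normal model with a log.p fallback (illustrative only)
qres_normal <- function(y, mu = 0, sigma = 1) {
  n <- length(y)
  mu <- rep_len(mu, n)
  sigma <- rep_len(sigma, n)

  # First attempt on the ordinary probability scale
  res <- qnorm(pnorm(y, mean = mu, sd = sigma))

  # Recompute only the non-finite entries on the log-probability scale
  ix <- which(!is.finite(res))
  if (length(ix) > 0L) {
    res[ix] <- qnorm(
      pnorm(y[ix], mean = mu[ix], sd = sigma[ix], log.p = TRUE),
      log.p = TRUE
    )
  }
  res
}

r <- qres_normal(c(-2, 0.5, 9, 10.5))
r
```

Restricting the fallback to the non-finite entries mirrors the reasoning above: distributions whose CDF lacks `log.p` would still work for all well-behaved observations.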

Mikis @mstasinopoulos and Niki @freezenik, if gamlss.dist always provides the log.p argument, you could also always rely on this.

@mstasinopoulos
Collaborator

mstasinopoulos commented Aug 15, 2024 via email

@freezenik
Member

Thank you! I just added Achim's @zeileis code snippet and it seems to work.

@zeileis
Member

zeileis commented Aug 15, 2024

Thank you both for the follow-up! I had a quick look at both gamlss and gamlss2 but was confused about how exactly all the different computations play together. Hence it would be great if you could both have another look at this.

@freezenik
Member

Yes, you are right, this is a bit puzzling! The reason why I use rqres() in residuals.gamlss2() is because of how some families are implemented in gamlss.dist, e.g., the implementation of the binomial denominator. With the "new" gamlss2 family setup, you don't need it. At the moment, I am not sure what the best solution is.

@mstasinopoulos
Collaborator

mstasinopoulos commented Aug 16, 2024 via email

@zeileis
Member

zeileis commented Aug 16, 2024

All good points, thanks. Some further comments:

  • Some of your examples give Inf because internally log.p just computes log(p) rather than something numerically more stable. In these cases the log.p option does not help...but it does not hurt either.
  • For Poisson an observation of 1000 at mu = 1 is so extreme that not even log.p helps.
  • Thus, still catching Inf and issuing a warning might be a good idea.
  • Out-of-sample residuals are also a good idea. We've implemented that in topmodels, see proresiduals.
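The "catch Inf and warn" suggestion from the list above could look roughly like the following sketch. The helper `guard_residuals` is a hypothetical name, not a function from gamlss, gamlss2, or topmodels; it replaces non-finite residuals with NA and reports how many were affected. The last two lines revisit the Poisson case mentioned above in a simplified, non-randomized form, where the result is expected to remain non-finite even with `log.p = TRUE`.

```r
# Replace non-finite quantile residuals by NA and warn about it (illustrative only)
guard_residuals <- function(res) {
  bad <- !is.finite(res)  # TRUE for Inf, -Inf, NaN, and NA
  if (any(bad)) {
    warning(
      sprintf("%d non-finite quantile residual(s) set to NA", sum(bad)),
      call. = FALSE
    )
    res[bad] <- NA_real_
  }
  res
}

# Normal case: 9 saturates on the probability scale and is flagged
out <- suppressWarnings(guard_residuals(qnorm(pnorm(c(0.1, 9)))))
out

# Poisson case from the list above: observation 1000 at mu = 1, log scale
p_log <- ppois(1000, lambda = 1, log.p = TRUE)
qnorm(p_log, log.p = TRUE)
```

Returning NA with a warning keeps downstream plots working while still telling the user that some observations are extremely unlikely under the fitted model.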

@zeileis
Member

zeileis commented Aug 16, 2024

Niki @freezenik let's talk in person about what the best steps are for gamlss2 when I'm back in Innsbruck (in the last week of August)!
