abs in centroid distance calculation #306
The vegan behaviour is documented (with an equation) in the documentation of `betadisper`.
@gavinsimpson may remember more details, but we had a long discussion about this behaviour before implementation. However, it seems to be so ancient that the commits predate version control.
@itcarroll thanks for the question & example. As @jarioksa mentions, we followed Marti Anderson's Biometrics paper closely, implementing the equations presented there. It has been quite some time since I looked at that paper, but I'll take a look next week and provide more detail if it jogs my memory; I'd suggest you also take a look at it, and if an issue remains we can revisit this. Part of the difficulty here is that Primer-e is not open source and I am unsure what they actually implement; I know there are some newer developments that Marti has included in Primer that I don't believe have been published, so it is further difficult to know what is being done with that software.
Thank you, @gavinsimpson, for offering to look further into the question. I am familiar with Anderson's 2006 paper, and I'm pretty sure @jarioksa is referring to equation 3. The equation does not include taking an absolute value before taking the square root. In fact, on page 248, Anderson writes:
I think the example I've given above is one "highly unlikely" case, and am interested to know whether you find it so too. It is unfortunate that Primer-e is not open source: while we cannot see what the code does, we can see the result. I've run this matrix through Primer-e, and the result is consistent with simply dropping the imaginary part in the centroid distances. My points 1 and 2 above address specific inconsistencies between the
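For readers without the paper at hand, the quantity under discussion is the distance from a point to its group centroid in the principal coordinate space, which combines real and imaginary axes roughly as (paraphrased notation, not a verbatim transcription of Anderson's equation 3):

$$z_{ij}^{c} \;=\; \sqrt{\,\Delta^{2}\!\big(u_{ij}^{+},\, c_i^{+}\big) \;-\; \Delta^{2}\!\big(u_{ij}^{-},\, c_i^{-}\big)}\,,$$

where the first term is the squared Euclidean distance to the centroid on the axes with positive eigenvalues and the second is the squared distance on the axes with negative eigenvalues. Note the minus sign, and the absence of any absolute value under the root.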
@gavinsimpson it seems that we talked about the negative eigenvalues in 2009, but my archive is very patchy. However, ignoring negative eigenvalues and only using the real eigenvectors is clearly wrong (@itcarroll). The question is only how to handle the negative eigenvalues. In all decent vegan functions we handle negative eigenvalues & complex eigenvectors.
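A minimal sketch of where such negative eigenvalues come from, using vegan's `wcmdscale` and the bundled `varespec` data (illustrative choice of data, not from this discussion):

```r
## Bray-Curtis is semimetric, so classical scaling of it typically
## yields some negative eigenvalues; eig = TRUE keeps them all.
library(vegan)
data(varespec)
ev <- wcmdscale(vegdist(varespec), eig = TRUE)$eig
tail(ev)  # the smallest eigenvalues are negative here
```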
Not totally sure yet that the specific concern I'm raising is understood, so to be clear: I do not suggest ignoring components associated with negative eigenvalues. I'm interested in the case where the squared distance on the imaginary axes exceeds the squared distance on the real axes, so that the squared distance to the centroid is negative.
@itcarroll: things come back to my mind by and by. We really had a discussion on exactly this issue (negative distances larger than positive ones), but that was back in 2008, which is older than my email archive and older than having vegan in version control. However, there is an old ChangeLog entry.
I am quite sure that @gavinsimpson did the background work here (and I think all the work), and we also had a longer discussion about this issue, but I can't find any email from 2008 on the subject to see what kind of background work we did. I think Gavin consulted Marti, but I really cannot give any supporting evidence.
The Primer window banner seems to refer to this analysis as PERMDISP1, whereas our ChangeLog entry was about PERMDISP2 compatibility. I have no idea what these alternative PERMDISPs are, but how do their outputs compare? (I don't have a Primer licence nor a Windows computer.)
Good question. Best I can tell, the 1 corresponds to "Data1" and "Resem1", which is just the default name given to windows in this workspace. If I load up a second workspace I get "Data2" and "PERMDISP2". Lord help us if that changes the underlying algorithm. I cannot find a reference in the software manual to the different versions of PERMDISP.
A fourth option is to return a complex valued "distance". An argument for doing so is that the result appears to agree with what Anderson calls Huygens' theorem as applied to Euclidean distances. An argument against doing so is that it's uninterpretable.
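For reference, Huygens' theorem here is the identity (my paraphrase, for $n$ points with centroid $\bar{x}$):

$$\sum_{j=1}^{n} d^{2}(x_j, \bar{x}) \;=\; \frac{1}{n} \sum_{j<k} d^{2}(x_j, x_k),$$

i.e. the sum of squared distances to the centroid equals the sum of squared inter-point distances divided by the number of points. The complex valued convention preserves this identity even when some squared distances go negative, which is the sense in which it "appears to agree".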
@itcarroll: I had a look at the issue yesterday, and after sleeping on it I think you are right: the distance should be complex valued. This is indeed unfortunate. In general, we want methods to help people understand their data, and returning imaginary distances is not very helpful. You already have an n-dimensional space, but then a point finds a wormhole in that space and shoots into imaginary space. In the one-class case, the sum of squared distances should be equal to the sum of eigenvalues:

```r
> x
     [,1] [,2]
[1,]    0    2
[2,]    1    2
[3,]    1    0
> m <- betadisper(vegdist(x), rep("a", 3), type = "centr")
> m$eig
      PCoA1       PCoA2
 0.50637605 -0.07637605
> sum(m$eig)
[1] 0.43
> m$distances  # 2nd distance should be imaginary
[1] 0.4509250 0.2160247 0.5228129
> sum(m$distances^2)
[1] 0.5233333
> sum(m$distances[c(1, 3)]^2) - m$distances[2]^2  # 2nd square should be negative
[1] 0.43
```

The absolute value of that imaginary component that we report is correct in magnitude, but we give no indication that the value should be complex, and this is not correct. Giving that value as zero is incorrect as well. I think that something must be changed in `betadisper`.
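A quick cross-check, using the `x` and `m` from the session above: with signed squared distances, Huygens' theorem holds exactly here, since the sum of eigenvalues equals the sum of squared pairwise dissimilarities divided by n.

```r
> sum(vegdist(x)^2) / nrow(x)  # = (0.2^2 + 1.0^2 + 0.5^2) / 3
[1] 0.43
> sum(m$eig)
[1] 0.43
```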
While hoping to hear better ideas from Gavin, I've been wondering ... Is there any hope that the usual test statistics still work? Set up the problem with a second (well-behaved) group:

```r
library(vegan)
library(magrittr)  ## for %>%

df <- read.csv(text = '
Group, SpA, SpB
1, 0, 2
1, 1, 2
1, 1, 0
2, 0, 2
2, 1, 2')
dst <- vegdist(df[, -1])
dsp <- betadisper(dst, df$Group, 'centroid')
dsp$distances[[2]] <- dsp$distances[[2]] * 1i  ## the "correction"
```

Calculate the usual one-way ANOVA F statistic, but with signed squared lengths (real part squared minus imaginary part squared):

```r
## group means (replicated to length y)
y <- dsp[['distances']]
n <- aggregate(y, dsp['group'], length)[[2]]
y.. <- mean(y) %>% mapply(rep, ., n) %>% unlist
yk. <- aggregate(y, dsp['group'], mean)[[2]] %>% mapply(rep, ., n) %>% unlist
## explained
mdl <- yk. - y..
XSS <- sum(Re(mdl)^2 - Im(mdl)^2)
MXSS <- XSS / (length(n) - 1)
## residual
rsd <- y - yk.
RSS <- sum(Re(rsd)^2 - Im(rsd)^2)
MRSS <- RSS / (sum(n) - length(n))
```

Could the mean sum of squares ratio (here `MXSS / MRSS`) be the statistic we need?

*uh oh. this kind of looks like https://en.wikipedia.org/wiki/Minkowski_space ...*
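For what it's worth, the resemblance is precise in one respect: the signed squared length used for `XSS` and `RSS` above is the indefinite quadratic form

$$\lVert z \rVert^{2} = \operatorname{Re}(z)^{2} - \operatorname{Im}(z)^{2},$$

a pseudo-Euclidean (Minkowski-signature) norm, in contrast to the ordinary complex modulus $\operatorname{Mod}(z)^{2} = \operatorname{Re}(z)^{2} + \operatorname{Im}(z)^{2}$.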
It is obvious that something must be changed. The options:

- The easiest way is to use zero (0), probably with a warning about negative squared distances.
- The current practice of taking the absolute value is misleading, since it inflates the within-group distances.
- Complex valued output makes sense, but leaves people in trouble, since they would need to handle complex vectors. Currently the support functions do not handle them, and they would all have to be re-written.

The complex distances arise when we have negative squared distances. If we can work with squared distances, then we avoid the explicit problem. As @itcarroll demonstrates above, some of the functions (parametric ANOVA, for instance) can be expressed in terms of squared distances. My suggestion is to first change negative squared distances to zero distances, and then at a second stage consider whether we should change some support functions to work with negative squared distances.
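A minimal sketch of the first stage of that suggestion (a hypothetical helper, not vegan's actual code): truncate negative squared distances to zero, with a warning.

```r
## Hypothetical helper: distances from (possibly negative) squared distances.
## Negative squared distances are set to zero, with a warning.
dist_from_squared <- function(sq) {
  if (any(sq < 0)) {
    warning("some squared distances to centroid were negative; using 0")
    sq[sq < 0] <- 0
  }
  sqrt(sq)
}

dist_from_squared(c(0.25, -0.09, 0.04))  # 0.5 0.0 0.2, with a warning
```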
Some squared distances to centroids can be negative, giving complex valued distances. We used to take the Mod of these complex values (or the sqrt of absolute differences), which replaces these "more similar than identical" distances with a positive value, inflating estimates of within-group distances. Now we take only the real part (i.e., zero) for these distances. This is still biased, but less so than the old practice.
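To make the before/after concrete (illustrative numbers, not from the commit):

```r
sq <- c(0.25, -0.09)        # squared distances to centroid, one negative
z  <- sqrt(as.complex(sq))  # 0.5+0i and 0+0.3i
Mod(z)  # old behaviour: 0.5 0.3 -- the negative case inflates the distance
Re(z)   # new behaviour: 0.5 0.0 -- the negative case becomes zero
```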
I pushed branch issue-#306 to GitHub. Please work on this if you want to develop the approach I took. If you want something radically different, publish a new alternative branch. Nothing is merged yet.
I think your approach is a good plan. My only suggestion is to consider issuing a warning when negative squared distances are encountered.
My suggestion already issues a warning on negative squared distances. @itcarroll, do you suggest I should reconsider this and remove the warning? (I do know – life teaches you – that this warning will cause confusion and many email messages to me.)
Sorry @jarioksa, I had not looked at your branch. Carry on (and sorry for the emails)!
(cherry picked from commit 48b616e)
My colleague @mavolio and I have noticed that the betadisper function can give a result that differs from Primer-e. This situation arises when the input distance matrix violates the triangle inequality, for example:
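(The matrix itself is reconstructed here from the distances quoted below, d(1,2) = 0.2, d(2,3) = 0.5, d(1,3) = 1.0; the original input may have differed in form.)

```r
library(vegan)
## 1.0 > 0.2 + 0.5, so the triangle inequality is violated
d <- as.dist(matrix(c(0.0, 0.2, 1.0,
                      0.2, 0.0, 0.5,
                      1.0, 0.5, 0.0), nrow = 3))
betadisper(d, rep("a", 3), type = "centroid")
```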
The "distance" from 1 to 3 is 1.0, but it's shorter to go the 0.2 to 2 and then 0.5 to 3. The distances to the group centroid that
betadisper
returns are all real values, due to theabs
call in L130.Without that
abs
, the result would appear to be meaningless because it is complex valued:It's not clear to us that a correct resolution of this situation is described in the literature anywhere. We think the result returned by betadisper may be faulty for two reasons:
The second point arises from our reading of page 17 of the Primer-e manual. The sum of squares of the complex array above (0.43+0i) does pass that test, so despite being complex it may be the correct answer. We currently plan, in an upcoming patch to the NCEAS/codyn package, to return NA for centroid distances that fall into this category, because we don't feel the topic has passed through peer review. How did the vegan developers settle on the current approach of taking an absolute value?
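A minimal sketch of that NA convention (hypothetical names, not the actual codyn patch):

```r
## Signed squared distances to centroid: real part squared minus
## imaginary part squared; flag the negative ones as NA.
z  <- c(0.45 + 0i, 0 + 0.29i, 0.52 + 0i)  # illustrative complex "distances"
d2 <- Re(z)^2 - Im(z)^2
out <- rep(NA_real_, length(d2))
ok  <- d2 >= 0
out[ok] <- sqrt(d2[ok])
out  # NA where the squared distance was negative
```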