
Faster version and a question #46

Open
GabrielHoffman opened this issue Jun 4, 2021 · 5 comments

@GabrielHoffman
Contributor

GabrielHoffman commented Jun 4, 2021

Hi Pierre,
Thanks for developing this package. I have found it really useful for clustering epigenetic and genetic data.

I am currently applying it to large-scale SNP data. I had an issue with computational time and memory usage, so I removed some bottlenecks that come up with large-scale data. Check out the fork here: https://gabrielhoffman.github.io/adjClustFast/index.html

I'd be happy to contribute this to the main branch if you are interested.

In addition, I have a question about creating discrete clusters from the hierarchical clustering. I have found that the current selection methods can perform quite poorly, even on simulated data with 4 true clusters (example below). Is there a way to:

  • interpret the tree height, or perhaps the average dispersion, to select a natural cutoff
  • apply this cutoff using cutree_chac

Since I am dealing with linkage disequilibrium between SNPs measured by r^2, I'd like to set a cutoff C so that two clusters are split only if the average r^2 between them is < C. But any intuitive criterion would be great.

Cheers,
Gabriel

library(adjclust)
library(Matrix)

# Create correlation matrix with autocorrelation
autocorr.mat <- function(p = 100, rho = 0.9) {
    mat <- diag(p)
    return(rho^abs(row(mat)-col(mat)))
}

set.seed(1)
p <- 500          # number of SNPs per block
n_blocks <- 2^2   # number of LD blocks

# create a block-diagonal correlation matrix with n_blocks LD blocks
Sigma <- autocorr.mat(p, 0.95)
for (i in 1:log2(n_blocks)) {
    Sigma <- bdiag(Sigma, Sigma)
}

# Run adjacency-constrained clustering
hcl <- adjClust(Sigma, "similarity", h = 500)

# create discrete clusters
cl1 <- select(hcl, type = "capushe")
cl2 <- select(hcl, type = "bstick")

# total number of clusters created by each method
max(cl1)
max(cl2)

# for example, using a tree height of ~0.9 is reasonable,
# but mode = "average-disp" normalizes the heights to have a max of 1
plot(hcl, mode = "average-disp")

# create 4 clusters
cl3 <- cutree_chac(hcl, k = 4)

# But in general it is not clear how to interpret the *height* of the tree,
# or how to use cutree_chac() with average-disp heights.

# plot LD and clusters
plotSim(as.matrix(Sigma), clustering = cl3, dendro = hcl)
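One way to make the r^2 cutoff concrete, given the adjacency constraint (clusters are contiguous, so only neighbouring pairs of clusters need checking), is a minimal sketch along these lines. The names `adjacent_between_sim` and `select_k_by_cutoff` are hypothetical, not part of the adjclust API, and `cutree_chac()` requires the adjclust package to be loaded:

```r
# Hypothetical helper: average similarity between each pair of adjacent
# clusters. Under the adjacency constraint clusters are contiguous and
# labeled in positional order, so only pairs (k, k+1) need checking.
adjacent_between_sim <- function(S, labels) {
  ks <- sort(unique(labels))
  sapply(ks[-length(ks)], function(k) {
    mean(S[labels == k, labels == k + 1])
  })
}

# Hypothetical selection rule: the largest k such that every pair of
# adjacent clusters has average similarity below the cutoff C, i.e. two
# clusters stay split only when the average r^2 between them is < C.
# cutree_chac() comes from the adjclust package.
select_k_by_cutoff <- function(hcl, S, C, k_max = 10) {
  for (k in k_max:2) {
    labels <- cutree_chac(hcl, k = k)
    if (all(adjacent_between_sim(S, labels) < C)) {
      return(labels)
    }
  }
  rep(1L, ncol(S))  # fall back to a single cluster
}

# e.g. cl <- select_k_by_cutoff(hcl, as.matrix(Sigma), C = 0.1)
```

This is only a sketch of the criterion described above, not a tested method; the scan from k_max downward returns the finest partition whose adjacent clusters are all below the cutoff.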
@pneuvial
Owner

pneuvial commented Jun 6, 2021

Thanks for all this Gabriel! We'll look into it asap.

@pneuvial
Owner

Regarding the first part (speed improvements by passing matL and matR to C++), could you make a pull request on the develop branch? We will be happy to integrate these changes after some tests on our side.

@GabrielHoffman
Contributor Author

Yes. Hopefully I'll have some time next week.

@pneuvial
Owner

pneuvial commented Jun 11, 2021 via email

@tuxette
Collaborator

tuxette commented Sep 15, 2022

Implement a criterion that would:

  • compute (or retrieve) the average similarity of each cluster at every level of the hierarchy
  • keep the largest number of clusters such that all average similarities are above C (user-defined)
    (check if that makes sense)
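The steps above can be sketched in R as follows (a hedged sketch; `within_cluster_sims` and `cut_by_within_sim` are hypothetical names, not adjclust API). One caveat, perhaps what the "(check if that makes sense)" aside anticipates: within-cluster average similarity typically only increases as clusters are split further, so scanning for the *smallest* number of clusters meeting the threshold gives a non-trivial rule:

```r
# Hypothetical helper: average similarity within each cluster.
within_cluster_sims <- function(S, labels) {
  sapply(sort(unique(labels)), function(k) {
    idx <- which(labels == k)
    mean(S[idx, idx, drop = FALSE])
  })
}

# Hypothetical rule: the smallest number of clusters k such that every
# cluster's average within-similarity exceeds the user-defined cutoff C.
# cutree_chac() comes from the adjclust package.
cut_by_within_sim <- function(hcl, S, C, k_max = 20) {
  for (k in 1:k_max) {
    labels <- if (k == 1) rep(1L, ncol(S)) else cutree_chac(hcl, k = k)
    if (all(within_cluster_sims(S, labels) > C)) {
      return(labels)
    }
  }
  cutree_chac(hcl, k = k_max)  # no coarser partition met the threshold
}
```

This treats C as a cohesion floor: clusters are merged as far as possible while every cluster's average similarity stays above C.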

@tuxette tuxette added this to the now milestone Sep 15, 2022