
Faster version and a question #46

Open
GabrielHoffman opened this issue Jun 4, 2021 · 5 comments

@GabrielHoffman
Contributor

GabrielHoffman commented Jun 4, 2021

Hi Pierre,
Thanks for developing this package. I have found it really useful for clustering epigenetic and genetic data.

I am currently applying it to large-scale SNP data. I had an issue with computational time and memory usage, so I removed some bottlenecks that come up with large-scale data. Check out the fork here: https://gabrielhoffman.github.io/adjClustFast/index.html

I'd be happy to contribute this to the main branch if you are interested.

In addition, I have a question about creating discrete clusters from the hierarchical clustering. I have found that the current selection methods can perform quite poorly, even on simulated data with 4 true clusters (example below). Is there a way to:

  • interpret the tree height, or perhaps the average dispersion, to select a natural cutoff
  • apply this cutoff using cutree_chac

Since I am dealing with linkage disequilibrium between SNPs measured by r^2, I'd like to set a cutoff C so that two clusters are split only if the average r^2 between them is < C. But any intuitive criterion would be great.

Cheers,
Gabriel

library(adjclust)
library(Matrix)

# Create correlation matrix with autocorrelation
autocorr.mat <- function(p = 100, rho = 0.9) {
    mat <- diag(p)
    return(rho^abs(row(mat)-col(mat)))
}

set.seed(1)
p <- 500          # number of SNPs per block
n_blocks <- 2^2   # number of LD blocks

# create a block-diagonal correlation matrix with n_blocks LD blocks
Sigma <- autocorr.mat(p, 0.95)
for (i in 1:log2(n_blocks)) {
    Sigma <- bdiag(Sigma, Sigma)
}

# Run adjacency-constrained clustering
hcl <- adjClust(Sigma, "similarity", h = 500)

# create discrete clusters
cl1 <- select(hcl, type = "capushe")
cl2 <- select(hcl, type = "bstick")

# total number of clusters created by each method
max(cl1)
max(cl2)

# for example, using a tree height of ~0.9 is reasonable,
# but mode = "average-disp" normalizes the heights to have a max of 1
plot(hcl, mode = "average-disp")

# create 4 clusters
cl3 <- cutree_chac(hcl, k = 4)

# But in general it is not clear how to interpret the *height* of the tree,
# or how to use cutree_chac() with average-disp heights.

# plot LD and clusters
plotSim(as.matrix(Sigma), clustering = cl3, dendro = hcl)
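One way to make the r^2 cutoff concrete, given the adjacency constraint (clusters are contiguous, so only neighbouring pairs of clusters need checking), is a minimal sketch along these lines. The names `adjacent_between_sim` and `select_k_by_cutoff` are hypothetical, not part of the adjclust API, and `cutree_chac()` requires the adjclust package to be loaded:

```r
# Hypothetical helper: average similarity between each pair of adjacent
# clusters. Under the adjacency constraint clusters are contiguous and
# labeled in positional order, so only pairs (k, k+1) need checking.
adjacent_between_sim <- function(S, labels) {
  ks <- sort(unique(labels))
  sapply(ks[-length(ks)], function(k) {
    mean(S[labels == k, labels == k + 1])
  })
}

# Hypothetical selection rule: the largest k such that every pair of
# adjacent clusters has average similarity below the cutoff C, i.e. two
# clusters stay split only when the average r^2 between them is < C.
# cutree_chac() comes from the adjclust package.
select_k_by_cutoff <- function(hcl, S, C, k_max = 10) {
  for (k in k_max:2) {
    labels <- cutree_chac(hcl, k = k)
    if (all(adjacent_between_sim(S, labels) < C)) {
      return(labels)
    }
  }
  rep(1L, ncol(S))  # fall back to a single cluster
}

# e.g. cl <- select_k_by_cutoff(hcl, as.matrix(Sigma), C = 0.1)
```

This is only a sketch of the criterion described above, not a tested method; the scan from k_max downward returns the finest partition whose adjacent clusters are all below the cutoff.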
@pneuvial
Owner

pneuvial commented Jun 6, 2021

Thanks for all this Gabriel! We'll look into it asap.

@pneuvial
Owner

Regarding the first part (speed improvements by passing matL and matR to C++), could you make a pull request on the develop branch? We will be happy to integrate these changes after some tests on our side.

@GabrielHoffman
Contributor Author

Yes. Hopefully I'll have some time next week.

@pneuvial
Owner

pneuvial commented Jun 11, 2021 via email

@tuxette
Collaborator

tuxette commented Sep 15, 2022

Implement a criterion that would:

  • compute (or retrieve) the average similarity of each cluster at every level of the hierarchy
  • keep the largest number of clusters such that all average similarities are above C (user-defined)
    (check if that makes sense)
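The steps above can be sketched in R as follows (a hedged sketch; `within_cluster_sims` and `cut_by_within_sim` are hypothetical names, not adjclust API). One caveat, perhaps what the "(check if that makes sense)" aside anticipates: within-cluster average similarity typically only increases as clusters are split further, so scanning for the *smallest* number of clusters meeting the threshold gives a non-trivial rule:

```r
# Hypothetical helper: average similarity within each cluster.
within_cluster_sims <- function(S, labels) {
  sapply(sort(unique(labels)), function(k) {
    idx <- which(labels == k)
    mean(S[idx, idx, drop = FALSE])
  })
}

# Hypothetical rule: the smallest number of clusters k such that every
# cluster's average within-similarity exceeds the user-defined cutoff C.
# cutree_chac() comes from the adjclust package.
cut_by_within_sim <- function(hcl, S, C, k_max = 20) {
  for (k in 1:k_max) {
    labels <- if (k == 1) rep(1L, ncol(S)) else cutree_chac(hcl, k = k)
    if (all(within_cluster_sims(S, labels) > C)) {
      return(labels)
    }
  }
  cutree_chac(hcl, k = k_max)  # no coarser partition met the threshold
}
```

This treats C as a cohesion floor: clusters are merged as far as possible while every cluster's average similarity stays above C.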

@tuxette tuxette added this to the now milestone Sep 15, 2022