Faster version and a question #46
Thanks for all this, Gabriel! We'll look into it asap.
Regarding the first part (speed improvements by passing matL and matR to C++), could you make a pull request on the develop branch? We will be happy to integrate these changes after some tests on our side.
Yes. Hopefully I'll have some time next week.
Great!
For the second part (choosing the number of clusters): in your
simulation the within-block dependency is AR(1) (with parameter 0.9).
Indeed, the model selection methods currently implemented in adjclust
fail in this scenario. I don't know whether this scenario is a good
proxy for LD (r^2) similarity. In any case, it would of course be nice
to have a method for cutting the tree that performs satisfactorily in
this simple scenario.
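For readers who want to picture the scenario, here is a minimal sketch of a block-diagonal similarity matrix with AR(1) dependency inside each block. It is not the simulation code from the original post (which is not shown in this thread): the function name `ar1_block_similarity` and the block size of 25 are hypothetical, and only the 4 blocks and the parameter 0.9 come from the discussion.

```python
import numpy as np

def ar1_block_similarity(n_blocks=4, block_size=25, rho=0.9):
    """Block-diagonal similarity matrix with AR(1) correlation in each block.

    block_size is an arbitrary illustrative value, not taken from the thread.
    """
    idx = np.arange(block_size)
    # AR(1) correlation between positions i and j is rho ** |i - j|
    block = rho ** np.abs(idx[:, None] - idx[None, :])
    p = n_blocks * block_size
    sim = np.zeros((p, p))
    for b in range(n_blocks):
        s = slice(b * block_size, (b + 1) * block_size)
        sim[s, s] = block  # similarity between different blocks stays at 0
    return sim

sim = ar1_block_similarity()
```

With these settings the similarity between neighbouring positions in a block is 0.9, while it decays geometrically with distance inside the block and is exactly 0 across blocks.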
Another method we considered a while ago
(https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0556-6)
is the Gap statistic
(https://web.stanford.edu/%7Ehastie/Papers/gap.pdf). I don't know whether
it would work well here, and it is computationally intensive, as it
requires performing the clustering on a large number of perturbed data
sets.
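For reference, the Gap statistic as defined in the linked paper by Tibshirani, Walther and Hastie can be written as follows. This is a summary of the general definition, not of anything implemented in adjclust; here $W_k$ is the pooled within-cluster dispersion for $k$ clusters, $B$ is the number of reference data sets, and $\mathrm{sd}_k$ is the standard deviation of $\log W_k^{*}$ over those reference data sets.

```latex
\mathrm{Gap}_n(k) = \mathbb{E}^{*}_n\left[\log W_k\right] - \log W_k,
\qquad
\hat{k} = \min\left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\},
\qquad
s_k = \mathrm{sd}_k \sqrt{1 + 1/B}
```

The expectation $\mathbb{E}^{*}_n$ is taken under a reference (null) distribution and is estimated by averaging $\log W_k^{*}$ over the $B$ reference data sets, each of which has to be clustered separately. That is where the computational cost mentioned above comes from.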
We will try to look at your suggestion to perform splitting based on
the average similarity between clusters over the summer, but we'll
need some time for this. Your suggestions are of course welcome.
Note (originally in French): implement a criterion that
Hi Pierre,
Thanks for developing this package. I have found it really useful for clustering epigenetic and genetic data.
I am currently applying it to large-scale SNP data. I had an issue with computational time and memory usage, so I removed some bottlenecks that come up with large-scale data. Check out the fork here: https://gabrielhoffman.github.io/adjClustFast/index.html
I'd be happy to contribute this to the main branch if you are interested.
In addition, I have a question about creating discrete clusters from the hierarchical clusters. I have found that current methods can perform pretty poorly, even on simulated data with 4 true clusters below. Is there a way to:
cutree_chac
Since I am dealing with linkage disequilibrium between SNPs measured by r^2, I'd like to set a cutoff C so that two clusters are split if the average r^2 between the two clusters is < C. But any intuitive criterion would be great.
Cheers,
Gabriel
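To make the cutoff idea concrete, here is a minimal sketch in Python of one way such a rule could work. It is not part of adjclust and it does not cut the tree returned by the package: it swaps in a simple greedy, adjacency-constrained average-linkage merge that stops once no pair of neighbouring clusters has an average similarity of at least C. The function name `cut_by_average_similarity` and the toy matrix are hypothetical.

```python
import numpy as np

def cut_by_average_similarity(sim, cutoff):
    """Greedy adjacency-constrained merging with an average-similarity stop rule.

    sim    : symmetric (p x p) similarity matrix, e.g. pairwise r^2
    cutoff : the threshold C; merging stops when every pair of neighbouring
             clusters has average between-cluster similarity below it
    Returns a list of (start, end) index ranges, with end exclusive.
    """
    sim = np.asarray(sim, dtype=float)
    p = sim.shape[0]
    clusters = [(i, i + 1) for i in range(p)]  # start from singletons
    while len(clusters) > 1:
        # average similarity between each pair of neighbouring clusters
        avg = [sim[a0:a1, b0:b1].mean()
               for (a0, a1), (b0, b1) in zip(clusters[:-1], clusters[1:])]
        k = int(np.argmax(avg))
        if avg[k] < cutoff:
            break  # all neighbouring pairs are below C: keep them split
        # merge cluster k with its right-hand neighbour
        clusters[k:k + 2] = [(clusters[k][0], clusters[k + 1][1])]
    return clusters

# Toy example: two perfectly correlated blocks of sizes 3 and 2
toy = np.zeros((5, 5))
toy[:3, :3] = 1.0
toy[3:, 3:] = 1.0
result = cut_by_average_similarity(toy, cutoff=0.5)
```

On the toy matrix this returns the two blocks as index ranges `(0, 3)` and `(3, 5)`. The sketch recomputes all neighbouring averages at every step, so it is only meant to illustrate the stopping rule, not to scale to large SNP panels.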