Disable doublet analysis #52

Open
schultzmattd opened this issue Sep 10, 2019 · 7 comments

Comments

@schultzmattd

For one of our use cases, we use demuxlet to compare a single-cell RNA-seq data set against a large number of samples in a VCF. In this instance we don't care about doublet assignments; we just want to find which cells are singlets and which sample is the most likely match for each. Unfortunately, we run out of memory when demuxlet searches for doublets, because there are so many possible pairs of samples. There doesn't seem to be an option to skip doublet searching and so avoid this OOM crash. If no such option exists, would it be possible to implement one? I am happy to try it myself and submit a PR, but it's not clear to me where in the codebase such a change would go. Any other tips for VCF files with a large number of samples would also be appreciated! Thanks in advance.
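
To give a rough sense of the scale involved, here is a small Python sketch of how the number of candidate sample pairs grows with the number of samples in the VCF. It assumes the doublet search considers every unordered pair of distinct samples, which is an assumption about the search space and not a description of demuxlet's internals. `count_sample_pairs` is a hypothetical helper, and the sample counts are only illustrative.

```python
from math import comb


def count_sample_pairs(n_samples: int) -> int:
    """Number of unordered pairs of distinct samples, i.e. n choose 2."""
    return comb(n_samples, 2)


# Illustrative sample counts only; the pair count grows quadratically.
for n in (8, 50, 200):
    print(f"{n} samples -> {count_sample_pairs(n)} candidate pairs")
```

Under that assumption, going from 8 to 50 samples takes the pair count from 28 to 1225, so any per-pair state kept for every cell or SNP would grow by roughly the same factor.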

@hyunminkang
Contributor

hyunminkang commented Sep 10, 2019 via email

@schultzmattd
Author

Thanks so much for the quick reply, Hyun. I didn't realize a colleague of mine had already raised the same request in this issue, where he gave the number of cells and SNPs involved:

~10k cells, ~50 samples (yes, much), ~500k SNPs in my case, memory is ~32 Gb.
(and it worked with ~10k SNPs flawlessly)

@hyunminkang
Contributor

hyunminkang commented Sep 10, 2019 via email

@schultzmattd
Author

Yep, we're able to run the workflow on smaller subsets of samples. Not sure exactly where the breakpoint is, but we've run it successfully with that SNP set on that number of cells with 6-8 individuals.

@VincentGardeux

Fix #59 would resolve the memory issue.
We tested it on ~50 genotypes / 5M SNPs and it ran without an OOM.

@jamesnemesh

I'm interested in disabling doublet analysis for a different reason: errors in pool construction.

Let's say your lab has 200 available samples to pool, and you select a set of 50 for your next pool (we run pools of over 100 samples, so this is a pretty trivial number). You have the expected set of samples, but you'd like to re-identify all of the cells without prior bias, so that a contamination event or a label/plate swap can be detected. There's no need to identify doublets, as you want to assess which samples are significant contributors to the pool, i.e., all samples that account for more cells than the expected assignment error rate would explain.

Once you have that list, you can correct your sample list to the correct set of samples, and then run doublet detection on that set.

Is there a way to effectively split up processing? If not, do you detect sample swap errors by running without the --sm-list argument, so that doublet detection runs on all available sample pairs, even though some may not be in the pool? Thanks for your help.
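
For what it's worth, the first pass of the two-step workflow described in this comment could be sketched roughly as below. This is only an illustration in Python: it assumes you already have one best singlet sample ID per cell from some singlet-only run, and that the expected assignment error rate is expressed as a fraction of all cells. `significant_samples`, its inputs, and the example numbers are all hypothetical and not part of demuxlet.

```python
from collections import Counter


def significant_samples(assignments, error_rate):
    """Return the samples whose share of cells exceeds the expected error rate.

    assignments: iterable with the best singlet sample ID for each cell
                 (hypothetical input, not a demuxlet output format).
    error_rate:  assumed fraction of cells that could be misassigned to a
                 sample that is not actually in the pool.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return []
    # Keep only samples that contribute more cells than error alone would explain.
    return sorted(s for s, c in counts.items() if c / total > error_rate)


# Made-up example: sample C gets only 2% of cells, below a 5% threshold.
cells = ["A"] * 60 + ["B"] * 38 + ["C"] * 2
print(significant_samples(cells, error_rate=0.05))
```

The resulting list is the corrected sample set on which doublet detection would then be run in a second pass, for example by supplying it through --sm-list, assuming that argument restricts the comparison to the listed samples.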

@hyunminkang
Contributor

hyunminkang commented Jul 20, 2020 via email
