Disable doublet analysis #52

schultzmattd · 2019-09-10T00:03:39Z

For one of our use cases, we use demuxlet to compare a single cell RNA-seq data set to a large number of samples in a VCF. In this instance, we don't care about doublet assignments, but just want to find which cells are singlets and which is the most likely sample. Unfortunately, we run into memory issues when demuxlet tries to find doublets as there are so many pairs of possible samples. It doesn't seem like an option exists to avoid this OOM crash (i.e., skip doublet searching). If it doesn't exist, would it be possible to implement a feature like this? I am happy to try myself and submit a PR, but it's not clear to me where in the codebase such a change would go. Any other tips for VCF files that have a large number of samples would also be appreciated! Thanks in advance.

hyunminkang · 2019-09-10T13:07:24Z

How many cells, SNPs, and individuals are you considering? The doublet search may not be causing the memory errors, so I wanted to make sure.

Hyun

Hyun Min Kang, Ph.D.
Associate Professor of Biostatistics
University of Michigan, Ann Arbor
Email: hmkang@umich.edu


schultzmattd · 2019-09-10T14:47:37Z

Thanks so much for the quick reply, Hyun. I didn't realize a colleague of mine had already raised the same request in issue #37, where he gave the number of cells and SNPs:

~10k cells, ~50 samples (yes, much), ~500k SNPs in my case, memory is ~32 Gb.
(and it worked with ~10k SNPs flawlessly)

hyunminkang · 2019-09-10T14:51:31Z

Does it work with a smaller number of samples? I just wanted to make sure that the issue is doublet detection. Also, https://github.com/statgen/popscle can run demuxlet too, and I suspect that this may result in a lower memory footprint, although the preprocessing step may consume quite a bit of memory.

Thanks, Hyun.


schultzmattd · 2019-09-10T14:53:33Z

Yep, we're able to run the workflow on smaller subsets of samples. Not sure exactly where the breakpoint is, but we've run it successfully with that SNP set on that number of cells with 6-8 individuals.
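For a rough sense of scale, the number of candidate sample pairs grows quadratically with the number of samples in the VCF. The sketch below assumes the doublet search considers every unordered pair of distinct samples; that is an assumption about demuxlet's behavior, and this thread does not establish that the pair search is the actual source of the memory failure. The function name `doublet_pair_count` is hypothetical.

```python
from math import comb


def doublet_pair_count(n_samples: int) -> int:
    """Number of unordered pairs of distinct samples (assumed doublet candidates)."""
    return comb(n_samples, 2)


# Compare the sample counts mentioned in this thread (6-8 worked, ~50 did not).
for n in (6, 8, 50):
    print(f"{n} samples -> {doublet_pair_count(n)} candidate pairs")
```

Under that assumption, going from 8 to 50 samples multiplies the number of candidate pairs by more than 40.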

VincentGardeux · 2020-01-31T09:08:58Z

The fix in #59 would resolve the memory issue.
We tested on ~50 genotypes / 5M SNPs and it runs without OOM.

jamesnemesh · 2020-07-18T16:06:58Z

I'm interested in disabling doublet analysis for a different reason: errors in pool construction.

Let's say your lab has 200 available samples to pool, and you select a set of 50 for your next pool (we run pools of over 100 samples, so this is a pretty trivial number). You have the expected set of samples, but you'd like to re-identify all of the cells without prior bias, so that a contamination event or a label/plate swap can be detected. There's no need to identify doublets, as you want to assess which samples are significant contributors to the pool, i.e., all samples that have more cells than the expected assignment error rate.

Once you have that list, you can correct your sample list to the correct set of samples, and then run doublet detection on that set.
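A minimal sketch of the first stage of that two-pass idea, assuming you already have one best-matching sample ID per cell from a singlet-only assignment. The input format, the function name `significant_samples`, and the reading of "more cells than the expected assignment error rate" as a fraction of all assigned cells are all assumptions made for illustration; none of this is demuxlet's own output format or API.

```python
from collections import Counter


def significant_samples(assignments, error_rate):
    """Return sample IDs whose share of cells exceeds the expected error rate.

    assignments: iterable of sample IDs, one per cell (hypothetical input).
    error_rate:  expected fraction of cells misassigned to any one sample.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return []
    # Keep only samples that contribute more cells than misassignment alone would explain.
    return sorted(s for s, n in counts.items() if n / total > error_rate)


# Toy data: S9 receives 1% of cells, below a 2% expected error rate.
cells = ["S1"] * 480 + ["S2"] * 510 + ["S9"] * 10
print(significant_samples(cells, error_rate=0.02))
```

The resulting list could then serve as the corrected sample list for a second run that includes doublet detection.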

Is there a way to effectively split up processing? If not, do you detect sample swap errors by running without the --sm-list argument, and then doublet detection runs on all available sample pairs, even though some may not be in the pool? Thanks for your help.

hyunminkang · 2020-07-20T15:44:37Z

I think it is straightforward to disable doublet analysis or speed up the doublet detection process on some occasions. We will do this in statgen/popscle, as that package should have demuxlet implemented and is more actively managed.

Thanks, Hyun.

