Disable doublet analysis #52
How many cells, SNPs, and individuals are you considering? The doublet search may not be causing the memory errors, so I wanted to make sure.
Hyun.
-----------------------------------------------------
Hyun Min Kang, Ph.D.
Associate Professor of Biostatistics
University of Michigan, Ann Arbor
Email : hmkang@umich.edu
On Mon, Sep 9, 2019 at 8:03 PM Matt Schultz wrote:
For one of our use cases, we use demuxlet to compare a single cell RNA-seq data set to a large number of samples in a VCF. In this instance, we don't care about doublet assignments, but just want to find which cells are singlets and which is the most likely sample. Unfortunately, we run into memory issues when demuxlet tries to find doublets as there are so many pairs of possible samples. It doesn't seem like an option exists to avoid this OOM crash (i.e., skip doublet searching). If it doesn't exist, would it be possible to implement a feature like this? I am happy to try myself and submit a PR, but it's not clear to me where in the codebase such a change would go. Any other tips for VCF files that have a large number of samples would also be appreciated! Thanks in advance.
Thanks so much for the quick reply Hyun. I didn't realize a colleague of mine had pointed out the same request in issue #37, where he noted how many cells/SNPs we're working with.
Does it work with a smaller number of samples? I just wanted to make sure that the issue is doublet detection.
Also, https://github.com/statgen/popscle can run demuxlet too, and I suspect it may have a lower memory footprint, although the preprocessing step may consume quite a bit of memory.
Thanks,
Hyun.
On Tue, Sep 10, 2019 at 10:47 AM Matt Schultz wrote:
Thanks so much for the quick reply Hyun. I didn't realize a colleague of mine had pointed out the same request on this issue (#37) where he pointed out how many cells/SNPs:
~10k cells, ~50 samples (yes, much), ~500k SNPs in my case, memory is ~32 Gb. (and it worked with ~10k SNPs flawlessly)
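For intuition on the memory blow-up described here: the doublet search scores every unordered pair of samples, so the number of hypotheses grows quadratically with the sample list. A quick back-of-the-envelope sketch (plain Python; illustrative arithmetic only, not demuxlet's actual code):

```python
def hypothesis_counts(n_samples):
    """Number of identity hypotheses a demuxlet-style demultiplexer must
    score per cell: one per singlet, plus one per unordered pair of
    samples for doublets. (Illustrative only, not demuxlet's code.)"""
    singlets = n_samples
    doublet_pairs = n_samples * (n_samples - 1) // 2
    return singlets, doublet_pairs

for n in (8, 50):
    s, d = hypothesis_counts(n)
    print(f"{n} samples: {s} singlet vs {d} doublet hypotheses")
```

With 8 individuals there are only 28 doublet pairs, but at 50 there are 1,225, and each hypothesis carries per-SNP likelihood state — which may be roughly why runs succeed with 6-8 individuals but hit OOM around 50 samples and ~500k SNPs.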
Yep, we're able to run the workflow on smaller subsets of samples. Not sure exactly where the breakpoint is, but we've run it successfully with that SNP set on that number of cells with 6-8 individuals.
Fixing #59 would fix the memory issue.
I'm interested in disabling doublet analysis for a different reason: errors in pool construction. Let's say your lab has 200 available samples to pool, and you select a set of 50 for your next pool (we run pools of over 100 samples, so this is a pretty trivial number). You have the expected set of samples, but you'd like to re-identify all of the cells without prior bias, such that a contamination event or a label/plate swap can be detected. There's no need to identify doublets, as you want to assess which samples are significant contributors to the pool, i.e., all samples that have more cells than the expected assignment error rate. Once you have that list, you can correct your sample list to the correct set of samples, and then run doublet detection on that set.

Is there a way to effectively split up processing? If not, do you detect sample swap errors by running without the --sm-list argument, and then doublet detection runs on all available sample pairs, even though some may not be in the pool? Thanks for your help.
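One way to approximate the "which samples actually contributed" step is to tally singlet assignments per sample and keep those above what the expected misassignment rate alone would explain. A minimal sketch (plain Python; the SNG-&lt;sample&gt; label format mirrors demuxlet's .best output, but treat the exact format, the threshold formula, and the error rate as assumptions to tune per experiment):

```python
from collections import Counter

def significant_contributors(best_calls, n_candidates, error_rate=0.01):
    """Flag samples whose singlet cell count exceeds what random
    misassignment alone would explain.

    best_calls  : list of BEST-style labels, e.g. "SNG-A", "DBL-A-B", "AMB"
    n_candidates: number of candidate samples the tool was run against
    error_rate  : assumed per-cell misassignment rate (an assumption)
    """
    singlets = [c.split("-", 1)[1] for c in best_calls if c.startswith("SNG-")]
    counts = Counter(singlets)
    # Expected cells per sample if assignments were pure error:
    # misassigned cells spread uniformly over the candidate samples.
    threshold = len(best_calls) * error_rate / n_candidates
    return {s: n for s, n in counts.items() if n > threshold}

# Toy example: 6 cells against 50 candidate samples
calls = ["SNG-A", "SNG-A", "SNG-A", "SNG-B", "DBL-A-B", "AMB"]
print(significant_contributors(calls, n_candidates=50))
```

After this first pass, the surviving sample list could be fed back (e.g. via --sm-list) for a second run with doublet detection enabled, as described above.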
I think it is straightforward to disable doublet analysis, or to speed up the doublet detection process in some cases. We will do this in statgen/popscle, as that package should have demuxlet implemented and be more actively maintained.
Thanks,
Hyun.
On Sat, Jul 18, 2020 at 12:07 PM jamesnemesh wrote: […]