base-depth becomes extremely slow when calculating a highly covered site. #59
Comments
Thanks for making an issue! I'll check it out in the next few days. Could you please share the version of perbase that you are using? (I think it's in the output of ...) It's been a while since I dug into the perbase code, but it's not impossible that a pileup of that depth would be slow. A good comparison would be to do a pileup with samtools as a sanity check, since perbase is using htslib under the hood.
I did test samtools mpileup and my customized pileup script using rust-htslib. The speed is not fast but acceptable. It took 0.5s for samtools mpileup.
I waited more than 30 mins, but perbase was still running. I used all the threads, and the run was not I/O or memory bound.
So, I think the issue is that you need to specify the region for perbase. Without one, it spins away doing nothing useful, looking at every possible base from the BAM header. If you give it a region it will be much faster (seconds).
I do think this is unintuitive and still counts as a bug. Work could also be done so that perbase skips the setup / teardown for every TID when it can tell from the start that it isn't needed. So I'll leave this open for any future work.
Thanks. I remember I also tested with BED file input and also waited more than 20 min. I will post my test results later.
It seems there was a problem with my previous test. I might have forgotten to specify the BED file. I ran the analysis with a BED file just now, and it finished in ~1 min. That is quicker than before but still too slow.
I'm pretty sure it's slow because it checks every contig in the BAM header, loads up that contig's reference, spins up resources to process it, then discovers no reads for that contig. Basically, it's not at all optimized for single-query work. I'm totally open to PRs on this! And I'll leave this open in the event of future work on perbase.
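One way to picture the optimization being discussed is to decide up front which contigs have any requested interval and skip the rest before doing per-contig setup. The following is only a minimal sketch in Python, not perbase's actual code: the function name `contigs_to_process`, the header list, and the interval tuples are all hypothetical.

```python
from collections import defaultdict


def contigs_to_process(header_contigs, intervals):
    """Return only the contigs that have at least one requested interval.

    header_contigs: contig names in BAM-header order (hypothetical input).
    intervals: (contig, start, end) tuples, 0-based half-open as in BED.
    """
    by_contig = defaultdict(list)
    for contig, start, end in intervals:
        if end > start:  # ignore empty intervals
            by_contig[contig].append((start, end))

    # Keep header order, but drop contigs nobody asked about, so that no
    # per-contig setup/teardown is spent on them.
    return [
        (name, sorted(by_contig[name]))
        for name in header_contigs
        if name in by_contig
    ]


# Illustrative header; a real GRCh38 BAM header lists many more contigs.
header = ["1", "2", "21", "22", "X", "Y", "MT"]
work = contigs_to_process(header, [("21", 8215301, 8215302)])
print(work)  # only contig "21" is left to process
```

With a filter like this, a single-site query would touch one contig instead of every entry in the header. Whether this is the right place to cut the work in perbase itself is an open question for anyone picking up a PR.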
Position 21:8215302 on GRCh38 (Ensembl ref) is used for debugging. When calculating this single site, whose depth is ~10k, the command took more than 10 min and still did not finish.
debug.tar.gz
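For anyone reproducing this with a BED file restricted to the single site, note that BED coordinates are 0-based and half-open, so the 1-based position 8215302 becomes the interval 8215301 to 8215302. A small sketch of that conversion follows; the helper name `site_to_bed_line` is hypothetical and is not part of perbase.

```python
def site_to_bed_line(contig, pos_1based):
    """Convert a 1-based position into a tab-separated single-base BED line."""
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    # BED uses a 0-based start and an exclusive end.
    return f"{contig}\t{pos_1based - 1}\t{pos_1based}"


line = site_to_bed_line("21", 8215302)
print(line)
```

Writing that line to a file gives a one-record BED that covers only the debugging site.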