
remove duplicate for single-end read #40

Closed
timedreamer opened this issue Aug 29, 2019 · 4 comments

Comments

@timedreamer

Hi, I saw a repo that used samblaster for removing single-end duplicates (Repo link).

Is samblaster suitable for removing duplicates from single-end reads? If so, does it work similarly to Picard MarkDuplicates?

Thank you for your time.
Ji

@GregoryFaust
Owner

GregoryFaust commented Sep 16, 2019

Yes, samblaster can be used with single-end reads if you use the --ignoreUnmated option. See also issue 37 for additional suggestions as to how to use this flag.

As to how the results will compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster is much faster than Picard, but makes the trade-off that it keeps the first read of each set of duplicates found in the input file, while Picard keeps the "best" read of each set. To do this, Picard must make two passes over an input file that has been written to disk (it cannot run in a pipe), hence the slower performance. To read more about this, please see the original samblaster paper.
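The difference between the two selection strategies can be sketched in a few lines of Python. This is a toy illustration only, not the actual implementation of either tool: the `sig` (duplicate signature) and `qual` fields, and the choice of "highest base-quality sum" as the tiebreaker, are hypothetical stand-ins.

```python
def keep_first(reads):
    """Streaming, single pass: keep the first read seen per signature
    (the samblaster-style trade-off)."""
    kept = {}
    for read in reads:
        kept.setdefault(read["sig"], read)
    return list(kept.values())

def keep_best(reads):
    """Two passes over the data: first choose the best read per
    signature, then emit the winners (the Picard-style trade-off)."""
    best = {}
    for read in reads:  # pass 1: pick a winner per signature
        cur = best.get(read["sig"])
        if cur is None or read["qual"] > cur["qual"]:
            best[read["sig"]] = read
    return [r for r in reads if best[r["sig"]] is r]  # pass 2: emit

reads = [
    {"name": "r1", "sig": ("chr1", 100, "+"), "qual": 30},
    {"name": "r2", "sig": ("chr1", 100, "+"), "qual": 38},  # later but "better" duplicate
    {"name": "r3", "sig": ("chr2", 500, "-"), "qual": 35},
]
print([r["name"] for r in keep_first(reads)])  # → ['r1', 'r3']
print([r["name"] for r in keep_best(reads)])   # → ['r2', 'r3']
```

Note that `keep_first` never needs to revisit earlier input, which is why it can run in a pipe, while `keep_best` cannot know the winner until it has seen every duplicate.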

How much this difference in approach affects the quality of the output depends on the quality of the reads in the input. The lower the per-base error rate of the input reads, the closer the results of samblaster and Picard will be. For modern Illumina sequencing, the reads are so uniformly high in quality that the difference in duplicate-selection procedures results in a negligible disparity in the quality of the reads in the output file. You can read more about this here: SAMBLASTER_Supplemental.pdf. If your reads come from a sequencing technology with a significantly higher per-base error rate, the disparity will be larger. In that case, I suggest you run on a trial file and decide whether the trade-off between speed and output quality is the right one for your application.

@timedreamer
Author

Cool. Very clearly explained. Thanks!

@GregoryFaust
Owner

Reopening, as this is not the first time this question has been asked; someone will undoubtedly want to know in the future.

@GregoryFaust
Owner

Release 0.1.25 provides better support for using samblaster to mark duplicates in files containing singleton long reads in two ways. First, the algorithm for finding duplicates for singletons was changed to allow forward and reverse strand reads to be duplicates of one another when the rest of the positional information of the alignment matches. Second, a --maxReadLength parameter has been added to fix the issues raised in issue #43. See the release notes for more information.
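The first change above can be pictured as dropping strand from the duplicate signature used for singleton reads. The sketch below is a toy model only; the field names and the exact composition of samblaster's internal signature are hypothetical.

```python
# Toy model of the 0.1.25 singleton change (not samblaster's internals).

def sig_pre_0_1_25(chrom, pos, strand):
    # Strand is part of the signature: a forward and a reverse read at
    # the same position get different signatures and are never duplicates.
    return (chrom, pos, strand)

def sig_0_1_25(chrom, pos, strand):
    # Strand is ignored: reads whose remaining positional information
    # matches are duplicates regardless of orientation.
    return (chrom, pos)

fwd = ("chr1", 100, "+")
rev = ("chr1", 100, "-")
print(sig_pre_0_1_25(*fwd) == sig_pre_0_1_25(*rev))  # → False
print(sig_0_1_25(*fwd) == sig_0_1_25(*rev))          # → True
```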
