
remove duplicate for single-end read #40

Closed
timedreamer opened this issue Aug 29, 2019 · 4 comments

Comments

@timedreamer

Hi, I saw a repo that used samblaster for removing single-end duplicates (Repo link).

Is samblaster suitable for removing duplicates from single-end reads? If so, does it work similarly to Picard MarkDuplicates?

Thank you for your time.
Ji

@GregoryFaust
Owner

GregoryFaust commented Sep 16, 2019

Yes, samblaster can be used with single-end reads if you use the --ignoreUnmated option. See also issue 37 for additional suggestions as to how to use this flag.

As to how the results will compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster is much faster than Picard, but makes the trade-off that it keeps the first read of each set of duplicates found in the input file, while Picard keeps the "best" read of each set. To do this, Picard must make two passes over an input file that has been written to disk (it cannot run in a pipe), hence the slower performance. To read more about this, please see the original samblaster paper.
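The difference between the two selection strategies can be sketched in a few lines of Python. This is a toy illustration only, not the actual implementation of either tool: the `sig` (duplicate signature) and `qual` fields, and the choice of "highest base-quality sum" as the tiebreaker, are hypothetical stand-ins.

```python
def keep_first(reads):
    """Streaming, single pass: keep the first read seen per signature
    (the samblaster-style trade-off)."""
    kept = {}
    for read in reads:
        kept.setdefault(read["sig"], read)
    return list(kept.values())

def keep_best(reads):
    """Two passes over the data: first choose the best read per
    signature, then emit the winners (the Picard-style trade-off)."""
    best = {}
    for read in reads:  # pass 1: pick a winner per signature
        cur = best.get(read["sig"])
        if cur is None or read["qual"] > cur["qual"]:
            best[read["sig"]] = read
    return [r for r in reads if best[r["sig"]] is r]  # pass 2: emit

reads = [
    {"name": "r1", "sig": ("chr1", 100, "+"), "qual": 30},
    {"name": "r2", "sig": ("chr1", 100, "+"), "qual": 38},  # later but "better" duplicate
    {"name": "r3", "sig": ("chr2", 500, "-"), "qual": 35},
]
print([r["name"] for r in keep_first(reads)])  # → ['r1', 'r3']
print([r["name"] for r in keep_best(reads)])   # → ['r2', 'r3']
```

Note that `keep_first` never needs to revisit earlier input, which is why it can run in a pipe, while `keep_best` cannot know the winner until it has seen every duplicate.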

How much this difference in approach affects the quality of the output depends on the quality of the reads in the input. The lower the per-base error rate of the input reads, the closer the results of samblaster and Picard will be. For modern Illumina sequencing, the reads are so uniformly high in quality that the difference in duplicate-selection procedures results in a negligible disparity in the quality of the reads in the output file. You can read more about this here: SAMBLASTER_Supplemental.pdf. If your reads come from a sequencing technology with a significantly higher per-base error rate, the disparity will be larger. In that case, I suggest you run on a trial file and decide whether the trade-off between speed and output quality is the right one for your application.

@timedreamer
Author

Cool. Very clearly explained. Thanks!

@GregoryFaust
Owner

Reopening, as this is not the first time this question has been asked; someone will undoubtedly want to know in the future.

@GregoryFaust
Owner

Release 0.1.25 provides better support for using samblaster to mark duplicates in files containing singleton long reads in two ways. First, the algorithm for finding duplicates for singletons was changed to allow forward and reverse strand reads to be duplicates of one another when the rest of the positional information of the alignment matches. Second, a --maxReadLength parameter has been added to fix the issues raised in issue #43. See the release notes for more information.
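The first change above can be pictured as dropping strand from the duplicate signature used for singleton reads. The sketch below is a toy model only; the field names and the exact composition of samblaster's internal signature are hypothetical.

```python
# Toy model of the 0.1.25 singleton change (not samblaster's internals).

def sig_pre_0_1_25(chrom, pos, strand):
    # Strand is part of the signature: a forward and a reverse read at
    # the same position get different signatures and are never duplicates.
    return (chrom, pos, strand)

def sig_0_1_25(chrom, pos, strand):
    # Strand is ignored: reads whose remaining positional information
    # matches are duplicates regardless of orientation.
    return (chrom, pos)

fwd = ("chr1", 100, "+")
rev = ("chr1", 100, "-")
print(sig_pre_0_1_25(*fwd) == sig_pre_0_1_25(*rev))  # → False
print(sig_0_1_25(*fwd) == sig_0_1_25(*rev))          # → True
```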
