remove duplicate for single-end read #40
Comments
Yes, samblaster can be used with single-end reads if you use the --ignoreUnmated option. See also issue 37 for additional suggestions on how to use this flag.

As to how the results compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster is much faster than Picard, but makes a trade-off: it keeps the first of a set of duplicates found in the input file, while Picard keeps the "best" of the set. To do this, Picard must make two passes over an input file that has been written to disk (not read from a pipe), hence the slower performance. To read more about this, please see the original samblaster paper.

How much this difference in approach affects the quality of the output depends on the quality of the reads in the input. The lower the per-base error rate in the input reads, the closer the results of samblaster and Picard will be. For modern Illumina sequencing, the reads are so uniformly high in quality that the difference in duplicate selection results in a negligible disparity in the quality of the reads in the output file. You can read more about this here: SAMBLASTER_Supplemental.pdf. If your reads come from a sequencing technology with a significantly higher per-base error rate, the disparity will be larger. In that case, I suggest you do a run on a trial file and determine whether the trade-off between speed and output read quality is the right one for your application.
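To make the trade-off concrete, here is a minimal Python sketch of the two selection strategies. It is a toy model, not the code of either tool: the `Read` record, the `key` field (standing in for whatever positional signature identifies duplicates), and the `base_quality_sum` score used to pick the "best" read are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical toy record. "key" stands in for the positional signature that
# marks two reads as duplicates; "base_quality_sum" is an assumed score.
Read = namedtuple("Read", ["name", "key", "base_quality_sum"])


def keep_first(reads):
    """Single streaming pass: the first read seen for each key survives."""
    seen = set()
    for read in reads:
        if read.key not in seen:
            seen.add(read.key)
            yield read


def keep_best(reads):
    """Two passes: find the best-scoring read per key, then emit only those."""
    reads = list(reads)  # the whole input must be available, so no pipe
    best = {}
    for read in reads:
        current = best.get(read.key)
        if current is None or read.base_quality_sum > current.base_quality_sum:
            best[read.key] = read
    for read in reads:
        if best[read.key] is read:
            yield read


reads = [
    Read("r1", ("chr1", 100), 950),
    Read("r2", ("chr1", 100), 990),  # duplicate of r1 with a higher score
    Read("r3", ("chr2", 500), 970),
]

first_names = [r.name for r in keep_first(reads)]
best_names = [r.name for r in keep_best(reads)]
print(first_names)  # ['r1', 'r3']
print(best_names)   # ['r2', 'r3']
```

In this model the two outputs differ only in which member of a duplicate set is kept. When the scores within a set are nearly identical, as with uniformly high-quality reads, the choice matters little.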
Cool. Very clearly explained. Thanks!
Reopening, as this is not the first time this question has been asked. Someone will undoubtedly want to know in the future.
Release 0.1.25 improves support for using samblaster to mark duplicates in files containing singleton long reads in two ways. First, the algorithm for finding duplicates among singletons was changed so that forward and reverse strand reads can be duplicates of one another when the rest of the positional information of the alignment matches. Second, a --maxReadLength parameter has been added to fix the problems raised in issue #43. See the release notes for more information.
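The first change can be pictured as dropping the strand from the key used to compare singletons. The sketch below is a toy model under that assumption, not samblaster's implementation: the `Singleton` record and its `chrom`, `start`, and `end` fields are hypothetical stand-ins for "the rest of the positional information", which the release notes describe precisely.

```python
from collections import namedtuple

# Hypothetical record; the fields are assumptions made for illustration.
Singleton = namedtuple("Singleton", ["name", "chrom", "start", "end", "strand"])


def strand_aware_key(read):
    """Reads on opposite strands never share a key."""
    return (read.chrom, read.start, read.end, read.strand)


def strand_agnostic_key(read):
    """Strand is ignored, so forward and reverse reads can match."""
    return (read.chrom, read.start, read.end)


def duplicate_names(reads, key_fn):
    """Return the names of reads whose key was already seen (first one wins)."""
    seen = set()
    dups = []
    for read in reads:
        key = key_fn(read)
        if key in seen:
            dups.append(read.name)
        else:
            seen.add(key)
    return dups


reads = [
    Singleton("a", "chr1", 1000, 6000, "+"),
    Singleton("b", "chr1", 1000, 6000, "-"),  # same span as a, opposite strand
    Singleton("c", "chr1", 2000, 7000, "+"),
]

aware = duplicate_names(reads, strand_aware_key)
agnostic = duplicate_names(reads, strand_agnostic_key)
print(aware)     # []
print(agnostic)  # ['b']
```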
Hi, I saw a repo that uses samblaster to remove duplicates from single-end reads (repo link).
Is samblaster suitable for removing duplicates from single-end reads? If so, does it work similarly to Picard MarkDuplicates?
Thank you for your time.
Ji