Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options for picard MarkDuplicates have no effect #1959

Open
ChrisHIV opened this issue Apr 16, 2024 · 10 comments

Comments

@ChrisHIV
Copy link

ChrisHIV commented Apr 16, 2024

Bug Report

Affected tool(s)

picard MarkDuplicates with the --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options

Affected version(s)

Latest public release version [3.1.1]

Description

The --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options have no effect in minimal test data. From my understanding of the help messages for these options (reproduced at the bottom of this message), the former should toggle whether or not we consider unpaired reads to be duplicates if they have the same start position but different end positions, and the latter should toggle whether clipped ends of unpaired reads are included or excluded when determining duplicates (I do not understand whether inclusion/exclusion corresponds to true/false for this bool or vice versa).

Steps to reproduce

The attached reads_sam.txt has 8 reads mapped to an 8-bp reference genome, attached as
reference_fasta.txt. (Both files have had their extensions changed to .txt to allow attachment.) The attached image shows the reads for convenience.
reads_sam
These reads all have the central 6bp mapped, but they vary in whether there is an additional base at one end or the other and whether that base is mapped or clipped. After renaming reads_sam.txt to reads_sam.sam to clarify the format, run e.g.

input=reads_sam.sam
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_NoFlags.sam
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_END_IN_UNPAIRED_READS.sam --USE_END_IN_UNPAIRED_READS
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_UNPAIRED_CLIPPED_END.sam --USE_UNPAIRED_CLIPPED_END
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_END_IN_UNPAIRED_READS__USE_UNPAIRED_CLIPPED_END.sam --USE_END_IN_UNPAIRED_READS --USE_UNPAIRED_CLIPPED_END

Expected behavior

The four output sam files corresponding to the four combinations of these two binary flags should vary in which subset of reads are included after removing duplicates, because the reads vary in their potential to be considered duplicates based on the description of the flags.

Actual behavior

The four output sam files are identical in their read content (containing reads 1 and 6).

Additional comments

I tried to follow the recommendation to first post on the forum, but clicking on the 'Sign in' tab takes me to this page where I cannot see any option to sign in or create a new profile.
And here is the help for those two options, for convenience:

--USE_END_IN_UNPAIRED_READS <Boolean>
                              Make the end location of single end read be significant when considering duplicates, in
                              addition to the start location, which is always significant (i.e. require single-ended
                              reads to start andend on the same position to be considered duplicate) (for this argument,
                              "read end" means 3' end).  Default value: false. Possible values: {true, false} 
--USE_UNPAIRED_CLIPPED_END <Boolean>
                              Use position of the clipping as the end position, when considering duplicates (or use the
                              unclipped end position) (for this argument, "read end" means 3' end).  Default value:
                              false. Possible values: {true, false} 
@kockan
Copy link
Contributor

kockan commented Apr 17, 2024

Looks like those options are specific to flow-based reads, therefore they will not be used at all unless you also set --FLOW_MODE true in your command(s).

I've tried this with the example alignments you've provided and while it did run successfully, I'd recommend extra caution if you ever decide to use these options with non-flow-based single-end reads, since I'd assume they were not intended for such use.

@ChrisHIV
Copy link
Author

Thanks. Indeed using --FLOW_MODE on that test data, --USE_END_IN_UNPAIRED_READS changes which reads are duplicates. However, --USE_UNPAIRED_CLIPPED_END still has no effect and it's unclear to me to why: the description suggests that it toggles whether clipped ends are included or excluded, which I think should affect where reads are deemed to start and end, and thus affect which are considered duplicates.

Maybe helpful to add to the two help messages that --FLOW_MODE must be used? Also good to make the help clearer which of true/false corresponds to inclusion/exclusion of clipped ends for --USE_UNPAIRED_CLIPPED_END.

@kockan
Copy link
Contributor

kockan commented Apr 22, 2024

I agree with adding clarification(s) to the documentation. I'll try getting more details on whether these options are safe to be used with non-flow single-end reads before closing the issue.

@kachulis
Copy link
Contributor

perhaps there should be some error if those options are used without --FLOW_MODE turned on?

@kockan
Copy link
Contributor

kockan commented Apr 22, 2024

Absolutely, if these are truly only meant to be used with flow-based reads.

@kockan
Copy link
Contributor

kockan commented Apr 22, 2024

@meganshand @ilyasoifer tagging for opinions on this (and potentially similar issues with other options in this/other tools)

@ilyasoifer
Copy link
Contributor

@kockan - thanks! We will discuss and propose how to best deal with this.

@ilyasoifer
Copy link
Contributor

ilyasoifer commented May 3, 2024

@meganshand and @kockan, @ChrisHIV - unless you see a clear use case for single end reads (that are not of a constant length in bases) we will update the help string as suggested by the issue to indicate that it should only be used for flow reads

@kockan
Copy link
Contributor

kockan commented May 3, 2024

@ilyasoifer Sounds reasonable to me. One small additional request: if these options are set in the command-line arguments without --FLOW_MODE, we should throw an exception (edit: decided an exception would be better than a warning after some initial thought)

@droazen
Copy link
Contributor

droazen commented Sep 10, 2024

#1976 was merged, which adds a FLOW_ prefix to these options, but to close out this ticket we also need a check that flow mode is active when these options are specified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants