expanded parameterization of align_and_count and additional output metrics #525

Merged Mar 11, 2024 (13 commits)

@tomkinsc tomkinsc commented Mar 9, 2024

Summary

This PR adds functionality to optionally filter reads after mapping in the align_and_count task, so the counts of mapped reads are comparable to those following filtering during genome assembly. It also adds new numeric outputs relevant for general QC purposes.

New input parameters

The filtering has the following parameters:

  • filter_bam_to_proper_primary_mapped_reads: enable filtering
    • default: false (no filtering is performed)
  • do_not_require_proper_mapped_pairs_when_filtering: do not exclude reads lacking the "proper pair" bit; if filtering is enabled, set this to true when using single-end reads as input, since such reads never carry that bit
    • default: false (reads are filtered to proper pairs if filtering is enabled)
  • keep_singletons_when_filtering: singleton reads from paired-end data are kept; this does not affect single-end reads
    • default: false (singleton reads are excluded during filtering)
  • keep_duplicates_when_filtering: reads marked as duplicates are kept; this does not supersede exclusion for violations of other criteria
    • default: false (duplicate reads are excluded during filtering)
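
To make the interaction of the four toggles concrete, here is a minimal Python sketch of one plausible reading of them as a predicate over SAM flag bits. It is not the task's actual implementation, which is not shown on this page. The bit values come from the SAM specification; the function name keep_read, the treatment of supplementary alignments as non-primary, and the choice to exempt kept singletons from the proper-pair check are assumptions made for illustration.

```python
# SAM flag bits (from the SAM specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_read(
    flag,
    *,
    filter_bam_to_proper_primary_mapped_reads=False,
    do_not_require_proper_mapped_pairs_when_filtering=False,
    keep_singletons_when_filtering=False,
    keep_duplicates_when_filtering=False,
):
    """Hypothetical helper: decide whether a read with this SAM flag is counted."""
    if not filter_bam_to_proper_primary_mapped_reads:
        return True  # filtering disabled: every read is kept

    # Assumption: "primary mapped" excludes unmapped, secondary,
    # and supplementary alignments.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False

    # Duplicates are dropped unless explicitly kept; keeping them does not
    # override the other criteria checked here.
    if flag & DUPLICATE and not keep_duplicates_when_filtering:
        return False

    # Singleton: a paired-end read whose mate is unmapped.
    # Assumption: kept singletons skip the proper-pair check below,
    # since they can never carry the proper-pair bit.
    is_singleton = bool(flag & PAIRED and flag & MATE_UNMAPPED)
    if is_singleton:
        return keep_singletons_when_filtering

    if do_not_require_proper_mapped_pairs_when_filtering:
        return True
    return bool(flag & PROPER_PAIR)


# A mapped single-end read (flag 0) is dropped by the default proper-pair
# requirement, which is why single-end input needs the override.
single_end_default = keep_read(0, filter_bam_to_proper_primary_mapped_reads=True)
single_end_override = keep_read(
    0,
    filter_bam_to_proper_primary_mapped_reads=True,
    do_not_require_proper_mapped_pairs_when_filtering=True,
)
```

Under these assumptions, single_end_default is False and single_end_override is True, which matches the note above about single-end input.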

New output metrics

This PR also adds new numeric output metrics to align_and_count:

  • pct_total_reads_mapped: the percent of input reads mapping to any of the input reference sequences
    • this is helpful for assessing the fraction of reads in a sample originating from sources corresponding to the reference sequences
  • pct_lesser_hits_of_mapped: of the reads mapping to reference sequences input to align_and_count, the percent mapping to hits that are not the top hit
    • this is helpful for assessing cross-talk between hits
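
As a worked illustration of the two definitions above, the following Python sketch computes both percentages from a total input read count and per-reference mapped read counts. The function name mapping_metrics, the input shapes, and the zero-denominator handling are assumptions for illustration; the task may derive and round these values differently.

```python
def mapping_metrics(total_input_reads, mapped_reads_per_reference):
    """Hypothetical helper: compute the two percentages described above.

    mapped_reads_per_reference maps a reference sequence name to the number
    of reads counted as mapping to it.
    """
    counts = list(mapped_reads_per_reference.values())
    total_mapped = sum(counts)
    top_hit = max(counts, default=0)

    # Percent of all input reads that mapped to any reference sequence.
    pct_total = (
        100.0 * total_mapped / total_input_reads if total_input_reads else 0.0
    )
    # Of the mapped reads, percent that mapped to anything but the top hit.
    pct_lesser = (
        100.0 * (total_mapped - top_hit) / total_mapped if total_mapped else 0.0
    )
    return {
        "pct_total_reads_mapped": pct_total,
        "pct_lesser_hits_of_mapped": pct_lesser,
    }


# Example: 1000 input reads, 500 of which map across three references.
example = mapping_metrics(1000, {"refA": 400, "refB": 75, "refC": 25})
```

In this example, 500 of 1000 reads map (pct_total_reads_mapped of 50.0), and 100 of those 500 map to references other than the top hit (pct_lesser_hits_of_mapped of 20.0).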

The new outputs are exposed in several of the workflows that have singular outputs from align_and_count. A few other workflows call align_and_count, but output an aggregate report with info from multiple inputs.

Recommended usage

The following values are recommended for most use cases, to count high-quality read mappings with duplicates included.

  • filter_bam_to_proper_primary_mapped_reads=true
  • keep_duplicates_when_filtering=true
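
For workflows run from an inputs JSON, the recommended values could be generated roughly as in the Python sketch below. The key prefix my_workflow.align_and_count is a placeholder, not a name from this PR; the real prefix depends on which workflow calls the task and how that call is named.

```python
import json


def recommended_inputs(call_prefix):
    """Hypothetical helper: build inputs-JSON entries for the recommended settings."""
    return {
        f"{call_prefix}.filter_bam_to_proper_primary_mapped_reads": True,
        f"{call_prefix}.keep_duplicates_when_filtering": True,
    }


# "my_workflow.align_and_count" is a placeholder prefix for illustration only.
inputs = recommended_inputs("my_workflow.align_and_count")
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```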

tomkinsc added 12 commits March 4, 2024 14:15
…in tasks_reports.wdl::align_and_count(); make this the default

add functionality to optionally filter reads to include only properly mapped pairs in tasks_reports.wdl::align_and_count(); make this the default for align_and_count by setting the task input filter_bam_to_proper_primary_mapped_reads=true.
…additional metrics

add keep_duplicates_when_filtering toggle to align_and_count task; also have this task output additional metrics for the percent of mapped reads aligning to hits that are not the top hit, and the percent of total input reads that mapped to any of the align_and_count ref seqs (i.e. how much crosstalk, and how much of the total sample, respectively)
…ign_and_count

require values for the various filtering-related Boolean inputs in align_and_count, since the default values guarantee they'll be set
@dpark01 dpark01 added this pull request to the merge queue Mar 11, 2024
Merged via the queue into master with commit 1ce64ab Mar 11, 2024
12 checks passed