expanded parameterization of align_and_count and additional output metrics #525
Summary
This PR adds functionality to optionally filter reads after mapping in the align_and_count task, so that the counts of mapped reads are comparable to those obtained after filtering during genome assembly. It also adds new numeric outputs relevant for general QC purposes.

New input parameters
The filtering has the following parameters:
- filter_bam_to_proper_primary_mapped_reads: enable filtering
  - false: no filtering is performed
- do_not_require_proper_mapped_pairs_when_filtering: do not exclude reads lacking the "proper pair" bit; setting this to true is helpful (and often necessary) when using single-end reads as input if filtering is enabled
  - false: reads are filtered to proper pairs if filtering is enabled
- keep_singletons_when_filtering: singleton reads from paired-end data are kept; this does not affect single-end reads
  - false: singleton reads are excluded during filtering
- keep_duplicates_when_filtering: reads marked as duplicates are kept; this does not supersede exclusion for violations of other criteria
  - false: duplicate reads are excluded during filtering

New output metrics
This PR also adds new numeric output metrics to align_and_count:
- pct_total_reads_mapped: the percent of input reads mapping to any of the input reference sequences
- pct_lesser_hits_of_mapped: of the reads mapping to reference sequences input to align_and_count, the percent mapping to hits that are not the top hit

The new outputs are exposed in several of the workflows that have singular outputs from align_and_count. A few other workflows call align_and_count, but output an aggregate report with info from multiple inputs.

Recommended usage
The following values are recommended for most use cases, to count high-quality read mappings with duplicates included.
filter_bam_to_proper_primary_mapped_reads=true
keep_duplicates_when_filtering=true
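
For illustration only, the sketch below shows one way the filtering flags and the two new metrics could fit together. It is not the code in this PR: the function names (passes_filter, summarize), the shortened keyword names, and the exact treatment of singletons are assumptions. Only the SAM flag bit values come from the SAM specification. Here require_proper_pair is the inverse of do_not_require_proper_mapped_pairs_when_filtering, and the sketch assumes that a kept singleton bypasses the proper-pair check.

```python
from collections import Counter

# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def passes_filter(flag, *, require_proper_pair=True,
                  keep_singletons=False, keep_duplicates=False):
    """Hypothetical predicate with assumed semantics (not the task's code)."""
    # Only primary, mapped alignments are ever kept.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # Keeping duplicates never overrides the other exclusion criteria.
    if (flag & DUPLICATE) and not keep_duplicates:
        return False
    # Assumption: a singleton is a paired-end read whose mate is unmapped.
    if (flag & PAIRED) and (flag & MATE_UNMAPPED):
        return keep_singletons
    # Single-end reads never carry the proper-pair bit, so they are dropped
    # here unless the proper-pair requirement is relaxed.
    if require_proper_pair and not (flag & PROPER_PAIR):
        return False
    return True


def summarize(alignments, total_input_reads, **filter_opts):
    """alignments: iterable of (reference_name, sam_flag) pairs."""
    counts = Counter(ref for ref, flag in alignments
                     if passes_filter(flag, **filter_opts))
    mapped = sum(counts.values())
    top_hit = max(counts.values(), default=0)
    # Percent of all input reads that map (and survive filtering).
    pct_total_reads_mapped = (100.0 * mapped / total_input_reads
                              if total_input_reads else 0.0)
    # Percent of mapped reads that hit anything other than the top hit.
    pct_lesser_hits_of_mapped = (100.0 * (mapped - top_hit) / mapped
                                 if mapped else 0.0)
    return counts, pct_total_reads_mapped, pct_lesser_hits_of_mapped


# Made-up alignments: flags 99 and 147 are a proper pair, 1123 is a
# duplicate, 73 is a singleton, 355 is a secondary alignment.
example = [("refA", 99), ("refA", 147), ("refA", 1123),
           ("refB", 99), ("refB", 73), ("refA", 355)]
counts, pct_mapped, pct_lesser = summarize(example, total_input_reads=8,
                                           keep_duplicates=True)
print(dict(counts), pct_mapped, pct_lesser)
```

With the recommended settings (filtering enabled, duplicates kept), this toy input keeps three reads on refA and one on refB, giving 50.0 for pct_total_reads_mapped and 25.0 for pct_lesser_hits_of_mapped. The singleton and the secondary alignment are excluded.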