This document describes the filtering and downsampling that may be applied to BAM files before bamQC metric computation.
Filtering removes undesirable reads. Downsampling selects a subset of reads from larger BAM files to enable faster processing. All downsampling operations, including checks of size thresholds, are applied after filtering. Read pairing is preserved; if read 1 of an aligned pair is downsampled, read 2 will be as well.
Picard MarkDuplicates has a separate downsampling method from other metrics. The two downsampling operations are implemented as separate WDL tasks, which can be efficiently parallelized by Cromwell.
Downsampling is not applied when using the mosdepth tool to compute depth of coverage.
We remove the following undesirable read types:
- Non-primary alignments: Important to avoid double-counting (or multi-counting) of reads. Defined as having samtools flag 256 (secondary alignment) or 2048 (supplementary alignment).
- Unmapped reads: Defined as having samtools flag 4 (segment unmapped).
- Low-quality alignments: Reads with a low alignment quality score (default minimum = 30).
Totals of non-primary, unmapped, and low-quality reads are recorded in workflow output. The excluded reads are not used for subsequent QC metric computation.
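As an illustration only (not the workflow's actual WDL task), the filter described above corresponds to a single samtools invocation that excludes the three flags and applies the quality cutoff. The file names and helper function below are hypothetical.

```python
import subprocess

# Illustrative sketch of the filtering step described above.
# samtools flags: 4 = unmapped, 256 = secondary, 2048 = supplementary.
EXCLUDE_FLAGS = 4 | 256 | 2048   # = 2308
MIN_MAPQ = 30                    # default minimum alignment quality

def filter_bam(in_bam: str, out_bam: str) -> None:
    """Remove unmapped, non-primary, and low-quality alignments."""
    cmd = [
        "samtools", "view", "-b",
        "-F", str(EXCLUDE_FLAGS),   # exclude reads with any of these flags set
        "-q", str(MIN_MAPQ),        # exclude alignments below the quality cutoff
        "-o", out_bam,
        in_bam,
    ]
    subprocess.run(cmd, check=True)

# Example (hypothetical file names): filter_bam("input.bam", "filtered.bam")
```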
For "slow" computationally intensive metrics other than MarkDuplicates, we downsample by choosing a subset of reads at random. "Fast" metrics do not require downsampling, and are always computed on the full-sized filtered input.
The distinction between "fast" and "slow" metrics is covered in the bam-qc-metrics documentation.
If a BAM file has more reads than a threshold T, downsample to a subset containing T reads.
The default value of T is 100000. For example:
- A BAM file with 17000000 reads will be downsampled to 100000.
- A BAM file with 100001 reads will be downsampled to 100000.
- A BAM file with 99999 reads will not be downsampled.
So, if the original number of reads is 100000 or greater, downsampling is guaranteed not to reduce it below 100000.
Downsampling thresholds are always evaluated on the BAM file after filtering.
We combine two ways of downsampling with samtools.
This is done using `samtools view -s $SEED.$PROB`, where `$SEED` is a random seed and `$PROB` is a decimal expressing the probability of retaining a read. For example, to sample reads with a probability of 0.05 and random seed 99, use `samtools view -s 99.05`.
This method is very fast. Because it is probabilistic in nature, it does not sample an exact number of reads. Given 1 million reads and a sampling parameter of 0.05, it will iterate over the reads one by one, with a 0.05 probability of keeping each read. This will retain approximately 50000 reads, but the final total could plausibly be 49993 or 50010.
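The `-s` argument packs the seed and the retention probability into one decimal: the integer part is the seed, and the digits after the decimal point give the probability. A minimal sketch of how such an argument might be assembled is shown below; the function name is hypothetical, and the `precision` default echoes the `precision` parameter described later.

```python
def subsample_arg(seed: int, probability: float, precision: int = 8) -> str:
    """Build the INT.FRAC argument for `samtools view -s`.

    The integer part is the random seed; the digits after the decimal
    point give the fraction of reads to keep.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    frac = f"{probability:.{precision}f}".split(".")[1]  # digits after the decimal point
    return f"{seed}.{frac}"

# Example from the text: seed 99, probability 0.05 -> "99.05000000",
# which samtools interprets the same way as "99.05".
print(subsample_arg(99, 0.05))
```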
We would like to sample an exact number of reads. The procedure recommended by the samtools developers is to use `samtools collate`, `awk`, and `samtools sort`. This is rather slow and inefficient, making it impractical for very large inputs.
Unlike random downsampling, this method cannot take a random seed as input; given an input BAM file and output size, it will always select the same subset of the input.
In order to combine the speed of random sampling with the precision of exact sampling, we implement downsampling as a two-stage process. For sufficiently large input, we first predownsample approximately, obtaining an intermediate set somewhat larger than the desired output. Then, we downsample exactly to get the final output set.
Predownsampling is applied if the original input is at least 2 times the size of the final output, and the final output size is above an absolute minimum of 10000 reads. (For very small datasets, random sampling might accidentally produce an intermediate set smaller than the desired final size.) The predownsampling target size is 1.5 times the size of final output.
Predownsampling and downsampling are implemented as a single streamed command, so the predownsampled intermediate data is never written to disk.
Let `N` be the size of the original input, and `T` be the target size for downsampled output.

- `N=25000, T=100000`: Because `N < T`, no downsampling of any kind takes place.
- `N=125000, T=100000`: Because `N < 2T`, no predownsampling takes place. The input is exactly downsampled to 100000 reads.
- `N=250000, T=100000`: Because `N > 2T` and `T > 10000`, we apply predownsampling. The input is randomly predownsampled to an intermediate set of approximately 150000 reads. Then, the intermediate set is exactly downsampled to the final set of 100000 reads.
- `N=2500, T=1000`: Although `N > 2T`, we have `T < 10000`; so predownsampling is omitted, and we exactly downsample to 1000 reads.
| Parameter | Description |
|---|---|
| `targetReads` | Desired number of reads in downsampled set. Most likely parameter to need adjustment; the others are details of implementation. |
| `minReadsAbsolute` | Absolute minimum number of reads to allow pre-downsampling. Defaults to 10000. |
| `minReadsRelative` | Minimum size of original dataset, relative to `targetReads`, for pre-downsampling to be carried out. Defaults to 2. |
| `precision` | Number of digits to retain in the fractional string for pre-downsampling. Defaults to 8. |
| `preDSMultiplier` | Approximate size of pre-downsampled set (if any), relative to `targetReads`. Defaults to 1.5. |
Random downsampling is not recommended for MarkDuplicates. This is because the number of duplicate pairs retained scales as the square of the randomly downsampled fraction. (Recall that read pairing is always preserved by downsampling; a "duplicate pair" here means two read pairs that are duplicates of each other.)
For example, suppose we have a BAM file with 250 million reads and a 10% duplicate rate, giving 25 million duplicates. We downsample to 1 million reads, retaining `1/250` of the original input. So the probability of downsampling both members of a duplicate pair is `(1/250)*(1/250) = 1/62500`, and the downsampled set has approximately 400 duplicate pairs. But because of the random nature of this process, the number sampled may be significantly larger or smaller. In empirical tests, naively scaling up from a sample of `1/62500` has given results at least 25% off the true value.
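The quadratic effect can be reproduced with a few lines of arithmetic, using the figures from the example above:

```python
# Numbers from the example above: 250 million reads, 10% duplicate rate,
# randomly downsampled to 1 million reads.
total_reads = 250_000_000
duplicates = 25_000_000                 # treated as duplicate pairs in the example
kept_fraction = 1_000_000 / total_reads             # 1/250

# Both read pairs of a duplicate pair must survive sampling independently.
pair_retention = kept_fraction ** 2                  # 1/62500
expected_pairs = duplicates * pair_retention         # ~400

print(kept_fraction, pair_retention, expected_pairs)  # 0.004 1.6e-05 400.0
```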
To obtain a more robust estimate, we use samtools to downsample reads aligned to a specific region of the human genome. Since both members of a duplicate should be aligned to the same locus, this method preserves duplicate pairs roughly in proportion to the fraction downsampled. In fact there will be some variation as duplicates are not evenly distributed across the genome, but the result is still more robust than random downsampling.
The number of reads downsampled by this method cannot be exactly predicted, as it depends on coverage of the chosen sub-region.
Downsampling is applied to specific chromosomes; and for very large files, to specific intervals within each chromosome. The chromosome and interval settings may be customised if needed.
| Parameter | Description |
|---|---|
| `threshold` | Minimum number of reads for downsampling. Defaults to 10 million. |
| `chromosomes` | Array of chromosome names for downsampling. Defaults to `["chr12", "chr13"]`. |
| `baseInterval` | Base width of interval for downsampling. Defaults to 15000. |
| `intervalStart` | Start of downsampling interval on each chromosome. Defaults to 100000. |
| `customRegions` | Custom downsampling regions. Format is a string input to samtools, e.g. `chr1:1-1000000 chr2:10001-20000`. Defaults to the empty string `""`. If set to a value other than `""`, this overrides the `chromosomes`, `baseInterval`, and `intervalStart` parameters. |
If the `customRegions` parameter is in effect, downsampling is very simple. If the number of reads is greater than `threshold`, reads are downsampled to the intervals specified in `customRegions`; otherwise, no downsampling is done.

The following only applies if `customRegions` is not in effect.
Let `R` be the number of reads and `T` be the minimum threshold for downsampling.

- If `R <= T`, no downsampling takes place.
- If `T < R <= 10T`, the entire sequence of the chromosomes specified in `chromosomes` is used for downsampling.
- If `10T < R <= 100T`, the same interval is used within each chromosome. The interval begins at `intervalStart + 1` and ends at `intervalStart + baseInterval*1000`.
- For larger values of `R`, the interval width is scaled down as shown in the table.
| Input reads (R) vs. threshold (T) | Width wrt base (B) | Downsampling |
|---|---|---|
| R <= T | None | No |
| T < R <= 10T | (Entire chromosome) | Yes |
| 10T < R <= 10²T | 10³B | Yes |
| 10²T < R <= 10³T | 10²B | Yes |
| 10³T < R <= 10⁴T | 10B | Yes |
| R > 10⁴T | B | Yes |
With the default interval values `T=10000000` and `B=15000`, and the default chromosomes, we have:
| Input reads (R) | DS chromosomes | Interval width |
|---|---|---|
| R <= 10⁷ | None | None |
| 10⁷ < R <= 10⁸ | chr12 & chr13 | (Entire chromosome) |
| 10⁸ < R <= 10⁹ | chr12 & chr13 | 1.5×10⁷ |
| 10⁹ < R <= 10¹⁰ | chr12 & chr13 | 1.5×10⁶ |
| 10¹⁰ < R <= 10¹¹ | chr12 & chr13 | 1.5×10⁵ |
| R > 10¹¹ | chr12 & chr13 | 1.5×10⁴ |
The default parameters were chosen so that, when downsampling a whole-genome BAM file, the downsampled set is roughly between 1 million and 10 million reads. Empirically, this size range has been large enough to give a reasonably accurate sample, but small enough for Picard MarkDuplicates to be tractable. (Note that the default chromosomes 12 and 13 together are approximately 8% of the human genome.)
By default we downsample reads aligned to chromosomes 12 and 13, because they appear in all targeted sequencing panels in use at OICR as of 2020-03-10. For an up-to-date list of panels, see the interval-files repository.
For a targeted sequencing library, we also need to consider what fraction of the target regions falls within the downsampling regions. Consider three example scenarios:
- Too few reads: None of the targeted regions fall within the downsampled chromosomes. The downsampled set contains no reads of interest, so analysis fails.
- About enough reads: Approximately 10% of the target regions fall within the downsampled chromosomes. Downsampling behaves roughly as on a whole-genome sample.
- Too many reads: All of the targeted regions fall within the downsampled chromosomes. The downsampled set is roughly 10 times larger than for a whole-genome sample. This may be acceptable; but could cause problems if the downsampled set is too large for Picard MarkDuplicates to be run efficiently.
We need to choose a suitable depth and range of downsampling for a targeted library. The depth is controlled by the `threshold` parameter; the range by either `customRegions` or the chromosome and interval parameters. For example, the "too many reads" scenario above can be addressed by reducing the threshold for downsampling.