-
-
Notifications
You must be signed in to change notification settings - Fork 69
Calling Duplex Consensus Reads
Calling consensus reads from duplex data is the process of taking reads that were generated in a way such as to allow post-hoc identification of which sequencing reads are derived from the paired strands of an original duplex or double-stranded source molecule of DNA. One such example is the process outlined by Kennedy et al which attaches UMIs to each end of a source molecule.
Mathematically the process is very similar to the one outlined for calling consensus reads from single-UMI data, though the mechanics are somewhat different.
The process outlined below is implemented in the CallDuplexConsensusReads
program in fgbio, which is run after first grouping reads with GroupReadsByUmi --strategy=paired
.
The high level process starts with a group of reads identified as originating from the same double-stranded source molecule. The two strands of the original molecule are labeled, arbitrarily, as A and B and each read is known to have originated from either the A strand or the B strand. The process proceeds through the following steps:
- Reads are split into four sub-groups:
- Strand
A
and read 1 (A1
s) - Strand
A
and read 2 (A2
s) - Strand
B
and read 1 (B1
s) - Strand
B
and read 2 (B2
s)
- Strand
- Reads are unmapped and, if necessary, reverted to sequencing order
- Quality trimming, if enabled
- Remaining low-quality bases are masked (i.e. converted to
N
s) - Reads are further trimmed to the length of the insert if the insert is shorter than the read length
- Reads are filtered based on their Cigar (alignment structure) to ensure reads are always in phase (in two groups:
A1
s andB2
s, andA2
s andB1
s). - Four single-strand consensus reads are generated, one each for
A1
s,A2
s,B1
s, andB2
s - Two duplex consensus reads are generated by combining the
A1
andB2
consensus reads, and theA2
andB1
consensus reads
Each step is further detailed below.
Reads are split into groups based on the strand of the original molecule (A
or B
) and whether they are sequencing read 1 or 2. The expectation is that read 1s derived from the A
strand will match read 2s derived from the B
strand and vice versa.
In order to generate consensus reads it is necessary to have all reads in the same orientation. To achieve this reads are reverted to sequencing order, even if they were aligned to the negative strand of the genome.
An optional step, reads can be end-trimmed to remove low quality bases at the end of sequencing. This is highly recommended as removing low-quality bases will reduce the number of disagreements in the consensus and reduce the number of no-calls (N
s) produced in the consensus reads. Trimming is performed using the same running-sum algorithm as implemented in bwa
.
A parameter to the consensus caller is a minimum input base quality. Bases below the minimum quality threshold are converted to N
s in the input reads to ensure that they are not used elsewhere in the process. If quality trimming it disabled, the read is truncated at this step to remove any contiguous N
s at the end of the read.
Reads longer than the insert length will necessarily read into the adapter sequence. For single-UMI data producing consensus calls of the adapter sequence is not a significant problem. However for duplex data, where we combine A1
and B2
reads, we generate consensus reads from raw reads that may read into different adapter sequences! Attempting to call consensus bases from these sequences will result in a high number of disagreements and many no-calls, which may cause consensus reads to be erroneously filtered out as low-quality at a later stage. Therefore reads are trimmed to their insert length prior to consensus calling.
Since we do not generate a multiple-alignment of all raw reads to each other, any length errors (insertions or deletions) in the raw reads may result in reads being out of alignment with one another, and erroneous consensus calls being made. Consider the following reads:
1: ACGTGACTGACTGACTAGCTTTTTTTAGACTAGCTACTACT
2: ACGTGACTGACTGACTAGCTTTTTTTAGACTAGCTACTACT
3: ACGTGACTGACTGACTAGCTTTTTTTTAGACTAGCTACTAC
Without correction the third read, with an additional T incorporated into it's poly-T, will generate many disagreements with the other two reads. Since all reads are derived from the same source molecule any such differences must be errors.
To handle this we create groups of reads who's alignments are compatible with one another, and use only the reads from the largest group for generating the consensus. For example, given the following raw reads post-trimming (padding with -
added for ease of reading):
1: ACATGGTACTTAGGTACGTACGGGG-TGATATG
2: ACATGGTACTTAGGTACGTACGGGG-TGATATG
3: ACATGGTACTTAGGTACGTACGGGG-TGATATG
4: ACATGGTACTTAGGTACGTACGGGGGTGATATG
5: ACATGGTACTTAGGTACGTACGGG--TGATATG
6: ACATGGTACTTAGGTACGTACG
three groups would be created: (1,2,3,6), (4,6) and (5,6), and the first group selected.
This process is performed independently on A1
+B2
and B1
+A2
reads.
The penultimate step is to generate four single-strand consensus reads, one for each group of reads: A1
, A2
, B1
, B2
. This is done by feeding the raw reads for each group through the vanilla consensus calling process as documented here.
The final step is generation of the R1 and R2 duplex consensus reads. This is done very simply on a base by base basis by merging the appropriate A
read with the appropriate B
read as follows:
- If the bases agree between
A
andB
reads, the resulting quality score is simplyQ(A) + Q(B)
- If the bases disagree between
A
andB
reads and the quality scores differ, the resulting base is the base with the higher quality score, and the resulting quality score isQ(higher) - Q(lower)
. - If the bases disagree and have identical quality the base is arbitrarily the base from
A
and the quality score is 2 (the minimum Phred quality score).