Calling Consensus Reads

Calling Consensus Reads - A Brief Overview

Reads with the same source molecule id are examined base-by-base to assess the likelihood of each base in the source molecule. The likelihood model consists of three major steps:

Adjusting the input base qualities
Computing the maximum posterior probability base quality
Adjusting the output consensus base quality

Glossary of Symbols

Symbol	Description
$Q$	The phred-scaled base quality for a single base. This is assumed to measure the error during sequencing.
	Subtract this base quality from the input base qualities (prior to capping)
	Cap the maximum base quality in the input (after shifting).
$Err_{pre}$	The Phred-scaled error rate for an error introduced before the UMIs are integrated. Such an error could be introduced by deamination of the base in the template DNA or oxidation of the base during library preparation.
$Err_{post}$	The Phred-scaled error rate for an error introduced after the UMIs have been integrated but before sequencing (it does not include the error during sequencing). Such an error could be introduced by target capture or amplification.
$B_i$	The base of the `i`th read at a given position across all the reads.

Adjusting the input base qualities.

First, the base qualities are adjusted. The base qualities are assumed to represent the probability of a sequencing error (i.e. the sequencer observed the wrong base on the cluster/flowcell/well). A fixed value is subtracted from the phred-scaled base qualities (e.g. Q30 with a shift of 10 becomes Q20). Next, the base qualities are capped at a maximum phred-scaled value. These adjustments should only be used if there is reason to believe the input base qualities are systematically over-estimated; otherwise this step should be skipped.

Next, the base qualities are converted to a probability (of error):

$P_{Q'} = 10^{(-Q'/10)}$
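The shift, cap, and conversion steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function names `adjust_quality` and `phred_to_prob` are hypothetical, and flooring the shifted quality at zero is an assumption the text does not state.

```python
def adjust_quality(q: int, shift: int, cap: int) -> int:
    """Shift a Phred-scaled quality down, then cap it (hypothetical helper)."""
    # Assumption: a shifted quality is not allowed to go below Q0.
    shifted = max(q - shift, 0)
    # The cap is applied after the shift, as described in the text.
    return min(shifted, cap)


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q' to an error probability 10^(-Q'/10)."""
    return 10.0 ** (-q / 10.0)


# Example from the text: Q30 with a shift of 10 becomes Q20.
adjusted = adjust_quality(30, shift=10, cap=40)
error_probability = phred_to_prob(adjusted)
```

Applying the shift before the cap matters: a Q45 base with a shift of 10 and a cap of 30 ends at Q30, whereas capping first would leave it at Q20.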

Next, a probability is calculated to encapsulate both the probability of a sequencing error and the probability of a base substitution between the time the source molecule ids were attached to the DNA and the time the base was sequenced. The resulting probability is the error rate of all processes from right after integrating the molecular tag through to the end of sequencing.

$P_{Q'}' = Err_{post}*(1-P_{Q'}) + (1-Err_{post})*P_{Q'} + (Err_{post} * P_{Q'} * 2/3)$

This latter formula computes the probability of seeing an error in the base sequence if there are two independent error processes. We sum three terms:

the probability of an error in trial one and no error in trial two: $Err_{post}*(1-P_{Q'})$
the probability of no error in trial one and an error in trial two: $(1-Err_{post})*P_{Q'}$
the probability of an error in both trials where the second trial does not reverse the error in the first one, which for DNA (4 bases) happens 2/3 of the time: $Pr(A=x\to y, B=y \to z) * Pr(x\neq z | x\neq y, y\neq z, x, y, z \in \left \{ A, C, G, T \right \})$
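The three-term sum can be checked numerically. The sketch below is illustrative only: `combine_error_probs` and `enumerate_error_prob` are hypothetical names, and the brute-force check assumes an error is equally likely to produce each of the three other bases.

```python
from itertools import product

BASES = "ACGT"


def combine_error_probs(p_first: float, p_second: float) -> float:
    """Three-term sum from the text for two independent error processes."""
    return (
        p_first * (1.0 - p_second)          # error in trial one only
        + (1.0 - p_first) * p_second        # error in trial two only
        + p_first * p_second * 2.0 / 3.0    # errors in both, not reversed
    )


def enumerate_error_prob(p_first: float, p_second: float, truth: str = "A") -> float:
    """Brute force: sum the probability of every path truth -> y -> z with z != truth."""

    def step(src: str, dst: str, p: float) -> float:
        # Assumption: an error yields each of the three other bases with p / 3.
        return 1.0 - p if src == dst else p / 3.0

    return sum(
        step(truth, y, p_first) * step(y, z, p_second)
        for y, z in product(BASES, repeat=2)
        if z != truth
    )


closed_form = combine_error_probs(0.01, 0.001)
brute_force = enumerate_error_prob(0.01, 0.001)
```

Under that uniform-substitution assumption the two values agree, and both simplify to `p_first + p_second - (4/3) * p_first * p_second`.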

Computing the maximum posterior probability base quality.

Second, a consensus sequence is called base-by-base for all reads with the same source molecule id. For a given base position in the reads, the likelihoods that an A, C, G, or T is the base in the underlying source molecule are computed by multiplying together the per-read likelihoods of the base observed at that position. When the observed base matches the hypothesized base for the underlying source molecule, one minus the probability of error (from the adjusted base quality $P'_{Q'}$) is used; when it does not match, one third of the probability of error is used, since an error is taken to be equally likely to produce any of the three other bases.

$L_{Call=B} = \prod_i \begin{cases} P_{Q',i}'/3 & \text{if } B \neq B_i \\ 1 - P_{Q',i}' & \text{if } B = B_i \end{cases}$

The computed likelihoods are normalized by dividing them by the sum of all four likelihoods to produce a posterior probability, namely the probability that the source molecule was an A, C, G, or T from just after integrating the molecular tag through to sequencing, given the observations. This applies Bayes' rule with a uniform prior over the four hypotheses.

$Post_{Call=B} = \frac{L_{Call=B}}{\sum_{C\in \left \{ A, C, G, T\right \}} L_{Call=C}}$

The base with the maximum posterior probability is used as the consensus call, and its raw base quality is derived from one minus that posterior probability.
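The likelihood and posterior computation for a single position can be sketched as follows. This is a minimal sketch under stated assumptions: `call_consensus_base` is a hypothetical name, each observation is assumed to be a `(base, adjusted_error_probability)` pair, and the products are computed directly rather than in log space, which a production implementation would likely prefer to avoid underflow with many reads.

```python
BASES = "ACGT"


def call_consensus_base(observations):
    """Call one consensus base from (observed_base, adjusted_error_prob) pairs.

    Returns (consensus_base, posterior_of_call, posteriors_for_all_bases).
    """
    likelihoods = {}
    for hypothesis in BASES:
        likelihood = 1.0
        for observed_base, p_err in observations:
            if observed_base == hypothesis:
                likelihood *= 1.0 - p_err   # read agrees with the hypothesis
            else:
                likelihood *= p_err / 3.0   # read disagrees: one of three wrong bases
        likelihoods[hypothesis] = likelihood

    # Uniform prior over A, C, G, T: the posterior is the normalized likelihood.
    total = sum(likelihoods.values())
    posteriors = {base: lik / total for base, lik in likelihoods.items()}

    call = max(posteriors, key=posteriors.get)
    return call, posteriors[call], posteriors


# Three reads from one source molecule: two observe A, one observes C.
call, posterior, all_posteriors = call_consensus_base(
    [("A", 0.01), ("A", 0.01), ("C", 0.01)]
)
```

With equal error probabilities the majority base wins; a single very high-quality dissenting read can outweigh several low-quality reads because each read contributes in proportion to its own error probability.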

Adjusting the output consensus base quality.

First, the raw consensus base quality (the posterior) is turned into an error probability: $Pr_{err} = 1 - Post_{Call}$

Next, the error probability is modified by incorporating the probability of an error prior to integrating the source molecule ids. Such an error could be introduced by deamination of the base in the template DNA or oxidation of the base during library preparation. These errors will be present in all reads observing the same source molecule and thus will be present in the consensus.

$Pr_{err}' = Err_{pre}*(1-Pr_{err}) + (1-Err_{pre})*Pr_{err} + (Err_{pre} * Pr_{err} * 2/3)$

This error probability is converted back to a phred-scaled quality:

$Q_{call} = -10 * \log_{10}(Pr_{err}')$
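The output adjustment can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that the Phred-scaled $Err_{pre}$ is first converted to a probability before it enters the three-term sum.

```python
import math


def phred_to_prob(q: float) -> float:
    """Phred-scaled value to probability: 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def combine_error_probs(p_first: float, p_second: float) -> float:
    """Three-term sum for two independent error processes."""
    return (
        p_first * (1.0 - p_second)
        + (1.0 - p_first) * p_second
        + p_first * p_second * 2.0 / 3.0
    )


def consensus_quality(posterior: float, err_pre_phred: float) -> float:
    """Phred-scaled consensus quality from the posterior of the called base."""
    pr_err = 1.0 - posterior
    # Assumption: Err_pre is supplied Phred-scaled and converted here.
    err_pre = phred_to_prob(err_pre_phred)
    pr_err_adjusted = combine_error_probs(err_pre, pr_err)
    # Note: log10 fails if the adjusted error is exactly 0 (not handled here).
    return -10.0 * math.log10(pr_err_adjusted)


# A posterior of exactly 1 still yields a finite quality, bounded by Err_pre.
ceiling = consensus_quality(1.0, err_pre_phred=45)
typical = consensus_quality(0.99, err_pre_phred=45)
```

One consequence visible in the sketch: however many reads agree, the consensus quality cannot exceed the value of $Err_{pre}$, because an error made before tagging is shared by every read of that molecule.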

Therefore, the probability used for the final consensus base quality is the posterior probability that the source molecule has the consensus base, given the observed reads with the same molecular tag. It accounts for errors all the way from sample extraction, through sample and library preparation and preparing the library for sequencing (e.g. amplification, target selection), to sequencing itself.

Finally, any consensus base with a quality less than the minimum is masked ('N').
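The masking step can be sketched as a simple filter. The name `mask_low_quality` and its parameters are hypothetical; the comparison is strict, matching "less than the minimum" in the text.

```python
from typing import Sequence


def mask_low_quality(bases: str, quals: Sequence[float], min_quality: float) -> str:
    """Replace every base whose quality is below min_quality with 'N'."""
    return "".join(
        "N" if qual < min_quality else base
        for base, qual in zip(bases, quals)
    )


masked = mask_low_quality("ACGT", [30, 5, 40, 10], min_quality=10)
```

A base whose quality equals the minimum is kept; only strictly lower qualities are masked.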

Caveats

The consensus model:

treats each end of a pair as independent, and does not jointly call bases that overlap within a pair.
does not consider indel errors in the reads.
