This script performs the phasing of resistance markers from polyclonal infections of Plasmodium falciparum utilizing only allele frequencies. As of now, 7 resistance markers across 4 mad4hatter amplicons from the dhfr and dhps genes are supported.
The following R packages must be installed: ggplot2, dplyr, gridExtra, and optparse.
Usage:
Rscript FapR.R -i [resmarker_table_global_max_0_filtered.csv] -o [output_prefix]
FapR makes the following assumptions:
- All alleles in the input file are true alleles. Appropriate filtering is strongly suggested.
- There are no copy number variants (CNV) on the amplicons.
FapR uses an iterative approach in which haplotypes are accepted based on:
- Probability of occurring in a sample: haplotypes built from highly abundant resmarkers are more likely to be true.
- Variance on the resmarker frequencies: haplotypes built from similarly abundant resmarkers are more likely to be true.
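The two criteria above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not FapR's R code. Several details are assumptions rather than documented behavior: the probability of a candidate haplotype is taken as the product of its resmarker frequencies, the spread is measured as the coefficient of variation of those frequencies, candidates are ranked by probability first with the coefficient of variation as a tie-breaker, and the accepted haplotype is assigned the smallest frequency among its resmarkers, which is then subtracted before the next iteration. The function names `best_haplotype` and `phase` and the locus and allele labels are hypothetical.

```python
from itertools import product
from math import prod
from statistics import mean, pstdev


def best_haplotype(freqs):
    """Pick the best candidate haplotype from the remaining allele frequencies.

    freqs maps locus -> {allele: remaining frequency in the sample}.
    Returns (alleles, assigned_frequency) or None if no candidate is left.
    """
    loci = sorted(freqs)
    best = None
    for alleles in product(*(freqs[locus] for locus in loci)):
        values = [freqs[locus][allele] for locus, allele in zip(loci, alleles)]
        probability = prod(values)  # assumption: product of allele frequencies
        if probability <= 0:
            continue  # zero-probability haplotypes are never accepted
        cv = pstdev(values) / mean(values)  # coefficient of variation
        key = (-probability, cv)  # assumption: probability first, then CV
        if best is None or key < best[0]:
            best = (key, alleles, min(values))  # assumption: assign the minimum
    return None if best is None else (best[1], best[2])


def phase(freqs, min_freq=1e-9, max_iter=20):
    """Iteratively accept haplotypes until no allele frequency remains."""
    remaining = {locus: dict(alleles) for locus, alleles in freqs.items()}
    loci = sorted(remaining)
    haplotypes = []
    for _ in range(max_iter):
        pick = best_haplotype(remaining)
        if pick is None:
            break
        alleles, freq = pick
        haplotypes.append((alleles, freq))
        # Remove the assigned frequency from every allele of the haplotype.
        for locus, allele in zip(loci, alleles):
            left = remaining[locus][allele] - freq
            remaining[locus][allele] = left if left > min_freq else 0.0
    return haplotypes


# Hypothetical two-locus sample with a major and a minor clone.
example = {
    "locus_A": {"x": 0.7, "y": 0.3},
    "locus_B": {"p": 0.7, "q": 0.3},
}
result = phase(example)
# result == [(("x", "p"), 0.7), (("y", "q"), 0.3)]
```

In this toy sample the similarly abundant alleles are paired, so the two haplotypes add up to 100% of the sample.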
Figure 1. FapR's phasing algorithm.
Figure 2. Example of the phasing process. The best haplotype on each iteration (highest probability and lowest coefficient of variation) is highlighted in green and its assigned frequency in pink. Haplotypes with zero probability are highlighted in orange. This particular sample resulted in 3 haplotypes that add up to 99%.
Phased haplotypes are flagged based on:
- Frequency in the sequencing run (assuming the run comes from a given population)
  - A single threshold derived from population frequency
  - Catches haplotypes that are frequent in the run but have low abundance in particular samples
  - This flag takes precedence over the following one
- Limit of detection of each amplicon (experimentally tested)
  - A threshold for each amplicon
  - Catches haplotypes that are rare in the run but moderately to highly abundant in particular samples
  - Allows partial haplotypes to be built
Figure 3. FapR's flagging algorithm. Currently, the population frequency threshold is the mean haplotype frequency.
Figure 4. Example of the flagging results. Checkmarks are accepted haplotypes: green = correctly phased and also frequent in the population/run; blue = correctly phased but rare in the population/run; purple = inconclusive phasing but frequent in the population/run. Orange crosses are haplotypes with inconclusive phasing and also rare (or absent) in the population/run, thus being inconclusive haplotypes.
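The flagging logic can be sketched as follows, again in Python for illustration only and not as FapR's R implementation. The population threshold is the mean haplotype frequency in the run, as stated for Figure 3, and the four outcomes use the colors of Figure 4. Two points are assumptions: a haplotype is treated as conclusively phased when its frequency in the sample reaches the limit of detection that applies to it, and that limit is passed in directly per haplotype through the hypothetical `lod_for` argument rather than derived from individual amplicons. The function name `flag_haplotypes` and all example values are hypothetical.

```python
from statistics import mean


def flag_haplotypes(sample_freq, run_freq, lod_for):
    """Flag the phased haplotypes of one sample.

    sample_freq: haplotype -> frequency within the sample
    run_freq:    haplotype -> frequency across the sequencing run
    lod_for:     haplotype -> limit of detection applied to it (assumption)
    Returns haplotype -> (accepted, color), using the colors of Figure 4.
    """
    # Single population threshold: the mean haplotype frequency in the run.
    population_threshold = mean(list(run_freq.values()))
    flags = {}
    for haplotype, freq in sample_freq.items():
        # The population flag is evaluated first because it takes precedence.
        frequent = run_freq.get(haplotype, 0.0) >= population_threshold
        # Assumption: at or above the limit of detection means conclusive phasing.
        conclusive = freq >= lod_for[haplotype]
        if frequent and conclusive:
            flags[haplotype] = (True, "green")
        elif frequent:
            flags[haplotype] = (True, "purple")
        elif conclusive:
            flags[haplotype] = (True, "blue")
        else:
            flags[haplotype] = (False, "orange")
    return flags


# Hypothetical run-level and sample-level frequencies.
run_freq = {"H1": 0.5, "H2": 0.4, "H3": 0.07, "H4": 0.03}
sample_freq = {"H1": 0.6, "H2": 0.02, "H3": 0.35, "H4": 0.03}
lod_for = {haplotype: 0.05 for haplotype in sample_freq}
flags = flag_haplotypes(sample_freq, run_freq, lod_for)
```

Here `H2` is accepted despite its low abundance in the sample because it is frequent in the run, while `H3` is accepted despite being rare in the run because it is well above the limit of detection in this sample.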