Identifying haplotypes within targeted amplicon sequencing datasets

For specific types of data it can be beneficial to use very short k-mers (k < 10). Targeted amplicon sequencing can be used to analyse haplotypes averaging only 160 bp in length. The aim of this analysis is to identify the species by comparing the query sequences to a reference panel. We therefore generate k-mers from the reconstructed haplotypes rather than from the reads directly. Because the haplotypes can be oriented by their primers, the strand of each sequence is known and we work with the full k-mer set rather than with canonical k-mers.
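The difference between the full k-mer set and canonical k-mers can be illustrated with a minimal Python sketch. The function names `kmers` and `canonical_kmers` are hypothetical and chosen for illustration; they are not taken from the analysis code.

```python
# Translation table for complementing A/C/G/T bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, k=8):
    """Return the set of all k-mers on the given strand of an oriented sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def canonical_kmers(seq, k=8):
    """Return strand-independent k-mers (shown for contrast only).

    Each k-mer is replaced by the lexicographically smaller of itself
    and its reverse complement, which discards strand information.
    """
    out = set()
    for kmer in kmers(seq, k):
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        out.add(min(kmer, reverse_complement))
    return out
```

With canonical k-mers, a k-mer and its reverse complement collapse into one element, which is useful for reads of unknown strand but removes information when the orientation is already fixed by the primers.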

The choice of k is a trade-off between tolerance to sequence variation and captured complexity. Because we work with reconstructed haplotypes rather than reads, k-mer coverage does not play a role in this trade-off. For large k there is little tolerance for variation between the query and the reference: for example, in a 149 bp sequence, 5 evenly spread SNPs result in no 25-mers matching the reference. For small k, on the other hand, there is a high chance that the same k-mer is found in multiple locations in the sequence: the chance that all 4-mers are unique in a sequence of the same length is vanishingly small (< 10^-22). Based on these trade-offs, we selected 8-mers as a reasonable length. With a mean target length of 160 bp, the chance that all 8-mers within a haplotype are unique is 84%.
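The figures quoted above can be reproduced with a short calculation. The sketch below assumes a birthday-problem model: bases are uniform and independent, and the overlapping k-mers of a sequence are treated as independent draws from the 4^k possible k-mers. Whether the original figures were derived in exactly this way is an assumption; the function names are hypothetical.

```python
def prob_all_kmers_unique(seq_len, k):
    """Probability that all k-mers in a random sequence are distinct.

    Assumption: birthday-problem approximation with uniform,
    independent k-mers drawn from 4**k possibilities.
    """
    n_kmers = seq_len - k + 1
    n_possible = 4 ** k
    p = 1.0
    for i in range(n_kmers):
        p *= 1.0 - i / n_possible
    return p


def longest_match_run(seq_len, n_snps):
    """Longest SNP-free stretch when SNPs are spread as evenly as possible."""
    matching_bases = seq_len - n_snps
    segments = n_snps + 1
    return -(-matching_bases // segments)  # ceiling division


# 149 bp with 5 evenly spread SNPs: the longest exact match is 24 bp,
# so no 25-mer can match the reference.
print(longest_match_run(149, 5))

# 4-mers in a 149 bp sequence: below 1e-22.
print(prob_all_kmers_unique(149, 4))

# 8-mers in a 160 bp sequence: about 0.84.
print(prob_all_kmers_unique(160, 8))
```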

To perform species assignment, we compute the k-mer distance from the query haplotype to each haplotype in the reference panel. The k-mer distance is based on the fraction of k-mers shared between query and reference: the larger the shared fraction, the smaller the distance. The Nearest Neighbour sequence is the reference haplotype that minimises the k-mer distance to the query haplotype. The species label is assigned by identifying the Nearest Neighbours for all amplicon targets of the query sample and aggregating their contributions to the assignment.
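The assignment procedure can be sketched as follows. The text does not give the exact distance formula or the aggregation rule, so two assumptions are made here for illustration only: the distance is taken as one minus the Jaccard similarity of the two k-mer sets, and the per-target Nearest Neighbours are aggregated by a simple majority vote. All function names and the data layout are hypothetical.

```python
from collections import Counter


def kmer_set(seq, k=8):
    """Full (non-canonical) k-mer set of an oriented haplotype."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(query, reference, k=8):
    """Assumed definition: 1 - Jaccard similarity of the two k-mer sets."""
    q, r = kmer_set(query, k), kmer_set(reference, k)
    union = q | r
    if not union:
        return 1.0  # both sequences shorter than k: treat as maximally distant
    return 1.0 - len(q & r) / len(union)


def nearest_neighbour(query, panel, k=8):
    """Return (species, distance) of the closest reference haplotype.

    panel is a list of (species, haplotype) pairs for one amplicon target.
    """
    scored = ((species, kmer_distance(query, hap, k)) for species, hap in panel)
    return min(scored, key=lambda pair: pair[1])


def assign_species(queries_by_target, panel_by_target, k=8):
    """Assumed aggregation: majority vote over the per-target Nearest Neighbours."""
    votes = Counter()
    for target, query in queries_by_target.items():
        species, _ = nearest_neighbour(query, panel_by_target[target], k)
        votes[species] += 1
    return votes.most_common(1)[0][0]
```

Other aggregation schemes, such as weighting each target by its Nearest Neighbour distance, would fit the same structure by changing only how `votes` is updated.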