Identifying haplotypes within targeted amplicon sequencing datasets
For some types of data it can be beneficial to use very short k-mers (k < 10). Targeted amplicon sequencing, for example, yields haplotypes averaging only 160 bp. The aim of this analysis is to identify the species by comparing the query sequences to a reference panel. We can therefore generate k-mers from the reconstructed haplotypes rather than from the reads directly. Because the haplotypes can be oriented by the primers, we work with the full k-mer set (each k-mer as it reads in the primer-defined orientation) rather than with canonical k-mers.
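To make the distinction concrete, here is a minimal Python sketch of the two k-mer sets. The function names `kmers` and `canonical_kmers` are illustrative, not taken from the tutorial code, and the sketch assumes haplotypes are plain strings over A, C, G and T.

```python
# Hypothetical helper names; assumes sequences contain only A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, k=8):
    """Full k-mer set: every k-mer as it reads in the given orientation."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def canonical_kmers(seq, k=8):
    """Canonical k-mer set: each k-mer is collapsed with its reverse
    complement, keeping the lexicographically smaller of the two."""
    collapsed = set()
    for kmer in kmers(seq, k):
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        collapsed.add(min(kmer, reverse_complement))
    return collapsed


if __name__ == "__main__":
    toy = "AAAACCTTTT"
    print(sorted(kmers(toy, 4)))
    print(sorted(canonical_kmers(toy, 4)))
```

Canonical k-mers discard strand information, which is needed for unoriented reads but only merges distinct k-mers when the orientation is already known.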
The trade-off in the choice of k is between tolerance to sequence variation and captured complexity. Because we work with reconstructed haplotypes rather than reads, k-mer coverage does not play a role in this trade-off. For large k there is little tolerance for variation between the query and the reference: for example, in a 149 bp sequence, 5 evenly spread SNPs result in no 25-mers matching the reference. For small k, on the other hand, there is a high chance that the same k-mer is found in multiple locations in the sequence: the chance that all 4-mers are unique in a sequence of the same length is vanishingly small (< 10^-22). Based on these trade-offs, we selected 8-mers as a reasonable length. With a mean target length of 160 bp, the chance that all 8-mers within a haplotype are unique is 84%.
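The numbers in this paragraph can be approximated with a short calculation. The sketch below is not the authors' derivation. It assumes a uniformly random sequence and treats the overlapping k-mers as independent draws (a birthday-problem approximation), and it counts a k-mer as matching only if its window covers no SNP. Function names are illustrative.

```python
import math


def p_all_kmers_unique(seq_len, k):
    """Approximate chance that all k-mers in a random sequence are distinct.

    Assumption: the seq_len - k + 1 k-mers are independent uniform draws
    from the 4**k possible k-mers (birthday-problem approximation).
    """
    n = seq_len - k + 1
    space = 4 ** k
    if n > space:
        return 0.0
    # Sum logs rather than multiplying, to stay accurate for tiny values.
    return math.exp(sum(math.log1p(-i / space) for i in range(n)))


def snp_free_kmers(seq_len, k, snp_positions):
    """Count k-mer windows that do not overlap any SNP (0-based positions)."""
    return sum(
        1
        for start in range(seq_len - k + 1)
        if not any(start <= pos < start + k for pos in snp_positions)
    )


if __name__ == "__main__":
    evenly_spread = [24, 49, 74, 99, 124]  # 5 SNPs, 24 bp between them
    print(snp_free_kmers(149, 25, evenly_spread))
    print(snp_free_kmers(149, 8, evenly_spread))
    print(p_all_kmers_unique(149, 4))
    print(p_all_kmers_unique(160, 8))
```

Under these assumptions the 8-mer probability should land close to the 84% quoted above, and the 4-mer probability below 10^-22.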
To perform species assignment, we compute the k-mer distance from the query haplotype to each haplotype in the reference panel. The k-mer distance quantifies the fraction of matching k-mers between query and reference. The Nearest Neighbour sequence is the reference haplotype that minimises the k-mer distance to the query haplotype. The species label is assigned by identifying the Nearest Neighbours for all amplicon targets of the query sample and aggregating their contributions to the assignment.
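The assignment procedure can be sketched as follows. The text does not give the exact distance formula or the aggregation rule, so both are assumptions here: the distance is taken as one minus the fraction of query k-mers found in the reference, and the per-target Nearest Neighbours are aggregated by a simple majority vote. All names and data layouts (`kmer_distance`, `nearest_neighbour`, `assign_species`, the panel dictionaries) are hypothetical.

```python
from collections import Counter


def kmers(seq, k=8):
    """Full (non-canonical) k-mer set of an oriented haplotype."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(query, reference, k=8):
    """ASSUMED definition: 1 - (fraction of query k-mers present in reference)."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 1.0
    shared = query_kmers & kmers(reference, k)
    return 1.0 - len(shared) / len(query_kmers)


def nearest_neighbour(query, panel, k=8):
    """Return (haplotype_id, species) of the closest reference haplotype.

    panel maps haplotype_id -> (species, sequence). On ties, the first
    haplotype in panel order wins.
    """
    hap_id, (species, _) = min(
        panel.items(), key=lambda item: kmer_distance(query, item[1][1], k)
    )
    return hap_id, species


def assign_species(sample, panels, k=8):
    """ASSUMED aggregation: majority vote over all amplicon targets.

    sample maps target -> query haplotype; panels maps target -> panel.
    """
    votes = Counter()
    for target, query in sample.items():
        _, species = nearest_neighbour(query, panels[target], k)
        votes[species] += 1
    return votes.most_common(1)[0][0]
```

A weighted vote, for instance one that down-weights Nearest Neighbours with a large distance, would be an equally plausible aggregation; the majority vote is used here only because it is the simplest to read.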
Let's use a mosquito (Anopheles sp.) amplicon sequencing dataset for species assignment using short k-mers.
👆 Go back to Table of Contents
👉 ⚒ Follow our tutorial to use k-mers to quantify species similarity.