Basics of genome modeling
Genome models are explicit estimates of the genomic features that we can intuitively observe in k-mer spectra ourselves. They usually consist of a fit of several assumed distributions (e.g. negative binomial) to the k-mer spectrum. The coefficients of this fit are then transformed into estimates using further assumptions that are specific to individual approaches. Genomic features include genome size, ploidy, heterozygosity, GC content, repetitiveness, and more. In this tutorial we will mostly focus on heterozygosity, genome size, and ploidy.
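To make the idea of "a fit of several assumed distributions" more concrete, here is a minimal sketch of a two-peak diploid model built from negative binomials. It only evaluates the model; it does not fit it. The function names, the `bias` overdispersion parameter, and the choice to centre the peaks at `lam` and `2 * lam` are assumptions made for illustration, not the implementation of any particular tool.

```python
import math


def neg_binom_pmf(x, mean, size):
    """Negative binomial probability of observing coverage x.

    Parameterised by the mean and a size (dispersion) parameter,
    so that the variance equals mean + mean**2 / size.
    """
    log_p = (
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(size / (size + mean))
        + x * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def diploid_spectrum(coverages, n_kmers, w_1n, lam, bias):
    """Expected number of distinct k-mers at each coverage.

    Hypothetical two-peak model (assumption for illustration):
      - a 1n peak centred at lam with weight w_1n
      - a 2n peak centred at 2 * lam with weight 1 - w_1n
    The size parameter is set to mean / bias, which keeps the
    variance-to-mean ratio equal to 1 + bias in both peaks.
    """
    expected = []
    for c in coverages:
        p_1n = neg_binom_pmf(c, lam, lam / bias)
        p_2n = neg_binom_pmf(c, 2 * lam, 2 * lam / bias)
        expected.append(n_kmers * (w_1n * p_1n + (1 - w_1n) * p_2n))
    return expected


# Example: 1 million distinct k-mers, 30% of them in the 1n peak,
# 1n coverage of 20x and mild overdispersion.
model = diploid_spectrum(range(0, 500), 1_000_000, 0.3, 20.0, 0.5)
```

In a real analysis, the weights, the coverage `lam`, and the dispersion would be estimated by fitting this curve to the observed k-mer spectrum, and the fitted coefficients would then be converted into genomic features such as heterozygosity.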
Assuming that heterozygous loci and duplications are independent and uniformly distributed across the genome, we can use the principles of combinatorics to calculate the expected fraction of 1n k-mers given levels of heterozygosity and duplication or, the other way around, estimate the two parameters given the relative sizes of the 1n and 4n peaks. Such a model is implemented in GenomeScope (Vurture et al.). It is important to note that the uniform distribution of heterozygosity is a key model assumption.
For a simple diploid case, the relative size of the 1n peak can be expressed in terms of heterozygosity as
1 - (1 - r)^k,
where r is the probability of observing a heterozygous nucleotide and k is the k-mer length. Note that the probability that a nucleotide is heterozygous is considered to be the same for every position in the genome, regardless of genetic context or composition. More complete models exist; for example, the coefficient can also capture potential heterozygous loci among duplicates. The principle remains the same: we always express the fraction of 1n k-mers as a function of heterozygosity and k-mer size, assuming a uniform distribution of heterozygous loci across the genome. See the supplement of Vurture et al. for a full explanation of these expressions.
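The simple diploid expression can be explored numerically. The sketch below evaluates 1 - (1 - r)^k, inverts it to recover r from an observed fraction, and checks it against a small simulation in which every position is heterozygous independently with probability r (the uniformity assumption stated above). All function names are hypothetical, and this covers only the simple expression from the text, not the fuller models in the GenomeScope supplement.

```python
import random


def fraction_1n(r, k):
    """Expression from the text: 1 - (1 - r)^k.

    (1 - r)^k is the probability that none of the k positions
    covered by a k-mer is heterozygous.
    """
    return 1.0 - (1.0 - r) ** k


def heterozygosity_from_fraction(f, k):
    """Invert the expression: r = 1 - (1 - f)^(1/k)."""
    return 1.0 - (1.0 - f) ** (1.0 / k)


def simulated_fraction(r, k, genome_length, seed=0):
    """Fraction of k-mer windows that overlap at least one heterozygous site.

    Each position is marked heterozygous independently with
    probability r, matching the uniform-distribution assumption.
    """
    rng = random.Random(seed)
    het = [rng.random() < r for _ in range(genome_length)]
    n_windows = genome_length - k + 1
    in_window = sum(het[:k])  # heterozygous sites in the first window
    hits = 1 if in_window > 0 else 0
    for start in range(1, n_windows):
        # Slide the window by one position: add the new site, drop the old one.
        in_window += het[start + k - 1] - het[start - 1]
        if in_window > 0:
            hits += 1
    return hits / n_windows


r, k = 0.01, 21
expected = fraction_1n(r, k)
observed = simulated_fraction(r, k, genome_length=200_000)
recovered = heterozygosity_from_fraction(expected, k)
print(f"expected={expected:.4f} simulated={observed:.4f} recovered r={recovered:.4f}")
```

Because the expression depends on k, the same heterozygosity yields a larger 1n fraction for longer k-mers. This is worth keeping in mind when comparing spectra computed with different k.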
👆 Go back to Table of Content
👉 ⚒ Now that you know the basics, try to manually fit a model: manual model fitting
👉 📖 Read about some common challenges that crop up when fitting a diploid genome: Common difficulties in characterisation of diploid genomes using k mer spectra analysis