Basics of genome modeling

Genome models turn the genomic features we can read off a k-mer spectrum by eye into explicit estimates. They usually consist of a fit of several assumed distributions (e.g. negative binomial) to the k-mer spectrum. The coefficients of this fit are then transformed into the estimates using further assumptions that are specific to individual approaches. Genomic features include genome size, ploidy, heterozygosity, GC content, repetitiveness, and more. In this tutorial we will mostly focus on heterozygosity, genome size, and ploidy.
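
As a rough illustration of what such a fit looks like, the sketch below writes a diploid k-mer spectrum as a mixture of two negative binomial peaks, one centred on the haploid coverage (1n) and one on twice that (2n). It is not GenomeScope's implementation: the function names, the parameters (`weight_1n`, `haploid_cov`, `size`, `total_kmers`), and the use of one shared dispersion for both peaks are simplifying assumptions made for this example only.

```python
import math


def nbinom_pmf(x, mean, size):
    """Negative binomial probability of observing coverage x.

    Parameterised by the mean and a dispersion parameter `size`
    (variance = mean + mean**2 / size). Computed in log space so that
    large coverages do not overflow.
    """
    p = size / (size + mean)
    log_pmf = (
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(p)
        + x * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def two_peak_model(coverage, weight_1n, haploid_cov, size, total_kmers):
    """Expected number of distinct k-mers seen `coverage` times.

    Hypothetical two-component mixture: a 1n peak at haploid_cov and a
    2n peak at 2 * haploid_cov, weighted by weight_1n and 1 - weight_1n.
    """
    peak_1n = weight_1n * nbinom_pmf(coverage, haploid_cov, size)
    peak_2n = (1.0 - weight_1n) * nbinom_pmf(coverage, 2.0 * haploid_cov, size)
    return total_kmers * (peak_1n + peak_2n)


if __name__ == "__main__":
    # Print the modelled spectrum at a few coverages for made-up parameters.
    for cov in (10, 20, 30, 40, 50):
        print(cov, round(two_peak_model(cov, 0.3, 20.0, 10.0, 1_000_000)))
```

A real fit would adjust these parameters until the modelled curve matches the observed spectrum (for example by non-linear least squares), and the fitted coefficients are what get converted into genome size and heterozygosity estimates.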

Assuming that heterozygous loci and duplications are independent and uniformly distributed across the genome, we can use the principles of combinatorics to calculate the expected fraction of 1n k-mers given levels of heterozygosity and duplications, or, the other way around, estimate the two parameters given the relative sizes of the 1n and 2n peaks. Such a model is implemented in GenomeScope (Vurture et al.). It is important to note that the uniform distribution of heterozygosity is a key model assumption.

For a simple diploid case, the relative size of the 1n peak can be expressed in terms of heterozygosity as

1 - (1 - r)^k,

where r is the probability that a nucleotide is heterozygous and k is the k-mer length. Note that this probability is considered to be the same for each position in the genome, regardless of the genetic context or composition. More complete models exist; for example, the coefficient can also capture potential heterozygous loci among duplicates. The principle remains the same, however: we always express the fraction of 1n k-mers as a function of heterozygosity and k-mer size, assuming a uniform distribution of heterozygous loci across the genome. See the supplement of (Vurture et al.) for a full explanation of these expressions.
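
To make the diploid expression concrete, here is a minimal sketch that evaluates 1 - (1 - r)^k and inverts it to recover r from an observed 1n fraction. The function names are made up for this example, and the code covers only the simple diploid case above, without duplications.

```python
def expected_1n_fraction(r, k):
    """Fraction of k-mers expected in the 1n peak: 1 - (1 - r)^k.

    A k-mer is heterozygous if at least one of its k positions is
    heterozygous; positions are assumed independent with probability r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 - (1.0 - r) ** k


def heterozygosity_from_1n_fraction(fraction, k):
    """Invert the expression above: r = 1 - (1 - fraction)^(1/k)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 - (1.0 - fraction) ** (1.0 / k)


if __name__ == "__main__":
    k = 21
    for r in (0.001, 0.01, 0.02):
        fraction = expected_1n_fraction(r, k)
        recovered = heterozygosity_from_1n_fraction(fraction, k)
        print(f"r={r}: 1n fraction={fraction:.4f}, recovered r={recovered:.4f}")
```

Because each k-mer spans k positions, even a modest per-nucleotide heterozygosity puts a large share of k-mers into the 1n peak: with r = 0.01 and k = 21, roughly 19% of k-mers are expected to be heterozygous.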