
Common difficulties in characterisation of diploid genomes using k-mer spectra analysis

Lucía Campos edited this page Mar 29, 2024 · 3 revisions

The quality of a genome model fit depends largely on the quality of the data, but also on the biological features of the genome. The most common problem is that the monoploid (1n) k-mer coverage converges on a “wrong genomic peak”. This usually happens when the 1n coverage peak is not distinct, which can be caused by extremely low heterozygosity of the genome (i.e. the 1n signal is very weak), contamination of the data with other samples, or very low coverage, in which case the 1n peak largely overlaps with the error peak.
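Whether a spectrum has a distinct genomic peak at all can be checked roughly before any model is fitted. The snippet below is a minimal sketch, not part of GenomeScope: the function names and the toy histogram are made up for illustration, and `counts[c]` is assumed to hold the number of distinct k-mers seen at coverage `c`. It looks for the first valley after the error peak and then reports local maxima to the right of it.

```python
def first_valley(counts):
    """Index of the first local minimum, i.e. where the error peak stops falling."""
    for i in range(1, len(counts) - 1):
        if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]:
            return i
    return None


def genomic_peaks(counts):
    """Coverages that are local maxima to the right of the first valley."""
    valley = first_valley(counts)
    if valley is None:
        # The spectrum only decreases: no genomic peak is separated from the errors.
        return []
    return [
        i
        for i in range(valley + 1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]
    ]


# Toy histogram (hypothetical numbers): error peak at 1, small peak at 6, large peak at 12.
toy = [0, 1000, 300, 50, 20, 40, 80, 60, 30, 50, 120, 180, 200, 150, 60, 10]
peaks = genomic_peaks(toy)

# Toy low-coverage case: the genomic signal is buried in the error peak.
buried = genomic_peaks([0, 1000, 400, 200, 100, 50, 20])
```

On real data the histogram is noisy, so some smoothing would probably be needed before a simple neighbour comparison like this is reliable. An empty result, or a single peak where two are expected, is a hint that the model may converge on the wrong peak.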

When the 1n coverage is fitted incorrectly, none of the estimated values carries any biological information about the genome. It is therefore important to inspect the fits visually and make sure the estimates make sense in the context of the known biology. For example, if we sequence a diploid selfing plant and the estimated heterozygosity is >5%, it is extremely likely that the model has placed the 1n peak on the homozygous (2n) peak and that the true 1n coverage is ~½ of the estimated one. In GenomeScope we can add the flag “-l” followed by the expected 1n coverage, which sets a coverage prior and usually allows GenomeScope to converge to a biologically more relevant model.
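The sanity check described above can be written down in a few lines. This is only a sketch under stated assumptions: `coverage_prior` is a hypothetical helper, heterozygosity is given as a fraction (0.05 for 5%), and the plausibility threshold has to come from what is known about the organism, not from the data.

```python
def coverage_prior(estimated_kcov, estimated_het, max_plausible_het):
    """Suggest a coverage prior for a refit, or None if the fit looks plausible.

    If the estimated heterozygosity is implausibly high, the fitted 1n peak is
    probably the homozygous peak, so the true 1n coverage is about half of it.
    """
    if estimated_het > max_plausible_het:
        # Rounded to a whole number and never below 1.
        return max(1, round(estimated_kcov / 2))
    return None


# A selfing diploid with a fit reporting 1n coverage 50 and 6% heterozygosity.
suspicious = coverage_prior(50.0, 0.06, max_plausible_het=0.01)

# The same coverage with 0.2% heterozygosity raises no flag.
plausible = coverage_prior(50.0, 0.002, max_plausible_het=0.01)
```

The returned value is what would be supplied as the coverage prior when rerunning the model. The refitted model still needs a visual check, since halving is only the most likely correction, not a guaranteed one.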

What's next

In the next steps we will go through some real examples to better explain the importance of a good model fit.

👉 ⚒ Let's start with identifying a low-sequencing coverage (depth) dataset here.

👆 Go back to Table of Content

Table of content

Introduction

k-mer spectra analysis

Separation of chromosomes

Species assignment using short k-mers

Others
