Highly heterozygous diploid

Heterozygous diploid samples are usually straightforward to model. They are less likely to require the manual intervention that we sometimes see with highly inbred samples, because there are usually two well-defined peaks that genomescope can identify correctly. In contrast, as we saw in the previous example, when there is no heterozygous peak genomescope may incorrectly identify the homozygous peak as heterozygous.

The European small ermine moth is a great example of this. Here, the spectrum was made with k=31 from PacBio HiFi data. We actually have two specimens of this species, so we can compare the two spectra.

Specimen 1:

Specimen 2:

In both cases the estimated heterozygosity is over 1%.
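To get a feel for why a heterozygosity of about 1% already produces a clear heterozygous peak, here is a minimal sketch. It assumes heterozygous sites are spread independently and uniformly along the genome, which is a simplification; the function name `het_kmer_fraction` is made up for this illustration and is not part of genomescope.

```python
def het_kmer_fraction(het_rate: float, k: int) -> float:
    """Fraction of k-mers expected to overlap at least one heterozygous site.

    Assumption (simplified model): each of the k positions is heterozygous
    independently with probability het_rate.
    """
    return 1.0 - (1.0 - het_rate) ** k


# Values taken from the text: heterozygosity of about 1%, k = 31.
fraction = het_kmer_fraction(0.01, 31)
print(f"{fraction:.1%} of k-mers overlap a heterozygous site")
```

Under this assumption, roughly a quarter of all 31-mers overlap a heterozygous site, which is why the heterozygous peak is so prominent in both spectra.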

Checking the smudgeplots, we see that this is likely a diploid species.

1. Since we have two specimens from the same species, can we merge the data together to get more coverage and thus potentially a better model?

No. Or at least, it isn't likely that this would improve the model. It is much more likely to produce a histogram that is hard, if not impossible, to interpret. This is because the two individuals here have different coverages and different levels of heterozygosity and error. As a fun at-home exercise, you can try merging the reads and then re-running kmc and genomescope.
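The coverage part of this argument can be sketched with a toy calculation. A k-mer present in `a` copies in one diploid individual and `b` copies in the other (each 0, 1 or 2) is expected at a coverage of `a * cov1 + b * cov2` in the pooled data. The haploid coverages used below (20x and 13x) are hypothetical values chosen for illustration, not the coverages of the two moth specimens, and the sketch ignores sequencing errors and repeats.

```python
from itertools import product


def expected_peaks(haploid_coverages):
    """Distinct expected k-mer coverage peaks for pooled diploid individuals.

    haploid_coverages: one per-haplotype coverage value per individual
    (hypothetical inputs for this illustration).
    """
    peaks = set()
    # Each individual carries a given k-mer in 0, 1 or 2 copies.
    for copies in product(range(3), repeat=len(haploid_coverages)):
        if any(copies):  # skip k-mers absent from every individual
            peaks.add(sum(c * cov for c, cov in zip(copies, haploid_coverages)))
    return sorted(peaks)


single = expected_peaks([20])      # one individual: het and hom peak
merged = expected_peaks([20, 13])  # two individuals pooled together

print("single:", single)
print("merged:", merged)
```

With one individual there are two peaks (20 and 40), which is what a diploid model expects. Pooling the two hypothetical individuals gives eight overlapping peaks, none of which correspond cleanly to "heterozygous" and "homozygous", so a diploid fit is unlikely to be meaningful.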

What's next

We hope these examples provide a useful overview of what you can find when fitting genome models, depending on the genome characteristics of your organism of interest. However, there is still a lot to explore, for example genome repetitiveness and ploidy levels.

👆 Go back to Table of Contents

👉 ⚒ Let's try to figure out the genome size of a repetitive genome.

👉 📖 Read about characterization of polyploid genomes using k-mer spectra analysis.