2021_Introduction to K mer spectra analysis

Most sequenced genomes are Pandora's boxes: completely undescribed genomes. While cytological techniques and flow cytometry are the best way to gain general insights into a genome and its structure, they are hard to scale and require additional expertise. K-mer spectra analysis is an alternative way to infer basic genomic properties directly from sequencing data. It provides an elegant way to estimate heterozygosity, genome size and the repetitive fraction prior to genome assembly. Furthermore, k-mer spectra analysis can also be used as a reliable QC of sequencing libraries.

In this module, we will first understand the logic behind decomposing reads into k-mers and then explore the basic properties of k-mer spectra on a variety of genomes.
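To make the idea of decomposing reads into k-mers concrete, here is a minimal shell sketch on two made-up reads. It is an illustration only: the function names kmers, count_kmers and spectrum are hypothetical, and the sketch counts forward-strand k-mers, whereas dedicated counters such as KMC count canonical k-mers by default (a k-mer and its reverse complement are treated as one).

```shell
# Print every k-mer of a sequence, one per line.
# $1 = sequence, $2 = k
kmers() {
    printf '%s\n' "$1" |
        awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# Count how often each k-mer occurs across a set of reads.
# $1 = k, remaining arguments = reads; output: k-mer <tab> count
count_kmers() {
    k=$1
    shift
    for r in "$@"; do
        kmers "$r" "$k"
    done | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Turn k-mer counts into a spectrum.
# Output: coverage <tab> number of distinct k-mers seen that many times
spectrum() {
    cut -f2 | sort -n | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Two toy reads (made up for illustration), k = 3
count_kmers 3 ACGTAC CGTACC | spectrum
```

On these toy reads, two k-mers are seen once and three are seen twice. A real k-mer histogram has the same two-column layout, only with millions of k-mers behind each row.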

Video on YouTube: Introduction to K mer spectra analysis

Tools

KMC: a fast and versatile k-mer counter
GenomeScope - a kmer spectra analysis tool suited for polyploids as well
R

All the software required in this module can be installed via conda; we will call the environment kmer_tools (check the installation instructions).

1. Generating k-mer spectra

There are plenty of k-mer counters out there. You can pick your favourite to get your histograms (check the list of k-mer counters in other k-mer resources wikipage)... Here we will use KMC, a very well-established k-mer counter that is both very fast and quite versatile.

KMC

When working with KMC, we first need to create a k-mer database for each read set, for example like this:

ls *.fastq.gz > FILES
# kmer 21, 16 threads, 64G of memory, counting kmer coverages between 1 and 100000x
kmc -k21 -t16 -m64 -ci1 -cs100000 @FILES <data_base_name> <directory_to_write_temporary_files>

The important parts are the k-mer length (-k) and the range of k-mer coverages that are counted (-ci and -cs). If the input file name starts with @ (like @FILES in the example), KMC reads the text file FILES, where it expects to find a list of read files to use for building the database. Finally, KMC generates quite a few temporary files, so on clusters we recommend setting <directory_to_write_temporary_files> to a directory on a local disk (in our cluster environment you can specify $SCRATCH, which is a directory that is created for each job). A working example of a slurm script for building the database is located at /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/build_KMC_db.sh.

Then we can query the database for various things, such as getting a k-mer histogram. That would be

kmc_tools transform <data_base_name> histogram <name_of_my_histogram_k21.hist> -cx100000

A working example of a slurm script for extracting the histogram is located at /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/get_histogram.sh.

Now, let's try to generate the histogram for a range of k values for one of the two E. coli datasets (SRR1770413, SRR857279) that we downloaded for you to /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/ecoli. Both KMC and GenomeScope are installed in the conda environment /cluster/projects/nn9458k/oh_know/.conda/kmer_tools (installation instructions). To access the environment, execute

module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools

Remember to include these lines in your scripts for the cluster too.

Solution

Here is an example using the build_KMC_db.sh script we prepared for you:

cd $USERWORK

cp /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/*sh .
cat ./build_KMC_db.sh

ls /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/ecoli/SRR1770413_* > files
mkdir kmer_dbs

sbatch ./build_KMC_db.sh files kmer_dbs/ecoli_SRR1770413

This will generate the files kmer_dbs/ecoli_SRR1770413.kmc_pre and kmer_dbs/ecoli_SRR1770413.kmc_suf, which together form a KMC database.
Assuming you followed the solution above, you can use the newly generated database to extract a k-mer histogram as follows:

cat get_histogram.sh
sbatch ./get_histogram.sh kmer_dbs/ecoli_SRR1770413 ecoli_SRR1770413_k21.hist
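To repeat the same two steps for several values of k, one option is to generate the commands in a loop. The sketch below only prints the kmc and kmc_tools calls shown earlier, with k substituted, so that you can inspect them before running them or pasting them into a slurm script. The function name print_kmc_commands, the tmp directory and the output naming scheme are assumptions made for this illustration; FILES is the list of read files described above.

```shell
# Print the KMC commands that build a database and extract a histogram for each k.
# $1 = sample name, remaining arguments = k values
# Assumptions: FILES lists the read files; kmer_dbs/ and tmp/ already exist
# (tmp is a hypothetical directory for temporary files, e.g. $SCRATCH on the cluster).
print_kmc_commands() {
    sample=$1
    shift
    for k in "$@"; do
        db="kmer_dbs/${sample}_k${k}"
        echo "kmc -k${k} -t16 -m64 -ci1 -cs100000 @FILES ${db} tmp"
        echo "kmc_tools transform ${db} histogram ${sample}_k${k}.hist -cx100000"
    done
}

# Dry run: only prints the commands for four values of k
print_kmc_commands ecoli_SRR1770413 17 21 27 31
```

Once the printed commands look right, they can be executed by piping the output to sh, or copied into a job script after the conda environment is activated.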

Plotting various k

You can use a different k-mer counter than KMC if you feel comfortable figuring it out on your own. You can check whether it is available on the cluster with module spider <name_of_a_counter>.

Now, generate a k-mer histogram for a range of k values! Then use your favourite plotting method to look at how the spectrum changes with different k. Here is an example of how it can be plotted in R, but don't let us restrain your creativity:

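hist_table <- read.table('ecoli_SRR1770413_k21.hist', col.names = c('coverage', 'count'))
count <- as.numeric(hist_table$count)  # avoid integer overflow in the sums below
# scale the y axis to the genomic peaks, ignoring the error k-mers at coverage < 20
ylim <- c(0, max(count[hist_table$coverage >= 20]))
# show coverages up to the point that contains 99.8% of all distinct k-mers
xmax <- hist_table$coverage[which(cumsum(count) >= 0.998 * sum(count))[1]]

plot(hist_table$coverage, count, type = 'h', xlim = c(0, xmax), ylim = ylim,
     xlab = 'k-mer coverage', ylab = 'number of distinct k-mers')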

How did the sequencing go? What is the sequencing coverage? Was the sequenced E. coli population genetically uniform?
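If you want a rough number to compare with your plot, the k-mer coverage peak and a back-of-the-envelope genome size can be read off the histogram directly. The sketch below assumes a two-column histogram (coverage, number of distinct k-mers) of a haploid genome with a single main peak, and treats everything below a cutoff as sequencing errors. The function name estimate_from_hist and the default cutoff of 5 are assumptions for this illustration; this is not the model that GenomeScope fits.

```shell
# Rough estimate of k-mer coverage and genome size from a k-mer histogram.
# $1 = histogram file (coverage, count), $2 = lowest coverage treated as genomic (default 5)
estimate_from_hist() {
    awk -v min_cov="${2:-5}" '
        $1 >= min_cov {
            total += $1 * $2                            # total number of genomic k-mers
            if ($2 > best) { best = $2; peak = $1 }     # coverage with the most distinct k-mers
        }
        END {
            if (peak > 0)
                printf "peak_coverage\t%d\ngenome_size\t%.0f\n", peak, total / peak
        }' "$1"
}

# Toy histogram (made-up numbers): an error peak at coverage 1-2 and a genomic peak at 10
toy=$(mktemp)
printf '%s\n' '1 500' '2 50' '8 10' '9 30' '10 50' '11 30' '12 10' '20 2' > "$toy"
estimate_from_hist "$toy"
rm -f "$toy"
```

The genome size is the total number of genomic k-mers divided by the peak k-mer coverage. Note that k-mer coverage is somewhat lower than base coverage, because each read of length L contributes only L - k + 1 k-mers.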

If you are done exploring E. coli, you can also check the lambda phage datasets in /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/lambda. Fair warning: the sequencing is a lot messier.

2. Fitting the first k-mer model

There are a bunch of programs for fitting genome parameters from k-mer spectra. Within this workshop, you will also hear about Tetmer in Hannes' talk, but the main tool for fitting models will be GenomeScope, which was first introduced by Vurture et al. 2017 (https://doi.org/10.1093/bioinformatics/btx153) and later improved and extended to polyploids by Ranallo-Benavidez et al. 2020 as GenomeScope 2.0, which is the version used in this tutorial.

GenomeScope fits a mixture of negative binomial distributions, spaced at fixed distances from each other, to the k-mer spectrum. It assumes that the whole genome has the same ploidy and that heterozygosity is distributed uniformly across the genome. There is an error term, but it only covers sequencing errors; it cannot recognize contamination. The easiest way to learn about GenomeScope is to fit a few dozen spectra and get a feeling for it.

Before doing that, you should start an interactive bash session on the cluster

srun --ntasks=1 --mem-per-cpu=5G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

(See the bash tutorial for more details.)
Then load the conda environment:

module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools

The basic GenomeScope command is

genomescope.R -p <ploidy> -i <hist> -o <output_directory> -n <name_prefix>
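When you have histograms for several k, it helps to build the GenomeScope calls in a loop. The sketch below only prints the commands. It assumes histogram files named <sample>_k<k>.hist, as in the examples above, and additionally passes the k-mer length with -k, since as far as we know GenomeScope 2.0 assumes k = 21 unless told otherwise. The function name genomescope_command and the output directory naming are hypothetical.

```shell
# Print a GenomeScope command for one histogram.
# $1 = histogram file named <sample>_k<k>.hist, $2 = ploidy (default 2)
genomescope_command() {
    hist=$1
    ploidy=${2:-2}
    name=$(basename "$hist" .hist)   # e.g. ecoli_SRR1770413_k21
    k=${name##*_k}                   # k-mer length parsed from the file name
    echo "genomescope.R -p ${ploidy} -k ${k} -i ${hist} -o genomescope_${name} -n ${name}"
}

# Dry run for two E. coli histograms, fitted as haploid
for hist in ecoli_SRR1770413_k17.hist ecoli_SRR1770413_k21.hist; do
    genomescope_command "$hist" 1
done
```

Writing each fit to its own output directory keeps the plots and summaries of different k values from overwriting each other.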

Try to run the model on the E. coli / lambda phage histograms you generated. What do you think about the error rates of the individual libraries? Do the estimated genome sizes make sense? If not, why do you think that might be the case?

3. Fitting more GenomeScope models

Alright, now that you have got the hang of it, try to fit a few more models. In /cluster/projects/nn9458k/oh_know/teachers/kamil/data/data_for_genomescope there are 8 directories full of k-mer histograms. They have various k values and sometimes there are multiple versions (and it's on you to figure out why they differ). You will be distributed into 8 breakout rooms. Collaborate within each group to create models for all of the provided histograms, but pay the most attention to the histogram of your group (the one that has the same number as your breakout room). You will then share with the others what you think the right model is and what you think about the data.

begonia
bombina
cape_honey_bees
crayfish
mercurialis
springtails
stick_insects
strawberry

Hints:

not every species is diploid; the ploidy is specified by the parameter -p
some of the models won't converge to the correct monoploid coverage on their own; you can give GenomeScope a prior on the monoploid coverage with the parameter -l
not all samples have the same ploidy for all chromosomes
check the high coverage tails of histograms directly in the .hist files. What is the most covered k-mer and why?
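For the last hint, one quick way to look at the high-coverage tail is to print the last non-empty rows of a .hist file. This is a minimal sketch with made-up numbers; the function name hist_tail is hypothetical. When interpreting the final row, keep in mind the counter cap (-cs100000) that was used when the databases were built.

```shell
# Print the highest-coverage rows of a k-mer histogram that contain at least one k-mer.
# $1 = histogram file (coverage, count), $2 = number of rows to print (default 5)
hist_tail() {
    awk '$2 > 0' "$1" | sort -k1,1n | tail -n "${2:-5}"
}

# Toy histogram (made-up numbers)
toy=$(mktemp)
printf '%s\n' '1 500' '2 0' '30 80' '5000 3' '100000 12' > "$toy"
hist_tail "$toy" 2
rm -f "$toy"
```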