2021_Introduction to K mer spectra analysis
Most sequenced genomes are Pandora's boxes: completely undescribed genomes. While cytological techniques and flow cytometry are the best way to gain general insights about a genome and its structure, they are hard to scale and require additional expertise. K-mer spectra analysis is an alternative way to infer basic genomic properties directly from sequencing data. It provides an elegant way to estimate heterozygosity, genome size and the repetitive fraction prior to genome assembly. Furthermore, k-mer spectra analysis can also be used as a reliable QC of sequencing libraries.
In this module, we will first understand the logic behind decomposing reads into k-mers and explore the basic properties of k-mer spectra on a variety of genomes.
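To make the decomposition concrete, here is a minimal shell sketch (not how KMC works internally) that splits one sequence into its k-mers and tabulates how many distinct k-mers occur once, twice, and so on. The function names kmers and spectrum are made up for this illustration, and unlike KMC's default behaviour the sketch does not merge a k-mer with its reverse complement.

```shell
# Print every k-mer of a sequence, one per line.
# $1 = sequence, $2 = k
kmers() {
    printf '%s\n' "$1" |
        awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# Turn a stream of k-mers into a k-mer spectrum:
# column 1 = coverage, column 2 = number of distinct k-mers with that coverage.
spectrum() {
    sort | uniq -c |
        awk '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' |
        sort -n
}

kmers ATGATGA 3 | spectrum
```

For the toy sequence ATGATGA and k = 3 this prints two rows: one k-mer seen once (GAT) and two k-mers seen twice (ATG and TGA). A histogram produced by a k-mer counter has the same two-column layout, only computed over all reads.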
- Video on YouTube: Introduction to K mer spectra analysis
- KMC: a faster and versatile k-mer counter
- GenomeScope - a kmer spectra analysis tool suited for polyploids as well
- R
All the required software in this module can be installed via conda (check the installation instructions); we will call the environment kmer_tools.
There are plenty of k-mer counters out there. You can pick your favourite to get your histograms (check the list of k-mer counters on the other k-mer resources wiki page). Here we will use KMC, a well-established k-mer counter that is both very fast and quite versatile.
KMC
When working with KMC, we first need to create a k-mer database for each read set, for example like this:
ls *.fastq.gz > FILES
# kmer 21, 16 threads, 64G of memory, counting kmer coverages between 1 and 100000x
kmc -k21 -t16 -m64 -ci1 -cs100000 @FILES <data_base_name> <directory_to_write_temporary_files>
The important parts are that k is specified at this stage (-k), as is the range of k-mer coverages that are counted (-ci and -cs). If the input file name starts with @ (like @FILES in the example), KMC will read the text file FILES, where it expects to find a list of read files to use for the construction of the database. Finally, KMC generates quite a few temporary files, so on clusters we recommend pointing <directory_to_write_temporary_files> to a directory on a local disk (in our cluster environment you can specify $SCRATCH, which is a directory that is created for each job). A working example of a slurm script for building the database is located at /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/build_KMC_db.sh.
Then we can query the database for various things, such as getting a k-mer histogram. That would be
kmc_tools transform <data_base_name> histogram <name_of_my_histogram_k21.hist> -cx100000
A working example of a slurm script for extracting the histogram is located at /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/get_histogram.sh.
Now, let's try to generate the histogram for a range of k values for one of the two E. coli datasets (SRR1770413, SRR857279) we downloaded for you to /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/ecoli. Both kmc and GenomeScope are installed in a conda environment /cluster/projects/nn9458k/oh_know/.conda/kmer_tools (installation instructions). To access the environment, you can execute
module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools
Remember to include these lines in your scripts for the cluster too.
Here is an example using the build_KMC_db.sh script we prepared for you:
cd $USERWORK
cp /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/*sh .
cat ./build_KMC_db.sh
ls /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/ecoli/SRR1770413_* > files
mkdir kmer_dbs
sbatch ./build_KMC_db.sh files kmer_dbs/ecoli_SRR1770413
This will generate the files kmer_dbs/ecoli_SRR1770413.kmc_pre and kmer_dbs/ecoli_SRR1770413.kmc_suf, which together form a KMC database.
Assuming you chose the solution above, here is how you can use the newly generated database to extract a k-mer histogram:
cat get_histogram.sh
sbatch ./get_histogram.sh kmer_dbs/ecoli_SRR1770413 ecoli_SRR1770413_k21.hist
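One possible way to cover a range of k values is to generate the commands in a loop. The sketch below only prints the kmc and kmc_tools calls shown earlier with different k, so that you can inspect them before pasting them into a slurm script. The function name kmc_commands, the chosen k values and the output naming scheme are assumptions of this sketch, not part of the prepared scripts.

```shell
# Print (do not run) the KMC commands for one value of k.
# $1 = k, $2 = text file listing the read files,
# $3 = output prefix, $4 = directory for temporary files
kmc_commands() {
    k=$1; files=$2; prefix=$3; tmp=$4
    db="${prefix}_k${k}"
    # build the k-mer database (same flags as in the example above)
    echo "kmc -k${k} -t16 -m64 -ci1 -cs100000 @${files} ${db} ${tmp}"
    # extract the histogram from that database
    echo "kmc_tools transform ${db} histogram ${db}.hist -cx100000"
}

# Example: four arbitrarily chosen k values for one E. coli read set.
for k in 15 21 27 31; do
    kmc_commands "$k" files kmer_dbs/ecoli_SRR1770413 "${SCRATCH:-tmp}"
done
```

Printing the commands first keeps the sketch safe to run anywhere; once they look right, they can be placed in a job script after the module and conda lines.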
Plotting various k
You can use a different k-mer counter than KMC if you feel comfortable figuring it out on your own. You can check whether it is available on the cluster with module spider <name_of_a_counter>.
Now, generate a k-mer histogram for a range of k values! Then use your favourite plotting method to look at how the spectrum changes with different k. Here is an example of how it can be plotted in R, but don't let us restrain your creativity:
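# column 2 = number of distinct k-mers at each coverage (row i = coverage i)
cov <- read.table('ecoli_SRR1770413_k21.hist')[, 2]
# y-axis: ignore the error peak (coverage < 20) when scaling
ylim <- c(0, max(cov[20:length(cov)]))
# x-axis: show coverages up to the point containing 99.8% of distinct k-mers
xlim <- c(0, which(cumsum(cov) / sum(cov) >= 0.998)[1])
barplot(cov, ylim = ylim, xlim = xlim, space = 0, border = NA)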
How did the sequencing go? What is the sequencing coverage? Was the sequenced E. coli population genetically uniform?
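If you want a rough number to compare with your plot, the k-mer coverage and genome size can be approximated directly from the histogram. The sketch below is a back-of-the-envelope heuristic, not the model GenomeScope fits: it walks down the error peak to the first local minimum, takes the highest bar to the right of it as the coverage peak, and divides the total number of k-mer occurrences beyond the minimum by that peak coverage. The function name peak_and_size is hypothetical, and the heuristic assumes a single dominant peak, as expected for a haploid genome such as E. coli.

```shell
# Rough k-mer coverage and genome size from a two-column histogram.
# $1 = histogram file (column 1 = coverage, column 2 = number of distinct k-mers)
# usage: peak_and_size ecoli_SRR1770413_k21.hist
peak_and_size() {
    awk '
        { cov[NR] = $1; n[NR] = $2 }
        END {
            # walk down the error peak to the first local minimum
            v = 1
            while (v < NR && n[v + 1] < n[v]) v++
            # main peak = highest bar at or to the right of that minimum
            p = v
            for (i = v; i <= NR; i++) if (n[i] > n[p]) p = i
            # total k-mer occurrences, ignoring presumed errors left of the minimum
            total = 0
            for (i = v; i <= NR; i++) total += cov[i] * n[i]
            printf "peak_coverage\t%d\ngenome_size\t%d\n", cov[p], int(total / cov[p])
        }' "$1"
}
```

Note that the reported value is k-mer coverage, which is lower than the read coverage because a read of length L contains only L - k + 1 k-mers.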
If you are done exploring E. coli, you can also check the lambda phage datasets in /cluster/projects/nn9458k/oh_know/teachers/kamil/data/genome_characterisation/lambda. Fair warning: the sequencing is a lot messier.
There are a bunch of programs to fit genome parameters from k-mer spectra. Within this workshop, you will also hear about tetramer in Hannes' talk, but the main tool for fitting models will be GenomeScope 2.0. GenomeScope was first introduced by Vurture et al. 2017 (https://doi.org/10.1093/bioinformatics/btx153) and later improved and extended to polyploids by Ranallo-Benavidez et al. 2020; the latter is the version used in this tutorial.
GenomeScope fits a bunch of negative binomials with a fixed distance between them to the k-mer spectrum. It assumes that the whole genome has the same ploidy and that heterozygosity is distributed uniformly across the genome. There is an error term, but it only covers sequencing errors; it cannot recognize contamination. The easiest way to learn about GenomeScope is to fit a few dozen spectra and get a feeling for it.
Before doing that, you should start an interactive bash session on the cluster
srun --ntasks=1 --mem-per-cpu=5G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
(See the bash tutorial for more details.)
Then load the conda environment:
module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools
The basic GenomeScope command is
genomescope.R -p <ploidy> -i <hist> -o <output_directory> -n <name_prefix>
Try to run the model on the E. coli and lambda phage histograms you generated. What do you think about the error rates of the individual libraries? Do the estimated genome sizes make sense? If not, why do you think that might be the case?
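When fitting several histograms, it can help to generate one command per file. The sketch below prints (but does not run) a genomescope.R call for each histogram and reads k from file names of the form <sample>_k<k>.hist. The function name genomescope_commands and the output directory naming are assumptions of this sketch. So is the -k option: GenomeScope 2.0 uses it to set the k-mer length, but it is not part of the basic command above, so check genomescope.R --help in your environment before relying on it.

```shell
# Print (do not run) one GenomeScope command per histogram.
# $1 = ploidy, remaining arguments = histogram files named <sample>_k<k>.hist
genomescope_commands() {
    ploidy=$1; shift
    for hist in "$@"; do
        name=$(basename "$hist" .hist)   # e.g. ecoli_SRR1770413_k21
        k=${name##*_k}                   # text after the last "_k", e.g. 21
        echo "genomescope.R -p ${ploidy} -k ${k} -i ${hist} -o genomescope_${name} -n ${name}"
    done
}

# Example: E. coli is haploid, so ploidy 1.
genomescope_commands 1 ecoli_SRR1770413_k21.hist ecoli_SRR1770413_k27.hist
```

Passing the k that was used for counting matters because the model parameters depend on the k-mer length, so a mismatch can bias the estimates.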
Alright, now that you have got the hang of it, try to fit a few more models. In /cluster/projects/nn9458k/oh_know/teachers/kamil/data/data_for_genomescope there are 8 directories full of k-mer histograms. They have various values of k and sometimes there are multiple versions (and it's on you to figure out why they are different). You will be distributed into 8 breakout rooms. Collaborate in each group to create models for all of the provided histograms, but pay the most attention to the histogram of your group (the one that has the same number as your breakout room). You will share with the others what you think is the right model and what you think about the data.
- begonia
- bombina
- cape_honey_bees
- crayfish
- mercurialis
- springtails
- stick_insects
- strawberry
Hints:
- not every species is diploid; the ploidy is specified by parameter -p
- some of the models won't converge to the correct monoploid coverage on their own; you can give GenomeScope a prior on the monoploid coverage by specifying parameter -l
- not all samples have the same ploidy across all chromosomes
- check the high-coverage tails of the histograms directly in the .hist files. What is the most covered k-mer and why?
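For the last hint, a small helper can pull out the high-coverage end of a histogram. This is only a sketch: the function name top_coverages is made up, and it assumes the usual two-column layout (coverage, number of distinct k-mers) sorted by increasing coverage, possibly with many empty rows.

```shell
# Show the highest coverages that actually occur in a histogram.
# $1 = histogram file, $2 = number of rows to show (default 3)
# usage: top_coverages my_sample_k21.hist 5
top_coverages() {
    # keep only rows with a non-zero count, then take the last few
    awk '$2 > 0' "$1" | tail -n "${2:-3}"
}
```

If the very last bin of the histogram holds a suspiciously large count, remember that the histogram was truncated at the maximum given by -cx, so that bin may be an artefact of the cut-off rather than a single true coverage value.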
We are happy to discuss your own k-mer profiles.