Contamination in skimming data (CONSULT)
Kamil S. Jaron edited this page Mar 22, 2024 · 1 revision
There are two ways to remove contamination:
- Inclusion filters, when you know what you are looking for and you have a reference genome.
  - Here, you can use any tool, such as BLAST or bowtie.
- Exclusion filters, when you do not know exactly what to look for but you know what you do not want (e.g., bacteria).
We have bacterial/archaeal libraries available for both CONSULT and Kraken.
- GTDB is the most comprehensive library.
- Links to all of our reference libraries are available on our raw data GitHub for CONSULT.
To query sequence reads against the reference database, we run:
mkdir consult
cd consult/
# Let's use as query a bunch of Drosophila genome skims (already on cluster)
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/Drosophila/ .
# I have already copied the reference dataset; let's link to it
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb/ .
consult_search -i all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb -c 1 -t 2 -q Drosophila/ 2>&1 |tee consult.log &
This runs for 5-10 minutes. While it runs, we can monitor it a bit:
top -u smirarab
tail -f consult.log
watch -n 10 wc -l ucseq_* Drosophila/*
After it finishes, inspect the results:
less consult.log
wc -l ucseq_* Drosophila/*
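The line counts above can be turned into a per-sample estimate of how many reads survived filtering. This is a minimal sketch, assuming that each ucseq_ file holds the reads CONSULT left unclassified, that both files are uncompressed FASTQ with four lines per read, and that each output is named ucseq_ followed by the query file name. The function name is our own and is not part of CONSULT.

```shell
# Hypothetical helper (not part of CONSULT): report how many reads of a query
# file are present in the matching ucseq_ file.
# Assumes uncompressed FASTQ with exactly four lines per read.
unclassified_fraction() {
    query=$1
    unclassified=$2
    total_reads=$(( $(wc -l < "$query") / 4 ))
    kept_reads=$(( $(wc -l < "$unclassified") / 4 ))
    awk -v t="$total_reads" -v k="$kept_reads" 'BEGIN {
        if (t == 0) { print "no reads in query"; exit 1 }
        printf "%d of %d reads unclassified (%.1f%%)\n", k, t, 100 * k / t
    }'
}

# Usage, assuming outputs are named ucseq_<query file name>:
for q in Drosophila/*; do
    u="ucseq_$(basename "$q")"
    if [ -f "$u" ]; then
        printf '%s: ' "$q"
        unclassified_fraction "$q" "$u"
    fi
done
```

A sample where most reads remain unclassified has little detectable bacterial or archaeal contamination under the chosen settings.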
Here are the arguments to CONSULT:
- -i: the name of the reference database
- -c: the lowest number of k-mers required to mark a sequencing read as classified. For instance, if at least one k-mer match is enough to classify a read, "c" should be set to 1. If at least two k-mer matches are required to call a read a match, "c" should be set to 2.
- -t: the number of threads
- -q: the name of the folder where the queries are located
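The effect of the "c" threshold can be illustrated with a toy example. The input table used here (read name, then number of matching k-mers) is purely hypothetical and is not a file that CONSULT is known to write; the sketch only shows the decision rule described above, where a read is classified when its match count is at least "c".

```shell
# Illustration only: the table format (read name, number of matching k-mers)
# is an assumption made for this example, not CONSULT output.
# A read counts as classified when its match count is at least c.
count_classified() {
    c=$1
    table=$2
    awk -v c="$c" '$2 >= c { n++ } END { print n + 0 }' "$table"
}

# Usage: count_classified 1 matches.txt
#        count_classified 2 matches.txt
```

Raising "c" can only keep the number of classified reads the same or lower it, so a higher value makes the exclusion filter more conservative.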
Note:
- CONSULT is not yet as polished as more established tools.
- We will improve it.
We suggest using the default value for the alpha option, which is 0 (alpha is the value passed to --confidence in the Kraken command below). This recommendation is based on our empirical findings from a previous paper.
To query a Kraken database, we use:
kraken2 --use-names --threads 24 --report REPORT_FILE_NAME --db DATABASE_NAME --confidence alpha --classified-out CLASSIFIED_FASTQ_FILE --unclassified-out UNCLASSIFIED_FASTQ_FILE QUERY_FASTQ_FILE > KRAKEN_OUTPUT_FILE
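Once the command above has finished, the per-read output can be summarized quickly. This is a minimal sketch, relying on the standard Kraken 2 output format in which the first tab-separated column of each line is "C" (classified) or "U" (unclassified); the function name is our own.

```shell
# Summarize a Kraken 2 per-read output file.
# Column 1 of each line is "C" (classified) or "U" (unclassified).
kraken_summary() {
    awk -F '\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            n = c + u
            if (n == 0) { print "no reads found"; exit 1 }
            printf "classified=%d unclassified=%d (%.1f%% classified)\n", c, u, 100 * c / n
        }' "$1"
}

# Usage: kraken_summary KRAKEN_OUTPUT_FILE
```

For an exclusion filter against a bacterial/archaeal library, the classified fraction is the share of reads flagged as putative contamination.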