INRB, Kinshasa, DRC. 2023
Here is a list of all the virus bioinformatics tools
We are going to jump right in with metagenomics, but here is a brief introduction if you want to read something while Rob is talking.
We have created servers for you with all the software and data that you will need for these excercises.
There are two machines that you can use, if you don't have access to a server:
IP Addresses:
1: 115.146.85.154
2: 115.146.84.145
Install conda
, fastp
, minimap2
, samtools
using conda
We will use all of these programs today.
Download the CF data fastq files, or if you want, you can use your own fastq files and see what you find!
For this example, I will download:
If you are using a remote server, you can use wget
again to download these reads:
wget https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/Datasets/CF/788707_20180129_S_R1.fastq.gz
wget https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/Datasets/CF/788707_20180129_S_R2.fastq.gz
We are going to use the Illumina Adapters, and trim out:
- Sequences that are less 100 bp
- Sequences that contain 1 N
- Trim the adapters off the 3' and 5' ends of the sequences
Do you remember how to download the Illumina Adapters? The URL is https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/SequenceQC/IlluminaAdapters.fa
Once you have downloaded the adapters, we can use this command:
mkdir fastp
fastp -n 1 -l 100 -i 788707_20180129_S_R1.fastq.gz -I 788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
When fastp
runs, you will get an HTML output file called fastp.html. This shows some statistics about the run.
The NCBI has several human genome versions specifically designed for inclusion in pipelines like this.
A. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
A gzipped file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit.
Note: the two PAR regions on chrY have been hard-masked with Ns.
The chromosome Y sequence provided therefore has the same
coordinates as the GenBank sequence but it is not identical to the
GenBank sequence. Similarly, duplicate copies of centromeric arrays
and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked
with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
Note: The EBV sequence is not part of the genome assembly but is
included in the analysis set as a sink for alignment of reads that
are often present in sequencing samples.
B. GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
6. alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units.
C. GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_full_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
D. GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
For this work, we are going to use GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz which contains everything.
If you want to download it, you can use wget
like we have done before!
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
minimap2 --split-prefix=tmp$$ -a -xsr GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz | samtools view -bh | samtools sort -o 788707_20180129.bam
samtools index 788707_20180129.bam
Here is the samtools specification, and the description of the columns is on page 6.
Now, we use samtools
flags to filter out the human and not human sequences. You can find out what the flags mean using the samtools flag explainer
mkdir human not_human
samtools fastq -F 3588 -f 65 788707_20180129.bam | gzip -c > human/788707_20180129_S_R1.fastq.gz
echo "R2 matching human genome:"
samtools fastq -F 3588 -f 129 788707_20180129.bam | gzip -c > human/788707_20180129_S_R2.fastq.gz
samtools fastq -F 3584 -f 77 788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R1.fastq.gz
samtools fastq -F 3584 -f 141 788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R2.fastq.gz
samtools fastq -f 4 -F 1 788707_20180129.bam | gzip -c > not_human/788707_20180129_S_Singletons.fastq.gz
In the above example, we started with two fastq files, removed adapter sequences, mapped them to the human genome, and then separated out the human and not human sequences.
We can combine all of that into a single snakemake
file, and it will do all of the steps for us.
See the Snakemake section for details on how to run these two commands in a single pipeline.
See the Sequence Assembly section for a discussion of current sequence assemblers.
To assemble the reads from the CF patient, we can install spades. Here is the spades manual
mamba install -y spades
or, if you did not install mamba, you can use
conda install -y spades
Then, we can assemble the filtered bacterial sequences using spades:
spades.py --meta -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o not_human_assembly
Once you have assembled the sequences, the first step is to visualise the assembly with bandage, and hopefully you will find an image like this
You can find the Hecatomb tutorial on the readthedocs website
You can download the kraken2 databases and run Kraken2 on your samples.
Just as with taxonomy, there are two broad approaches to figuring out the functions that are present.
There are a suite of tools that use heuristics (en Français: heuristique). For example, superfocus uses k-mers, short sequences, to find the functions that are present.
Another way to find the functions, is to use MMSeqs2 easy-taxonomy. You can find more details about the easy-taxonomy in the MMSeqs2 manual