Email: ltj8@cdc.gov
Assembly is a “catch-all” term used to describe methods where we combine shorter individual reads into longer contiguous sequences called contigs
Because the sequencing process works by breaking the original DNA into smaller fragments, the assembly process is conceptually similar to putting together an image puzzle from its many pieces
We will learn how to de novo assemble reads sequenced by the Illumina sequencing platform using SPAdes, an assembly toolkit containing various assembly pipelines
- Improving of the reads quality (remove adapters, trim reads, etc..)
- De novo assembly of the overlapping reads into contigs
- Joining contigs into scaffolds
- Comparison with other known genomes
- Filling the gaps
- Annotation of the assembled genome
- Visualize genome annotation
Sequence assembly is perhaps the application domain of bioinformatics where skill and expertise are the most difficult to identify and define
Most come with a bewildering array of parameters - the purpose of which are not explained, yet many will have profound effects on the results that they produce
Trial and error are one of the most commonly used strategies - you will have to keep tuning the parameters and rerun the entire process hoping that the results improve
Thus any expertise built on trial and error will have to be accumulated over a much more extended period
Finally, even when assembly appears to work, almost always it will contain several severe and substantial errors. That is where, in our opinion, bioinformatics expertise matters more
The ability to understand, visualize and correct the mistakes of an assembly has a utility that will outlast the present and is more valuable than knowing the exact invocation of a tool by heart
N50: length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
- A simpler explanation of N50 starts by ordering contigs by length
- Suppose we have 10 different contigs (designated by XXXXXX ) and we ordered these by their decreasing sizes:
Contig Length Sum
XXXXXXXXXX 10 10
XXXXXXXXX 9 19
XXXXXXXX 8 27
XXXXXXX 7 34
XXXXXX 6 40
XXXXX 5 45
XXXX 4 49
XXX 3 52
XX 2 54
X 1 55
- The sum of these lengths starting with the longest is 55. Half of that is 27.5
- Go down on this list and add up the lengths to find the contig where the cumulative length exceeds this half value
- When we hit contig number 7 we have 10 + 9 + 8 + 7 = 34, this value is larger than 27.5 so it is at this point at least half of the genome is stored in contigs of size 7 or greater - N50 is then 7
- “At least half of the nucleotides in this assembly belong to contigs of size 8bp or longer”
- One of the biggest problems using the N50 metric as the primary means of evaluating assembly quality is that it rewards “misjoins”
- A “misjoin” is the error of concatenating separate contigs into a single segment
- Such mistakes can happen when the evidence is insufficient, but the algorithm is tuned to be overly generous in accepting these pieces of evidence
- Also applying cutoffs can lead to odd situations when selecting contigs would lead us to be either well under or well over the 50%
- A k-mer are all possible subsequences of size k
- The rationale for breaking our reads into even smaller pieces, k-mers is that we want to identify “correct” k-mers, those that originate from the real data
- The assumption is that whatever errors the reads may have these are distributed randomly - hence will produce different erroneous k-mers
- The correct k-mers will always be the most abundant
- In general, the longer a k-mer is, the fewer identical k-mers to it exist
- At the same time it is also true the longer a k-mer is, the more likely is that it will contain an error
- When it comes to assembly processes the rule of reasoning is that:
- A larger k value allows resolving more repetitions
- A smaller k increases the chances of seeing a given k-mer
- Hence the selection of a k-mer is a tradeoff between longer repeats and more reliable measures
- In general, you should assemble sequences using the largest k-mer size possible, such that the k-mer coverage is sufficient
- You may estimate these via trial and error (as stated above) but there are also tools that will assist you in determining the most likely k-mer sizes and coverage cutoffs
Phage therapy may to be used as an alternative to antibiotics or, as a supplementary approach to treat some bacterial infections
Bacteriophages have been applied in clinical practice for the treatment of localized infections in wounds, burns, and trophic ulcers, including diabetic foot ulcers (PMC6083058)
Bacteriophages that were successful in treating diabetic foot disease were sequenced using NGS technology
$ mkdir -p ~/denovo_assembly/data/untrimmed_fastq ~/denovo_assembly/data/trimmed_fastq
$ cd ~/denovo_assembly/data/untrimmed_fastq
$ wget -nv https://figshare.com/ndownloader/files/45571629 -O 169_S7_L001_R1_001.fastq.gz --no-check-certificate
$ wget -nv https://figshare.com/ndownloader/files/45571626 -O 169_S7_L001_R2_001.fastq.gz --no-check-certificate
$ cp /home/gitpod/miniconda/envs/denovo_assembly/share/trimmomatic-0.39-2/adapters/TruSeq3-PE-2.fa .
$ fastqc *.fastq.gz
$ trimmomatic PE \
169_S7_L001_R1_001.fastq.gz 169_S7_L001_R2_001.fastq.gz \
169_S7_L001_R1_001.trim.fastq.gz 169_S7_L001_R1_001un.trim.fastq.gz \
169_S7_L001_R2_001.trim.fastq.gz 169_S7_L001_R2_001un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:TruSeq3-PE-2.fa:2:40:15
$ fastqc *trim.fastq.gz
$ mv *trim* ../trimmed_fastq
$ cd ../trimmed_fastq
We will be using the program SPades for de novo assembly
$ spades.py --isolate -1 169_S7_L001_R1_001.trim.fastq.gz -2 169_S7_L001_R2_001.trim.fastq.gz -o spades_output
$ ls -l spades_output
Use QUAST to analyze the SPades output scaffolds fasta file:
$ quast.py -o quast_output spades_output/scaffolds.fasta
- report.txt - assessment summary in plain text format
- report.tsv - tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc)
- report.tex - LaTeX version of the summary
- icarus.html - Icarus main menu with links to interactive viewers. See section 3.4 for details
- report.pdf - all other plots combined with all tables (file is created if matplotlib python library is installed)
- report.html - HTML version of the report with interactive plots inside
- contigs_reports/ - (only if a reference genome is provided)
- misassemblies_report - detailed report on misassemblies
- unaligned_report - detailed report on unaligned and partially unaligned contigs
- k_mer_stats/ - (only if --k-mer-stats option is specified)
- kmers_report - detailed report on k-mer-based metrics
- reads_stats/ - (only if reads are provided)
- reads_report - detailed report on mapped reads statistics
$ mkdir -p ~/denovo_assembly/results/scaffolds
$ mv ~/denovo_assembly/data/trimmed_fastq/spades_output/scaffolds.fasta ~/denovo_assembly/results/scaffolds
$ cd ~/denovo_assembly/results/scaffolds
We will know take our scaffolds and use NCBI BLAST to compare our newly assembled genome to other genomes
The closest is Pseudomonas phage CMS1, complete genome (OM937766.1), with coverage of 99% and identity of 98.53%
Examination of the GenBank record for OM937766 finds that organism is "Pseudomonas phage CMS1" and the taxon ID is 2926659
Set up BWA reference mapping with the scaffold scaffolds.fasta
as reference and add the trimmed fastq files
Make sure you are in the denovo_assembly
directory and make results
directories (helps with file organization, super important!!)
$ cd ~/denovo_assembly
$ mkdir -p results/sam results/bam
$ bwa index results/scaffolds/scaffolds.fasta
Run BWA-MEM reference mapping with the indexed scaffolds.fasta
as the reference and the original trimmed fastq files as the reads:
$ bwa mem results/scaffolds/scaffolds.fasta data/trimmed_fastq/169_S7_L001_R1_001.trim.fastq.gz data/trimmed_fastq/169_S7_L001_R2_001.trim.fastq.gz > results/sam/169.aligned.sam
$ samtools view -S -b results/sam/169.aligned.sam > results/bam/169.aligned.bam
$ samtools sort -o results/bam/169.aligned.sorted.bam results/bam/169.aligned.bam
$ samtools index results/bam/169.aligned.sorted.bam
We will use our scaffolds.fasta
as the reference genome in IGV and the 169.aligned.sorted.bam
BAM file
Now we run the program Pilon
- Single base differences
- Small Indels
- Larger Indels or block substitution events
- Gap filling
- Identification of local misassemblies, including optional opening of new gaps
$ pilon --genome results/scaffolds/scaffolds.fasta --frags results/bam/169.aligned.sorted.bam --output results/scaffolds/169_improved
$ code results/scaffolds/169_improved.fasta
We will use PROKKA on the improved sequence assembly
After you have de novo assembled your genome sequencing reads into scaffolds, it is useful to know what genomic features are on those contigs
Prokka is a "wrapper"; it collects together several pieces of software (from various authors), and so avoids "re-inventing the wheel"
Prokka finds and annotates features (both protein coding regions and RNA genes, i.e. tRNA, rRNA) present on on a sequence
- Protein coding regions on the genome are identified using Prodigal
- The function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases
Prokka is a software tool that can be used to annotate bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and gff formats
- The GFF and GBK files contain all of the information about the features annotated (in different formats)
- The .txt file contains a summary of the number of features annotated
- The .faa file contains the protein sequences of the genes annotated
- The .ffn file contains the nucleotide sequences of the genes annotated
We will use a protein set specific to Pseudomonas phage PEV2 (NC_031063.1), our closely related genome from BLAST, for the annotation
First we want to download the protein coding regions of the Pseudomonas phage PEV2 (NC_031063.1) genome, we can do this from NCBI
$ mkdir -p results/annotation
$ wget -nv https://raw.githubusercontent.com/taylorpaisie/VEME_2024_NGS_Denovo_Assembly/main/NC_031063.1.faa -O ~/denovo_assembly/results/annotation/NC_031063.1.faa
$ prokka --outdir results/annotation/prokka_output --kingdom Viruses --proteins results/annotation/NC_031063.1.faa results/scaffolds/169_improved.fasta
We will use the program Artemis to visualize the genome annotation we made with PROKKA using Artemis
Artemis is a free genome browser and annotation tool that allows visualisation of sequence features, next generation data and the results of analyses within the context of the sequence, and also its six-frame translation
$ /usr/local/share/artemis/art results/annotation/prokka_output/PROKKA_08222023.gff
- There are 3 panels: feature map (top), sequence (middle), feature list (bottom)
- Click right-mouse-button on bottom panel and select Show products
- Zooming is done via the verrtical scroll bars in the two top panels