Why oh why another variant detection tool? I hear you ask. There are many high quality tools for reporting variants from microbial paired-end sequencing, including but not limited to snippy and gatk. If you require SNP calling for phylogenetics or core genome analysis I recommend that you use these tools.
However, there are cases where a simple vcf is all that is required, in particular for use in identification of acquired AMR mechanisms. In addition, many tools which identify SNPs or deletions for AMR are part of large-scale tools which, whilst high quality and extremely useful, can be complex to install and run due to their dependencies.
mutAMR was written to address a very simple need - generation of a single output file that can be used for identification of variants for AMR. It is designed to be a very lightweight tool that simply and specifically generates a vcf file from paired-end Illumina reads. It is a stripped-down tool, using bwa-mem, freebayes, delly and samtools, and is inspired by snippy.
Further functions may be introduced over time (for example variant calling for amplicon-based sequencing or for specific genes) if others do not write a more useful tool!!
mutAMR is NOT a
- Pipeline to provide hands off interpretations for AMR. If you require a tool like that - please use
- Tool designed for generation of alignments suitable for phylogenetic or core-genome analysis. If you require a tool like this - please use
When designing mutAMR I have made some assumptions about the setup, inputs and user requirements.
- Paired-end fastq files
  a. It is assumed that these reads are generated from the species from which you supply a reference genome.
  b. It is assumed that the reads are of sufficient quality for generation of alignments.
- The user does not want to retain any intermediary files (you can retain these files if you like - see below for how to use).
- If running from the commandline, mutAMR is being run on a per-sample basis. If you want to run it on more than one sample:
  a. Use a workflow language - recommended.
  b. Use parallel (see below for suggested format).
  c. Use a for-loop to iterate over your collection.
- delly is installed properly and you want to detect large deletions in your sequences. If not, only small deletions will be detected by freebayes, which is capable of accurately recovering deletions up to ~50-75 bp.
- snpEff is installed properly with available configs. If not, no annotation will occur and you will need to annotate your vcf yourself.
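The assumptions about optional tools can be checked before a run. The sketch below is only an illustration and is not part of mutAMR: it uses Python's shutil.which to look for executables on PATH. The executable names delly and snpEff, and the helper names optional_tools_available and describe, are assumptions made for this example and may need adjusting for your installation.

```python
import shutil

# Assumed executable names; these may differ depending on how the tools were installed.
OPTIONAL_TOOLS = ("delly", "snpEff")


def optional_tools_available(tools=OPTIONAL_TOOLS):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


def describe(status):
    """Summarise what a missing optional tool implies, per the assumptions above."""
    messages = []
    if not status.get("delly", False):
        messages.append("delly not found: only small deletions (freebayes) will be detected")
    if not status.get("snpEff", False):
        messages.append("snpEff not found: the vcf will not be annotated")
    return messages
```

Calling describe(optional_tools_available()) returns an empty list when both tools are found. Note that finding an executable on PATH does not confirm that the snpEff configs are available.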
mutAMR is a python package that runs
- bwa mem to align reads to the reference genome.
- freebayes to identify variants. Note variants will be identified down to the minimum fraction designated by the user (default 0.1), see Running mutAMR.
- If installed, delly will be used to identify large deletions. If not installed, only small deletions will be reported, as detected by freebayes. A combined vcf file will be generated, combining the variants detected by freebayes and delly.
- Annotation will be undertaken using snpEff, to allow for simple integration with the WHO M. tuberculosis catalogue V2.
mutAMR has been validated for detection of SNPs in M. tuberculosis for the purposes of AMR mechanism detection. The default settings for mutAMR are:
- Default min depth for base calling is 20 reads. This is higher than what is more commonly used (10 reads). This is because when calling lower frequency mutations (<90% allele frequency), using 10 reads resulted in more false positive variant detection, which can potentially lead to false calling of resistance.
- Min allele frequency is set to 0.1 in order to capture low frequency mutations. Allele frequencies lower than 0.1 also resulted in false positive SNPs.
Validation results are published TBC.
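To make the two default thresholds concrete, the sketch below applies a depth check and an allele-frequency check to a single site. It only illustrates the thresholds and is not mutAMR's internal code; the function name passes_thresholds and its arguments are hypothetical.

```python
def passes_thresholds(depth, alt_reads, min_depth=20, min_frac=0.1):
    """Return True if a site meets both the depth and allele-frequency thresholds.

    depth     -- total number of reads covering the site
    alt_reads -- number of reads supporting the alternate allele
    """
    # Reject sites with too little coverage (and avoid dividing by zero).
    if depth <= 0 or depth < min_depth:
        return False
    # Fraction of reads supporting the alternate allele.
    return alt_reads / depth >= min_frac
```

With the defaults, a site covered by 20 reads needs at least 2 supporting reads to reach a frequency of 0.1. At a depth of 10 a single read would be enough, which may give some intuition for why the lower depth was more prone to false positives at low allele frequencies.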
It is highly recommended to install mutAMR using conda in order to prevent dependency clashes and other issues that may arise - especially if using a shared computing resource.
mutAMR can be installed as a conda package with all dependencies.
conda create -n mutamr mutamr
Or you can download the environment.yml file from the root of this repository and run
conda env create -f environment.yml
At a minimum you need to make sure that the required dependencies have been installed. The versions specified below have all been confirmed to work together and not cause any installation issues or unexpected behaviour. If you decide to use other versions - please be aware that behaviour may not be as described. For example samtools version 1.21 can cause issues, whilst version 1.20 does not.
- python ==3.10
- samtools ==1.20
- bcftools ==1.20
- freebayes ==1.3.8
- bwa mem ==0.7.18
- delly ==1.2.8
- snpEff ==5.2
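If you install the dependencies yourself, a small check against the pinned versions can catch a mismatch such as samtools 1.21. The sketch below is a hedged illustration, not part of mutAMR: the helper names extract_version and matches_pin are hypothetical, and it assumes you have already captured each tool's version text yourself (the exact output format of each tool is not assumed, only that it contains a dotted version number).

```python
import re

# Versions listed above as confirmed to work together.
PINNED = {
    "samtools": "1.20",
    "bcftools": "1.20",
    "freebayes": "1.3.8",
    "bwa": "0.7.18",
    "delly": "1.2.8",
    "snpEff": "5.2",
}


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def matches_pin(tool, version_text, pinned=PINNED):
    """Return True if the version found in version_text equals the pinned version."""
    return extract_version(version_text) == pinned[tool]
```

For example, matches_pin("samtools", "samtools 1.21") returns False, flagging the version reported above as problematic.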
mutAMR can be used from the commandline - or as an importable package to run as part of another python package.
Below is an example of using mutAMR as part of another python script or tool.
Required arguments
- the reference genome, in fasta format (if you are using mutAMR for M. tuberculosis you can simply set mtb=True, there is no need to use this argument)
- the species for snpEff (if you are using mutAMR for M. tuberculosis you can simply set mtb=True, there is no need to use this argument)
- threads - default = 8
- default = 8
- keep - boolean argument - if you would like to keep all intermediary files (including bam) set to True - default = False
- mtb - boolean argument - if set to True the reference and annotation species will be automatically set - default = False
- the minimum depth required for base calling. The default is higher than standard (10) to improve the performance of base calling at low minfrac - default = 20
- minfrac - the lowest allele frequency to call a SNP - default = 0.1
- force - if the output folder already exists mutAMR will stop to prevent accidental overwriting of data. If you would like to override existing mutAMR outputs set force=True - default = False
- the tmp directory for samtools - use of this can improve performance - default =
- default =
Example for running M. tuberculosis
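from mutamr import Fastq2vcf

# Paths to the paired-end reads and the sample name used for the output folder
read1 = "/path/read1.fastq.gz"
read2 = "/path/read2.fastq.gz"
seq_id = "sample_name"
# mtb=True sets the reference and annotation species automatically
mtb = True
# keep=True retains intermediary files
keep = True
V = Fastq2vcf.Fastq2Vcf(
    read1 = read1,
    read2 = read2,
    seq_id = seq_id,
    keep = keep,
    mtb = mtb
)
vcf = V.run()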
This will generate a vcf file at sample_name/sample_name.annot.vcf.gz. The variable vcf is a string and contains the path to the vcf file for input into other tools or functions.
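Because the result is a path to a gzip-compressed vcf, it can be passed straight to downstream code. The sketch below is a minimal illustration using only the Python standard library; the helper name iter_variants is hypothetical and not part of mutAMR, and it relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT).

```python
import gzip


def iter_variants(vcf_path):
    """Yield (chrom, pos, ref, alt) for each record in a gzip-compressed vcf."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # Standard VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
            chrom, pos, _id, ref, alt = fields[:5]
            yield chrom, int(pos), ref, alt
```

For example, list(iter_variants(vcf)) would collect every variant position in the file produced by the run above.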
mutAMR can also be run from the commandline
Example for running M. tuberculosis
mutamr wgs -1 /path/read1.fastq.gz -2 /path/read2.fastq.gz -s sample_name --mtb --keep
This will generate the same files as above, keeping all intermediary files.
As stated above, ideally if you would like to run mutAMR on a batch of sequences you can use a workflow language or parallel. An example of parallel is below.
parallel --colsep '\t' -j 8 mutamr wgs -1 {2} -2 {3} -s {1} --mtb :::: reads.txt
where reads.txt is a tab-delimited file containing 3 columns
- Sequence ID
- Path to R1
- Path to R2
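If parallel is not available, the same reads.txt can be processed with a simple for-loop. The sketch below is one possible approach, not part of mutAMR: the helper names build_commands and run_batch are hypothetical, and it uses only the command-line flags shown in the examples above. Unlike the parallel example, it runs samples one after another.

```python
import csv
import subprocess


def build_commands(reads_file):
    """Return one mutamr command (as an argument list) per row of reads_file."""
    commands = []
    with open(reads_file, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 3:
                continue  # skip blank or incomplete rows
            seq_id, r1, r2 = row[:3]
            commands.append(
                ["mutamr", "wgs", "-1", r1, "-2", r2, "-s", seq_id, "--mtb"]
            )
    return commands


def run_batch(reads_file):
    """Run each sample in turn, stopping at the first command that fails."""
    for command in build_commands(reads_file):
        subprocess.run(command, check=True)
```

Calling run_batch("reads.txt") would then run mutamr once per row. Building the commands separately from running them makes it easy to inspect what will be executed first.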