-
Notifications
You must be signed in to change notification settings - Fork 55
Task: run
This runs the main ARIBA local assembly pipeline.
Assuming ariba prepareref
has been run, with the output directory called
ref
, run the pipeline with
ariba run ref reads_1.fq reads_2.fq output_dir
where the reads_1.fq
, reads_2.fq
are the names of the forwards and
reverse paired reads files. The reads files can be in any format
that is compatible with
minimap
(in particular, gzipped).
Important: ARIBA assumes that read N in the file reads_1.fq
is the
mate of read N in the file reads_2.fq
.
All output files will be put in a new directory called out_dir
.
To see all the options, use --help
:
ariba run --help
The most important file is report.tsv
, an example of which
can be found here.
The meaning of the columns in report.tsv
is as follows.
Column | Description |
---|---|
1. ariba_ref_name | ariba name of reference sequence chosen from cluster (needs to rename to stop some tools breaking) |
2. ref_name | original name of reference sequence chosen from cluster, before renaming |
3. gene | 1=gene, 0=non-coding (same as metadata column 2) |
4. var_only | 1=variant only, 0=presence/absence (same as metadata column 3) |
5. flag | cluster flag |
6. reads | number of reads in this cluster |
7. cluster | name of cluster |
8. ref_len | length of reference sequence |
9. ref_base_assembled | number of reference nucleotides assembled by this contig |
10. pc_ident | %identity between reference sequence and contig |
11. ctg | name of contig matching reference |
12. ctg_len | length of contig |
13. ctg_cov | mean mapped read depth of this contig |
14. known_var | is this a known SNP from reference metadata? 1 or 0 |
15. var_type | The type of variant. Currently only SNP supported |
16. var_seq_type | Variant sequence type. if known_var=1, n or p for nucleotide or protein |
17. known_var_change | if known_var=1, the wild/variant change, eg I42L |
18. has_known_var | if known_var=1, 1 or 0 for whether or not the assembly has the variant |
19. ref_ctg_change | amino acid or nucleotide change between reference and contig, eg I42L |
20. ref_ctg_effect | effect of change between reference and contig, eg SYS, NONSYN (amino acid changes only) |
21. ref_start | start position of variant in reference |
22. ref_end | end position of variant in reference |
23. ref_nt | nucleotide(s) in reference at variant position |
24. ctg_start | start position of variant in contig |
25. ctg_end | end position of variant in contig |
26. ctg_nt | nucleotide(s) in contig at variant position |
27. smtls_total_depth | total read depth at variant start position in contig, reported by mpileup |
28. smtls_nts | nucleotides on contig, as reported by mpileup. The first is the contig nucleotide |
29. smtls_nts_depth | depths on contig, as reported by mpileup. One number per nucleotide in the previous column |
30. var_description | description of variant from reference metdata |
31. free_text | other free text about reference sequence, from reference metadata |
If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1-12 (and possibly 30) and the remaining columns are dots. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:
cut -f1,4 report.tsv | uniq
The other files written to the output directory are as follows.
-
assembled_seqs.fa.gz
. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept. Furthermore, since no filter of nucleotide identity or reference coverage is applied when generating this file, partial matches from contigs to reference sequences are also included. -
assembled_genes.fa.gz
. This is a gzipped FASTA file of assembled gene sequences. It does not contain non-coding sequences (those are inassembled_seqs.fa.gz
), only genes. When comparing a local assembly to a gene, mismatches near the end of the gene can cause the alignment to be too short. ARIBA tries to extend the match by looking for start and stop codons. The extended sequences are in this file, whereas the unextended sequences are inassembled_seqs.fa.gz
. -
assemblies.fa.gz
. This is a gzipped FASTA file of the assemblies. It contains the complete, unedited, contigs. Since flanking regions of each reference sequence is assembled, the contig is usually longer than its matching reference sequence. As a result, file sizes should often show the following relationship:assemblies.fa.gz
>assembled_genes.fa.gz
andassembled_seqs.fa.gz
, and sometimesassembled_seqs.fa.gz
>assembled_genes.fa.gz
. -
debug.report.tsv
. Not only does this file contain all rows ofreport.tsv
, but also includes rows about synonymous mutations. -
log.clusters.gz
. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information. -
version_info.txt
. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
nucmer options:
-
--nucmer_min_id INT
. Minimum alignment identity (the delta-filter -i option). Default: 90. -
--nucmer_min_len INT
. Minimum alignment length (the delta-filter -l option). Default: 20. -
--nucmer_breaklen INT
. Value to use for-breaklen
when running nucmer. Default: 200.
Assembly options:
-
--assembly_cov INT
. Target read coverage when sampling reads for assembly. The reads are randomly downsampled to the given coverage for assembly. If this option is more than the actual read coverage, then all reads are used for assembly. Default: 50. -
--min_scaff_depth INT
. Minimum number of read pairs needed as evidence for scaffold link between two contigs. Default: 10. -
--assembler {fermilite,spades}
. Default: fermilite. Spades will need to be installed on the host. -
--spades_mode {wgs,sc,rna}
. If using Spades assembler, either use default WGS mode, Single Cell mode (spades.py --sc) or RNA mode (spades.py --rna). Use SC or RNA mode if your input is from a viral sequencing with very uneven and deep coverage. Set --assembly_cov to some high value if using SC or RNA mode. -
--spades_options SPADES_OPTIONS
. Extra options to pass to Spades assembler. Sensible default options will be picked based on --spades_mode argument. Anything set here will replace the defaults completely. Example:--spades_options '--phred-offset 33'
Other options:
-
--threads INT
. Experimental. Number of threads. Will run clusters in parallel, but not minimap. Default: 1. -
--assembled_threshold FLOAT (between 0 and 1)
. If proportion of gene assembled (regardless of into how many contigs) is at least this value then the flaggene_assembled is set
. Default: 0.95. -
--gene_nt_extend INT
. Max number of nucleotides to extend ends of gene matches to look for start/stop codons. Default: 30. -
--unique_threshold FLOAT (between 0 and 1)
. If proportion of bases in gene assembled more than once is <= this value, then the flagunique_contig
is set. Default: 0.03. -
--force
. Overwrite output directory, if it already exists -
--noclean
. Do not clean up intermediate files -
--tmp_dir TMP_DIR
. Existing directory in which to create a temporary directory used for local assemblies. -
--verbose
. Be verbose.