Task: run
This runs the main ARIBA local assembly pipeline.
Assuming ariba prepareref has been run, with the output directory called ref, run the pipeline with:
ariba run ref reads_1.fq reads_2.fq output_dir
where reads_1.fq and reads_2.fq are the names of the forward and reverse paired reads files. The reads files can be in any format that is compatible with bowtie2 (in particular, gzipped).
Important: ARIBA assumes that read N in the file reads_1.fq is the mate of read N in the file reads_2.fq.
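Because ARIBA relies on the two files being in the same order, it can be worth checking the pairing before a run. The following is a minimal Python sketch and is not part of ARIBA. It assumes strict four-line FASTQ records and that mates share a name once an optional /1 or /2 suffix is removed; read naming conventions vary, so adjust as needed. The function names are hypothetical.

```python
import gzip
from itertools import zip_longest


def open_maybe_gzip(path):
    """Open a plain or gzipped text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_names(path):
    """Yield the name of each read, assuming four lines per FASTQ record."""
    with open_maybe_gzip(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:
                continue  # only header lines carry the read name
            fields = line[1:].split()  # drop the leading '@' and any description
            name = fields[0] if fields else ""
            # Assumption: mates differ only by a trailing /1 or /2, if present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(fwd_path, rev_path):
    """Return the 1-based index of the first read whose mate does not match.

    Returns None when both files have the same names in the same order.
    A file that ends early counts as a mismatch at that position.
    """
    pairs = zip_longest(read_names(fwd_path), read_names(rev_path))
    for n, (fwd, rev) in enumerate(pairs, start=1):
        if fwd is None or rev is None or fwd != rev:
            return n
    return None
```

Calling first_mismatch("reads_1.fq", "reads_2.fq") returns None when the files look consistently paired under these assumptions.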
All output files will be put in a new directory called output_dir.
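When the pipeline is driven from a script, the command above can be wrapped as in the sketch below. The helper names are hypothetical, and the check that the output directory does not already exist is an assumption drawn from the statement that output goes in a new directory; only the positional arguments shown above are used.

```python
import subprocess
from pathlib import Path


def build_run_command(ref_dir, reads_1, reads_2, out_dir):
    """Return the argument list for the 'ariba run' call described above."""
    return ["ariba", "run", str(ref_dir), str(reads_1), str(reads_2), str(out_dir)]


def run_ariba(ref_dir, reads_1, reads_2, out_dir):
    """Run the pipeline, refusing to start if out_dir already exists."""
    # Assumption: the output directory must be new, so fail early otherwise.
    if Path(out_dir).exists():
        raise FileExistsError(f"output directory already exists: {out_dir}")
    # check=True raises CalledProcessError if ariba exits with an error.
    subprocess.run(build_run_command(ref_dir, reads_1, reads_2, out_dir), check=True)
```

For example, run_ariba("ref", "reads_1.fq", "reads_2.fq", "output_dir") issues the same command as the one shown above.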
To see all the options, use --help:
ariba run --help
The most important file is report.tsv. This is a filtered version of the complete file report.all.tsv, which has at least one row per reference sequence that had reads mapped to it (see the task reportfilter for more details on the filtering).
The meaning of the columns in report.tsv is as follows.
Column | Description |
---|---|
1. ref_name | name of reference sequence chosen from cluster |
2. ref_type | type of reference sequence (presence/absence, variants only, noncoding) |
3. flag | cluster flag |
4. reads | number of reads in this cluster |
5. cluster | name of cluster |
6. ref_len | length of reference sequence |
7. ref_base_assembled | number of reference nucleotides assembled by this contig |
8. pc_ident | %identity between reference sequence and contig |
9. ctg | name of contig matching reference |
10. ctg_len | length of contig |
11. ctg_cov | mean mapped read depth of this contig |
12. known_var | is this a known SNP from reference metadata? 1 or 0 |
13. var_type | The type of variant. Currently only SNP supported |
14. var_seq_type | Variant sequence type. if known_var=1, n or p for nucleotide or protein |
15. known_var_change | if known_var=1, the wild/variant change, eg I42L |
16. has_known_var | if known_var=1, 1 or 0 for whether or not the assembly has the variant |
17. ref_ctg_change | amino acid or nucleotide change between reference and contig, eg I42L |
18. ref_ctg_effect | effect of change between reference and contig, eg SYN, NONSYN (amino acid changes only) |
19. ref_start | start position of variant in reference |
20. ref_end | end position of variant in reference |
21. ref_nt | nucleotide(s) in reference at variant position |
22. ctg_start | start position of variant in contig |
23. ctg_end | end position of variant in contig |
24. ctg_nt | nucleotide(s) in contig at variant position |
25. smtls_total_depth | total read depth at variant start position in contig, reported by mpileup |
26. smtls_alt_nt | alt nucleotides on contig, reported by mpileup |
27. smtls_alt_depth | alt depth on contig, reported by mpileup |
28. var_description | description of variant from reference metadata |
29. free_text | other free text about reference sequence, from reference metadata |
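The table above can be turned into a small parser for downstream scripts. This is a sketch, not ARIBA's own code: the column names are copied from the table, lines starting with # are skipped on the assumption that any header line is written that way, and parse_report is a hypothetical name.

```python
# Column names in order, as listed in the table above.
COLUMNS = [
    "ref_name", "ref_type", "flag", "reads", "cluster", "ref_len",
    "ref_base_assembled", "pc_ident", "ctg", "ctg_len", "ctg_cov",
    "known_var", "var_type", "var_seq_type", "known_var_change",
    "has_known_var", "ref_ctg_change", "ref_ctg_effect", "ref_start",
    "ref_end", "ref_nt", "ctg_start", "ctg_end", "ctg_nt",
    "smtls_total_depth", "smtls_alt_nt", "smtls_alt_depth",
    "var_description", "free_text",
]


def parse_report(lines):
    """Parse tab-separated report lines into a list of dicts keyed by column."""
    rows = []
    for line in lines:
        # Assumption: header or comment lines, if any, start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, found {len(fields)}"
            )
        rows.append(dict(zip(COLUMNS, fields)))
    return rows
```

A typical use is to open the file and pass the handle directly: with open("report.tsv") as f: rows = parse_report(f). All values are kept as strings, including the dots used for empty fields.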
If a gene is assembled with no variants, there is one row for that gene, with information only in columns 1-11 (and possibly 29); the remaining columns are dots. Otherwise, there is one row per variant. If you want a short summary of the genes present and the corresponding flags, run:
cut -f1,3 report.tsv | uniq
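The same summary can be produced in Python, which is convenient when the result feeds into further processing. The sketch below mirrors the cut and uniq behaviour by keeping columns 1 and 3 and dropping adjacent duplicates; unlike the shell command, it also skips lines starting with #, which is an assumption about how any header line is written. It also includes a check for variant rows based on the rule that columns 12-28 are dots when a gene has no variants. Both function names are hypothetical.

```python
def gene_flags(lines):
    """Return (ref_name, flag) pairs, dropping adjacent duplicates like uniq."""
    pairs = []
    for line in lines:
        # Assumption: header or comment lines, if any, start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pair = (fields[0], fields[2])  # columns 1 and 3
        if not pairs or pairs[-1] != pair:
            pairs.append(pair)
    return pairs


def is_variant_row(line):
    """True if any of columns 12-28 holds something other than a dot."""
    fields = line.rstrip("\n").split("\t")
    return any(value != "." for value in fields[11:28])
```

For example, with open("report.tsv") as f: print(gene_flags(f)) prints one pair per gene when its rows are adjacent in the file.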
The other files written to the output directory are as follows.
- assembled_seqs.fa.gz. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept.
- log.clusters.gz. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.
- version_info.txt. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
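Since assembled_seqs.fa.gz is a gzipped FASTA file, it can be read without decompressing it first. The following is a minimal sketch using only the Python standard library; read_fasta_gz is a hypothetical helper, and it uses the full header line (minus the leading >) as the sequence name, so a repeated name would overwrite the earlier entry.

```python
import gzip


def read_fasta_gz(path):
    """Return a dict mapping each FASTA header to its sequence."""
    parts = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]  # header text without the '>'
                parts[name] = []
            elif name is not None:
                parts[name].append(line)  # sequences may span several lines
    return {n: "".join(chunks) for n, chunks in parts.items()}
```

For example, read_fasta_gz("output_dir/assembled_seqs.fa.gz") returns the assembled sequences keyed by header, assuming the run wrote its output to output_dir.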