martinghunt edited this page May 5, 2016 · 28 revisions

Task: run

This runs the main ARIBA local assembly pipeline.

Assuming ariba prepareref has been run, with the output directory called ref, run the pipeline with

ariba run ref reads_1.fq reads_2.fq output_dir

where reads_1.fq and reads_2.fq are the names of the forward and reverse paired reads files. The reads files can be in any format that is compatible with bowtie2 (in particular, they can be gzipped).

Important: ARIBA assumes that read N in the file reads_1.fq is the mate of read N in the file reads_2.fq. All output files will be put in a new directory called output_dir.
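Because the pipeline relies on the two files being in the same order, it can be worth checking this before a run. The following is a minimal sketch, not part of ARIBA: the function names are hypothetical, and it assumes standard 4-line FASTQ records whose read names match once an optional trailing /1 or /2 is removed.

```python
import gzip
import itertools


def open_maybe_gzipped(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def read_names(path):
    """Yield the name of each read, assuming 4-line FASTQ records."""
    with open_maybe_gzipped(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Drop a trailing /1 or /2 mate suffix, if present (assumed convention)
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def first_mismatch(fq1, fq2):
    """Return (record_number, name1, name2) for the first out-of-sync pair, or None."""
    pairs = itertools.zip_longest(read_names(fq1), read_names(fq2))
    for n, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return (n, name1, name2)
    return None
```

Calling first_mismatch("reads_1.fq", "reads_2.fq") returns None when every read N has the same name in both files. If one file is shorter, the missing name is reported as None.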

To see all the options, use --help:

ariba run --help

Report file

The most important file is report.tsv. This is a filtered version of the complete file report.all.tsv, which has at least one row per reference sequence that had reads mapped to it (see the task reportfilter for more details on the filtering).
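To see which reference sequences appear in report.all.tsv but have no rows left in report.tsv, the first column of the two files can be compared. This is a rough sketch with hypothetical function names. It assumes both files are tab-delimited with the reference name in column 1, and that any header line starts with #, which this page does not state.

```python
def ref_names(path):
    """Collect column 1 (ref_name) of a report file, in order of first appearance."""
    names = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and lines starting with '#' (assumed header convention)
            if not line.strip() or line.startswith("#"):
                continue
            name = line.rstrip("\n").split("\t")[0]
            if name not in seen:
                seen.add(name)
                names.append(name)
    return names


def dropped_by_filter(all_report, filtered_report):
    """Return reference names present in all_report but absent from filtered_report."""
    kept = set(ref_names(filtered_report))
    return [name for name in ref_names(all_report) if name not in kept]
```

For example, dropped_by_filter("output_dir/report.all.tsv", "output_dir/report.tsv") lists the references that were removed entirely. It does not show individual rows that were filtered from a reference that is still reported.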

The meaning of the columns in report.tsv is as follows.

Column Description
1. ref_name name of reference sequence chosen from cluster
2. ref_type type of reference sequence (presence/absence, variants only, noncoding)
3. flag cluster flag
4. reads number of reads in this cluster
5. cluster name of cluster
6. ref_len length of reference sequence
7. ref_base_assembled number of reference nucleotides assembled by this contig
8. pc_ident %identity between reference sequence and contig
9. ctg name of contig matching reference
10. ctg_len length of contig
11. ctg_cov mean mapped read depth of this contig
12. known_var is this a known SNP from reference metadata? 1
13. var_type The type of variant. Currently only SNP supported
14. var_seq_type Variant sequence type. if known_var=1, n
15. known_var_change if known_var=1, the wild/variant change, eg I42L
16. has_known_var if known_var=1, 1
17. ref_ctg_change amino acid or nucleotide change between reference and contig, eg I42L
18. ref_ctg_effect effect of change between reference and contig, eg SYN, NONSYN (amino acid changes only)
19. ref_start start position of variant in reference
20. ref_end end position of variant in reference
21. ref_nt nucleotide(s) in reference at variant position
22. ctg_start start position of variant in contig
23. ctg_end end position of variant in contig
24. ctg_nt nucleotide(s) in contig at variant position
25. smtls_total_depth total read depth at variant start position in contig, reported by mpileup
26. smtls_alt_nt alt nucleotides on contig, reported by mpileup
27. smtls_alt_depth alt depth on contig, reported by mpileup
28. var_description description of variant from reference metadata
29. free_text other free text about reference sequence, from reference metadata

If a gene is assembled with no variants, there is one row for that gene, with information only in columns 1-11 (and possibly 29); each of the remaining columns contains a dot. Otherwise, there is one row per variant. For a short summary of the genes present and their flags, run:

cut -f1,3 report.tsv | uniq
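The same summary can be built in Python, together with a count of variant rows per gene. This is an illustrative sketch only: summarise_report is a hypothetical helper, columns are addressed by the positions listed in the table above, and lines starting with # are skipped on the assumption that they are headers.

```python
def summarise_report(path):
    """Map each (ref_name, flag) pair to its number of variant rows."""
    summary = {}
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and lines starting with '#' (assumed header convention)
            if not line.strip() or line.startswith("#"):
                continue
            row = line.rstrip("\n").split("\t")
            key = (row[0], row[2])  # columns 1 (ref_name) and 3 (flag)
            # Columns 12-28 are all dots when the row describes no variant
            has_variant = any(field != "." for field in row[11:28])
            summary[key] = summary.get(key, 0) + int(has_variant)
    return summary
```

A gene assembled with no variants maps to a count of 0. Iterating over the returned dictionary gives the genes in the order they first appear in the file.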

Other files

The other files written to the output directory are as follows.

  • assembled_seqs.fa.gz. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept.

  • log.clusters.gz. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.

  • version_info.txt. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
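As a quick check of what was assembled, the gzipped FASTA file can be read with the Python standard library. The sketch below uses a hypothetical helper, fasta_lengths, and takes the first whitespace-delimited word of each header line as the sequence name. The exact naming scheme ARIBA uses for these sequences is not described here, so the names are treated as opaque strings.

```python
import gzip


def fasta_lengths(path):
    """Return {sequence_name: length} for a gzipped FASTA file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # New record: the name is the first word after '>'
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence line: add its length to the current record
                lengths[name] += len(line)
    return lengths
```

For example, fasta_lengths("output_dir/assembled_seqs.fa.gz") gives the length of each assembled sequence. Since only the parts of the contigs that match the reference sequences are kept in this file, these lengths describe the matching parts rather than the full contigs.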
