martinghunt edited this page Aug 3, 2016 · 28 revisions

Task: run

This runs the main ARIBA local assembly pipeline.

Assuming ariba prepareref has been run, with the output directory called ref, run the pipeline with:

ariba run ref reads_1.fq reads_2.fq output_dir

where reads_1.fq and reads_2.fq are the names of the forward and reverse paired reads files. The reads files can be in any format that is compatible with minimap (in particular, they can be gzipped).

Important: ARIBA assumes that read N in the file reads_1.fq is the mate of read N in the file reads_2.fq. All output files will be put in a new directory called output_dir.
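
Because ARIBA relies on the two files listing mates in the same order, it can be worth checking this before a run. The following is a minimal sketch and is not part of ARIBA: the function names read_names and first_mismatch are hypothetical, and it assumes four-line FASTQ records whose mate names are identical apart from an optional trailing /1 or /2.

```python
import gzip
import itertools


def read_names(path):
    """Yield read names from a FASTQ file (plain or gzipped), assuming 4-line records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if not line.strip():
                continue  # tolerate a trailing blank line
            if line_number % 4 == 0:
                # Header line: drop '@', any comment after whitespace, and a /1 or /2 suffix.
                name = line[1:].split()[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def first_mismatch(reads_1, reads_2):
    """Return the 1-based index of the first read pair whose names differ, or None."""
    pairs = itertools.zip_longest(read_names(reads_1), read_names(reads_2))
    for index, (name_1, name_2) in enumerate(pairs, start=1):
        # zip_longest pads the shorter file with None, so unequal read counts also show up here.
        if name_1 != name_2:
            return index
    return None
```

Calling first_mismatch("reads_1.fq", "reads_2.fq") returns None when every pair of names agrees, and otherwise the position of the first pair that does not.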

To see all the options, use --help:

ariba run --help

Report file

The most important file is report.tsv. This is a filtered version of the complete file report.all.tsv, which has at least one row per reference sequence that had reads mapped to it (see the task reportfilter for more details on the filtering).

The meaning of the columns in report.tsv is as follows.

Column Description
1. ref_name name of reference sequence chosen from cluster
2. gene 1=gene, 0=non-coding (same as metadata column 2)
3. var_only 1=variant only, 0=presence/absence (same as metadata column 3)
4. flag cluster flag
5. reads number of reads in this cluster
6. cluster name of cluster
7. ref_len length of reference sequence
8. ref_base_assembled number of reference nucleotides assembled by this contig
9. pc_ident %identity between reference sequence and contig
10. ctg name of contig matching reference
11. ctg_len length of contig
12. ctg_cov mean mapped read depth of this contig
13. known_var is this a known SNP from reference metadata? 1 or 0
14. var_type The type of variant. Currently only SNP supported
15. var_seq_type Variant sequence type. if known_var=1, n or p for nucleotide or protein
16. known_var_change if known_var=1, the wild/variant change, eg I42L
17. has_known_var if known_var=1, 1 or 0 for whether or not the assembly has the variant
18. ref_ctg_change amino acid or nucleotide change between reference and contig, eg I42L
19. ref_ctg_effect effect of change between reference and contig, eg SYN, NONSYN (amino acid changes only)
20. ref_start start position of variant in reference
21. ref_end end position of variant in reference
22. ref_nt nucleotide(s) in reference at variant position
23. ctg_start start position of variant in contig
24. ctg_end end position of variant in contig
25. ctg_nt nucleotide(s) in contig at variant position
26. smtls_total_depth total read depth at variant start position in contig, reported by mpileup
27. smtls_alt_nt alt nucleotides on contig, reported by mpileup
28. smtls_alt_depth alt depth on contig, reported by mpileup
29. var_description description of variant from reference metadata
30. free_text other free text about reference sequence, from reference metadata

If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1-12 (and possibly 30); each of the remaining columns contains a dot. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:

cut -f1,4 report.tsv | uniq
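
For anything beyond that one-liner, the report can be loaded by column name. The following is a minimal sketch and is not part of ARIBA: the function names load_report, genes_and_flags and has_variant are hypothetical, the column names are copied from the table above, and it assumes that any line starting with '#' is a header or comment rather than data.

```python
import csv

# Column names in the order given in the table above.
COLUMNS = [
    "ref_name", "gene", "var_only", "flag", "reads", "cluster", "ref_len",
    "ref_base_assembled", "pc_ident", "ctg", "ctg_len", "ctg_cov", "known_var",
    "var_type", "var_seq_type", "known_var_change", "has_known_var",
    "ref_ctg_change", "ref_ctg_effect", "ref_start", "ref_end", "ref_nt",
    "ctg_start", "ctg_end", "ctg_nt", "smtls_total_depth", "smtls_alt_nt",
    "smtls_alt_depth", "var_description", "free_text",
]


def load_report(path):
    """Return the report rows as dicts keyed by column name."""
    rows = []
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Assumption: a line starting with '#' is a header or comment, not data.
            if not fields or fields[0].startswith("#"):
                continue
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def genes_and_flags(rows):
    """Unique (ref_name, flag) pairs, in the order they first appear."""
    return list(dict.fromkeys((row["ref_name"], row["flag"]) for row in rows))


def has_variant(row):
    """True if any of columns 13-29 holds something other than a dot."""
    return any(row.get(name, ".") != "." for name in COLUMNS[12:29])
```

genes_and_flags(load_report("report.tsv")) gives the same pairs as the cut command, except that repeated pairs are dropped even when they are not on adjacent lines, whereas uniq only collapses adjacent duplicates. has_variant separates the one-row-per-variant lines from the rows for genes assembled with no variants.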

Other files

The other files written to the output directory are as follows.

  • assembled_genes.fa.gz. This is a gzipped FASTA file of assembled gene sequences. It does not contain non-coding sequences (those are in assembled_seqs.fa.gz), only genes. When comparing a local assembly to a gene, mismatches near the end of the gene can cause the alignment to be too short. ARIBA tries to extend the match by looking for start and stop codons. The extended sequences are in this file. The unextended sequences are in assembled_seqs.fa.gz.

  • assembled_seqs.fa.gz. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept.

  • assemblies.fa.gz. This is a gzipped FASTA file of the assemblies. It contains the complete, unedited, contigs.

  • log.clusters.gz. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.

  • version_info.txt. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
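
The gzipped FASTA files listed above can be inspected without decompressing them first. The following is a minimal sketch using only the Python standard library; the function name fasta_lengths is hypothetical, and it takes the first word of each header line as the sequence name.

```python
import gzip


def fasta_lengths(path):
    """Return {sequence name: length} for a gzipped FASTA file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                name = (line[1:].split() or [""])[0]
                lengths[name] = 0
            elif name is not None:
                # Sequence lines may be wrapped, so add up their lengths.
                lengths[name] += len(line)
    return lengths
```

For example, fasta_lengths("output_dir/assembled_genes.fa.gz") lists each assembled gene with its length. The same function can be pointed at assembled_seqs.fa.gz or assemblies.fa.gz.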
