Task: run
This runs the main ARIBA local assembly pipeline.
Assuming `ariba prepareref` has been run, with the output directory called `ref`, run the pipeline with:
ariba run ref reads_1.fq reads_2.fq output_dir
where `reads_1.fq` and `reads_2.fq` are the names of the forward and reverse paired reads files. The reads files can be in any format that is compatible with minimap (in particular, gzipped).
Important: ARIBA assumes that read N in the file `reads_1.fq` is the mate of read N in the file `reads_2.fq`.
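Because the pairing is positional, a quick sanity check before running the pipeline is to confirm that both files hold the same number of reads. The following is a minimal sketch, not part of ARIBA: the function names `count_fastq_reads` and `check_paired_counts` are hypothetical, and it assumes the common FASTQ layout of four lines per record. Equal counts are necessary but not sufficient, since they do not prove the reads are in the same order.

```shell
# Hypothetical helper (not part of ARIBA): count reads in a plain or gzipped FASTQ file.
# Assumes four lines per FASTQ record.
count_fastq_reads() {
    local lines
    # gzip -cdf decompresses gzipped input and copies plain text through unchanged
    lines=$(gzip -cdf -- "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print the shared read count and return 0 if both files hold the same
# number of reads; otherwise report the mismatch on stderr and return 1.
check_paired_counts() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    echo "$n1"
}
```

For example, `check_paired_counts reads_1.fq reads_2.fq && ariba run ref reads_1.fq reads_2.fq output_dir` would only start the pipeline when the counts agree.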
All output files will be put in a new directory called `output_dir`.
To see all the options, use `--help`:
ariba run --help
The most important file is `report.tsv`. This is a filtered version of the complete file `report.all.tsv`, which has at least one row per reference sequence that had reads mapped to it (see the task reportfilter for more details on the filtering).
The meaning of the columns in `report.tsv` is as follows.
Column | Description |
---|---|
1. ref_name | name of reference sequence chosen from cluster |
2. gene | 1=gene, 0=non-coding (same as metadata column 2) |
3. var_only | 1=variant only, 0=presence/absence (same as metadata column 3) |
4. flag | cluster flag |
5. reads | number of reads in this cluster |
6. cluster | name of cluster |
7. ref_len | length of reference sequence |
8. ref_base_assembled | number of reference nucleotides assembled by this contig |
9. pc_ident | %identity between reference sequence and contig |
10. ctg | name of contig matching reference |
11. ctg_len | length of contig |
12. ctg_cov | mean mapped read depth of this contig |
13. known_var | is this a known SNP from reference metadata? 1 or 0 |
14. var_type | The type of variant. Currently only SNP supported |
15. var_seq_type | Variant sequence type. if known_var=1, n or p for nucleotide or protein |
16. known_var_change | if known_var=1, the wild/variant change, eg I42L |
17. has_known_var | if known_var=1, 1 or 0 for whether or not the assembly has the variant |
18. ref_ctg_change | amino acid or nucleotide change between reference and contig, eg I42L |
19. ref_ctg_effect | effect of change between reference and contig, eg SYN, NONSYN (amino acid changes only) |
20. ref_start | start position of variant in reference |
21. ref_end | end position of variant in reference |
22. ref_nt | nucleotide(s) in reference at variant position |
23. ctg_start | start position of variant in contig |
24. ctg_end | end position of variant in contig |
25. ctg_nt | nucleotide(s) in contig at variant position |
26. smtls_total_depth | total read depth at variant start position in contig, reported by mpileup |
27. smtls_alt_nt | alt nucleotides on contig, reported by mpileup |
28. smtls_alt_depth | alt depth on contig, reported by mpileup |
29. var_description | description of variant from reference metadata |
30. free_text | other free text about reference sequence, from reference metadata |
If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1-12 (and possibly 30) and the remaining columns are dots. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:
cut -f1,4 report.tsv | uniq
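The same column numbers can be used to pull out other subsets of the report. As a hedged sketch, the hypothetical function `known_variants_present` below lists every known variant that the assembly carries. It assumes the file is tab-separated with the 30-column layout in the table above (1=ref_name, 6=cluster, 16=known_var_change, 17=has_known_var), and it skips lines starting with `#` on the assumption that they are header lines. If your version of ARIBA writes a different column order, adjust the numbers accordingly.

```shell
# Hypothetical helper (not part of ARIBA): print cluster, reference name and
# known change for each row where the assembly has the known variant.
# Column numbers follow the table above and are an assumption about the layout.
known_variants_present() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip assumed header lines
        $17 == "1" { print $6, $1, $16 }   # has_known_var is 1
    ' "$1"
}
```

Usage: `known_variants_present report.tsv`.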
The other files written to the output directory are as follows.
- `assembled_genes.fa.gz`. This is a gzipped FASTA file of assembled gene sequences. It does not contain non-coding sequences (those are in `assembled_seqs.fa.gz`), only genes. When comparing a local assembly to a gene, mismatches near the end of the gene can cause the alignment to be too short. ARIBA tries to extend the match by looking for start and stop codons. The extended sequences are in this file. The unextended sequences are in `assembled_seqs.fa.gz`.
- `assembled_seqs.fa.gz`. This is a gzipped FASTA file of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept.
- `assemblies.fa.gz`. This is a gzipped FASTA file of the assemblies. It contains the complete, unedited contigs.
- `log.clusters.gz`. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.
- `version_info.txt`. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
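Since the FASTA outputs are gzipped, they need to be decompressed on the fly when inspected from the command line. The sketch below counts the sequences in one of them by counting header lines. The function name `count_fasta_seqs` is hypothetical and not part of ARIBA.

```shell
# Hypothetical helper (not part of ARIBA): count sequences in a gzipped FASTA
# file by counting lines that start with '>'.
count_fasta_seqs() {
    gzip -cd -- "$1" | awk '/^>/ { n++ } END { print n + 0 }'
}
```

For example, `count_fasta_seqs output_dir/assembled_genes.fa.gz` prints the number of assembled gene sequences.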