-
Notifications
You must be signed in to change notification settings - Fork 55
Task: prepareref
This prepares reference data for use with the task run
.
The input can be either provided by the user, or downloaded
using the task [[getref
|Task:-getref]].
prepareref
will do the following:
-
Sanity check the input fasta file(s) and metadata file(s). Inconsistent/bad data will be removed and reported in log files. If this happens, a warning will be written to stderr. It is important to check any removed sequences and/or variants. If you are missing a gene from your final output after running ARIBA, please check that it was not removed by
prepareref
. -
Run cd-hit on the sequences that pass stage 1. An independent run of cd-hit is carried out for each the four groups of reference sequences: non-coding and variant only, non-coding and presence absence, gene and vairant only, gene and presence/absence.
The output directory it makes is then used as input to
the task run
.
Assuming ariba getref
was run with the output files prefix
getref_out
, then run prepareref
like this:
ariba prepareref -f getref_out.fa -m getref_out.tsv prepareref.out
to make a new directory called prepareref.out
, which can be
used as input to the task [[run
|Task:-run]]. For example, to
get and use ARG-ANNOT data:
ariba getref argannot getref.out
ariba prepareref -f getref.out.fa -m getref.out.tsv prepareref.out
and the new directory preareref.out
can be used as input to ariba run
.
If you have a file of genes called in.fasta
, the usage is:
ariba prepareref --all_coding yes -f in.fasta out_dir
or if all the sequences are non-coding:
ariba prepareref --all_coding no -f in.fasta out_dir
If you have a mix of genes and non-coding sequences, or associated variants of interest, this information must be supplied in a metadata file. See the next section.
The usage is:
ariba prepareref -f file.fa -m metadata.tsv out_dir
where the reference sequences are in the FASTA file file.fa
, and
their associated metadata is in metadata.tsv
. Every sequence in
the FASTA file must have at least one corresponding entry in the
metadata file.
Reference sequences can either be non-coding or genes (but must always be nucleotide sequences). Gene sequences will have extra analysis performed, for example looking for non-synonymous amino acid changes.
Additionally, reference sequences can be either of:
-
Presence/absence sequences. ARIBA will look for these sequences in the input reads and report any that it finds, and also any variants between these sequences and the input reads.
-
Variants only sequences. These should have known variant details specified in the metadata file (see below). ARIBA reports only when it finds at least one of the given variants in each of these these sequences. If a sample has one of these sequences, but does not have one of the supplied variants, then it is not reported. If you supply a variants only sequence, but no variant, then the sequence will be removed during sanity checks.
In case you have sequences and metadata split over multiple files, the
options -f
and -m
can be specified more than once. For example:
ariba prepareref -f in.1.fa -f in.2.fa -f in.3.fa \
-m meta.1.tsv -m meta.2.tsv output_directory
Each sequence must have associated metadata in one (or more) of the
metadata files - it does not matter which file. For example, a sequence
in file in.1.fa
must have metadata in meta.1.tsv
or
meta.2.tsv
(or in both) - it does not have to be in meta.1.tsv
.
The sequence names must be unique across all input files. Sequence names in the FASTA file(s) must match exactly to the names in the metadata.
A metadata file (01.filter.check_metadata.tsv
) is comprised of reference sequences that passed the sanity check at the stage 1 of prepareref
. The file must be tab-delimited and consist of six columns:
-
Sequence name. This must match the name used in the FASTA file.
-
A
0
or1
to indicate whether or not this is a gene. Use0
for a non-coding sequence and1
for a gene. -
A
0
or1
to indicate if this is a presence/absence sequence or a variants only sequence. Use0
for presence/absence and1
for variants only. -
If this line is describing a variant, put it here in the form <wild type><position><variant type>, for example K10L. Put a "." if this line is not describing a variant. The reference sequence must have either the wild type or variant type at the given position, otherwise the variant will be removed during sanity checks. If the sequence is a gene (
1
in column 2), this is taken to be an amino acid change. If the sequence is non-coding (0
in column 2), this is taken to be a nucleotide change. -
Variants can be put into groups. This is the group name for this variant. To not put it into a group, use ".". This can be useful when summarising across several runs. Suppose there are two alleles for a gene, both of which confer the same resistance if they have a particular SNP. Putting those SNPs in the same group allows ARIBA to track them as a group and simply report whether each sample has any variant from the group at the summary stage.
-
Free text that can be used to describe the sequence and/or the variant. Put a "." if you do not want to provide a description.
More than one line can be used for each reference sequence, for when there are multiple variants or descriptions.
Here are some example lines:
-
To just describe the sequence, but no variant:
sequence1 1 0 . . this is a description of sequence1
ie this is a presence/absence gene and we are simply giving a description, not a variant.
-
A nucleotide variant:
sequence2 0 0 A42G group1 description of variant
ie this is a non-coding sequence where the wild type is an A at position 42 and we are interested if there is a G at that position. We have put the variant into a group called "group1".
-
An amino acid variant in a gene
sequence3 1 1 I10L . .
ie the wild type is an I at position 10 of the amino acid sequence. The 1 in column 3 indicates that it is a variants-only gene, which means we are only interested in the given amino acid change. In this example, no description and no group have been given.
In addition to the metadata file, prepareref
produces a log file (01.filter.check_metadata.log
) to list the removed reference sequences (see output file 01.filter.check_genes.log
for their exact sequences).
When the ARIBA pipeline is run, it must give each cluster a name. This is the method used to name the clusters:
-
Gather everything up to the first dot (.) in the name of each sequence in the cluster.
-
If everything in (1) is the same, use that name. Otherwise, look for a common prefix for everything in (1). If there is a common prefix, then use it, but add "-" on the end of the common prefix.
-
If there was no common prefix, take the most common string from (1), and add "+" on the end to indicate there were other names.
However, cluster names must be at least 3 characters long. If everything above fails, then the cluster is simply called "cluster". To avoid non-unique cluster names, -1, -2, -3 etc is appended to the cluster names where necessary.
When downloading a supported reference using getref
, we attempt
to name the sequences in such a way as to result in "sensible" cluster names.
This usually works, however, there are always a few sequence names where
we cannot automatically determine a sensible name.
The following options affect the clustering using CD-HIT:
-
--no_cdhit
. Using this will prevent clustering. Instead, each input sequence is put into its own "cluster". Incompatible with--cdhit_clusters
. -
--cdhit_clusters FILENAME
. Use this to specify how the sequences should be clustered, instead of using CD-HIT. The file format is one cluster per line. Sequence names separated by whitespace. Incompatible with--no_cdhit
. -
--cdhit_min_id FLOAT
. Sequence identity threshold (cd-hit option -c). Default: 0.9. -
--cdhit_min_length FLOAT
. Length difference cutoff (cd-hit option -s). Default: 0. -
--cdhit_max_memory INT
. Memory limit in MB (cd-hit option -M). Use 0 for unlimited.
Miscellaneous options:
-
--min_gene_length INT
. Minimum allowed length in nucleotides of reference genes. Genes shorter than this are removed. Default: 6. -
--max_gene_length INT
. Maximum allowed length in nucleotides of reference genes. Genes longer than this are removed. Default: 10000. -
--min_noncoding_length INT
. Minimum allowed length in nucleotides of non-coding sequences. Sequences shorter than this are removed. Default: 6. -
--max_noncoding_length INT
. Maximum allowed length in nucleotides of non-coding sequences. Sequences longer than this are removed. Default: 20000. -
--genetic_code INT
. Number of genetic code to use. Currently supported 1,4,11. Default: 11. -
--force
. Overwrite output directory, if it already exists. Without this option, output directory must not already exist. -
--threads INT
. Number of threads (currently only applies to cdhit). Default: 1. -
--verbose
. Be verbose. Silent by default, unless there are errors.
If you get the following error:
Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!
....
Fatal Error:
in diag_test_aapn_est, MAX_DIAG reached
Program halted !!
Then please see issue #278 for the workaround.