Clustering
DIAMOND clusters protein sequences analogously to CD-HIT or UCLUST based on a user-defined clustering criterion.
It finds a set of representative sequences (also called centroids) and assigns each input sequence to the cluster
of one representative such that the clustering criterion against the representative is fulfilled. The clustering criterion
is defined by the sequence coverage of the local alignment as well as its sequence identity (see below). Note that
due to the heuristic nature of the cascaded clustering algorithm, these cutoff values guide the
computation, but their fulfillment is not guaranteed unless the recluster workflow is used (see below).
Basic command line example:
diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
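To make the clustering criterion concrete, the following is a minimal sketch of how a single member/representative alignment could be checked against the two cutoffs. It is not DIAMOND's implementation: the function name, its parameters, and the way coverage is computed (aligned span on the member divided by the member length, with 1-based inclusive coordinates) are assumptions made for illustration. The default thresholds mirror the documented defaults of --approx-id for diamond cluster and of --member-cover.

```python
def satisfies_criterion(member_start, member_end, member_length,
                        approx_id, min_approx_id=50.0, min_member_cover=80.0):
    """Return True if a member passes both cutoffs against its representative.

    Hypothetical helper: coordinates are assumed to be 1-based and inclusive,
    and approx_id is assumed to be the approximate identity in percent.
    """
    aligned = member_end - member_start + 1            # aligned span on the member
    member_cover = 100.0 * aligned / member_length     # coverage of the member in %
    return approx_id >= min_approx_id and member_cover >= min_member_cover


# 180 of 200 residues aligned (90% coverage), identity above the cutoff.
print(satisfies_criterion(11, 190, 200, approx_id=62.5))
# Only 100 of 200 residues aligned (50% coverage), so the check fails.
print(satisfies_criterion(1, 100, 200, approx_id=95.0))
```

Only the coverage of the member is tested here, which matches the unidirectional coverage criterion; the representative's coverage is deliberately ignored.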
When using the clustering feature, please cite:
- Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373; doi: https://doi.org/10.1101/2023.01.24.525373
The cluster workflow clusters an input database of protein sequences. It accepts the following options.
- --database/-d
  The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format.
- --out/-o
  Output file. This is a 2-column tabular file with the representative accession in the first column and the member sequence accession in the second column. More elaborate output can be retrieved using the realign workflow.
- --header
  Enable a header line in the output file.
- --memory-limit/-M #
  Set a memory limit for the diamond process (for example: -M 64G). This is not a hard upper limit and may still be exceeded in certain cases. Decrease this number if the tool fails due to running out of memory. Higher values substantially improve performance, so it is strongly recommended to always set this option, but they also increase the use of temporary disk space. Note that this option affects the algorithm and therefore the results, since clustering is a heuristic procedure with no unique solution.
- --approx-id #
  The identity cutoff for the clustering (in %). Note that for performance reasons, the setting refers to the approximate sequence identity derived by linear regression from the bitscore, not the actual number of identities in the alignment. The default value is 50% when running diamond cluster and 0% when running diamond deepclust.
- --member-cover #
  The minimum coverage of the cluster member sequence by the representative (in %). This is a unidirectional coverage, i.e. a minimum coverage of the representative is not required. The default is 80%.
- --no-block-size-limit
  Do not limit the block size to recommended maximums.
- --cluster-steps
  Set the sequence of clustering rounds for cascaded clustering as a space-separated list. Permitted keywords are the sensitivity switches of the alignment workflow (e.g. sensitive). The suffix _lin can be appended to trigger linearization of the search (e.g. faster_lin fast default sensitive very-sensitive). When missing, this parameter is automatically chosen based on the --approx-id parameter.
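Since the output file lists one representative/member pair per line, it usually has to be grouped before further use. The following sketch shows one way to do that. It assumes the columns are tab-separated; the function name read_clusters, its has_header switch, and the sample accessions are hypothetical and not part of DIAMOND.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, has_header=False):
    """Group a 2-column (representative, member) table by representative."""
    clusters = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")   # assumption: tab-separated columns
    if has_header:
        next(reader, None)                        # skip the header line if one was requested
    for row in reader:
        if len(row) < 2:
            continue                              # ignore blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up sample data standing in for an output file.
example = io.StringIO("seqA\tseqA\nseqA\tseqB\nseqC\tseqC\n")
clusters = read_clusters(example)
for representative, members in clusters.items():
    print(representative, len(members), ",".join(members))
```

For a real file, the StringIO object would be replaced by an open file handle, for example `open("OUTPUT_FILE", newline="")`.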
The realign workflow takes a clustering computed by the cluster workflow as input and computes alignments of
all sequences in the original database against their assigned representative sequences.
- --clusters
  The clustering in 2-column tabular format.
- --outfmt/-f
  Set the output format. Only tabular format is supported for this workflow. The default corresponds to the format -f 6 qseqid sseqid approx_pident qstart qend sstart send evalue bitscore of the alignment workflow, where the query and subject correspond to the representative and the cluster member sequence respectively.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M.
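The default column layout described above can be read into named fields as in the following sketch. It assumes tab-separated columns and a file written without --header; the class RealignHit, the function parse_realign, and the sample record are hypothetical names and data chosen for illustration. The field names reflect that the query is the representative and the subject is the cluster member.

```python
import io
from typing import NamedTuple


class RealignHit(NamedTuple):
    representative: str    # qseqid
    member: str            # sseqid
    approx_pident: float
    rep_start: int         # qstart
    rep_end: int           # qend
    member_start: int      # sstart
    member_end: int        # send
    evalue: float
    bitscore: float


def parse_realign(handle):
    """Parse lines in the default 9-column layout into RealignHit records."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip empty lines
        f = line.split("\t")              # assumption: tab-separated columns
        hits.append(RealignHit(f[0], f[1], float(f[2]),
                               int(f[3]), int(f[4]), int(f[5]), int(f[6]),
                               float(f[7]), float(f[8])))
    return hits


# Made-up record standing in for one line of output.
sample = io.StringIO("repA\tmemB\t62.5\t1\t180\t11\t190\t1e-50\t250.0\n")
hits = parse_realign(sample)
print(hits[0].representative, hits[0].member, hits[0].approx_pident)
```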
The recluster workflow fixes errors in a given clustering where a cluster member sequence does not satisfy the clustering criterion against its representative. Such errors may arise from the heuristic nature of the cascaded clustering algorithm, specifically from the merging of clusters based on alignments of their representative sequences.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
For a given clustering, this workflow attempts to reassign each non-representative sequence to the closest representative sequence that satisfies the clustering criterion, where closeness is measured by the e-value of the local alignment.
These parameters of the cluster workflow apply accordingly: --database/-d, --out/-o, --header, --memory-limit/-M, --approx-id, --no-block-size-limit, --member-cover.
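The reassignment rule can be illustrated with a small sketch: among all representatives whose alignment to a member satisfies the cutoffs, the one with the lowest e-value is chosen. This is only a conceptual illustration and not DIAMOND's implementation. The function reassign, the tuple layout of its input, and the sample values are hypothetical, and members without any qualifying representative are simply omitted from the result.

```python
def reassign(candidates, min_approx_id=50.0, min_member_cover=80.0):
    """Pick, per member, the qualifying representative with the lowest e-value.

    candidates: iterable of (member, representative, approx_id, member_cover, evalue).
    """
    best = {}
    for member, representative, approx_id, member_cover, evalue in candidates:
        if approx_id < min_approx_id or member_cover < min_member_cover:
            continue                                  # clustering criterion not satisfied
        if member not in best or evalue < best[member][1]:
            best[member] = (representative, evalue)   # closer representative found
    return {member: rep for member, (rep, _) in best.items()}


# Made-up candidate alignments.
candidates = [
    ("m1", "r1", 60.0, 90.0, 1e-20),
    ("m1", "r2", 70.0, 95.0, 1e-40),   # lowest e-value among qualifying hits
    ("m1", "r3", 99.0, 50.0, 1e-80),   # rejected: member coverage below the cutoff
    ("m2", "r1", 40.0, 90.0, 1e-10),   # rejected: identity below the cutoff
]
new_assignment = reassign(candidates)
print(new_assignment)
```

Note that r3 has the best e-value for m1 but is skipped because the criterion is checked before the e-values are compared.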
This workflow computes a greedy vertex cover clustering based on alignment input.
- --edges
  Input file containing alignments/graph edges for clustering. By default, a TSV file with 5 columns is expected: query target query-cover target-cover edge-weight.
- --database/-d
  A TSV file whose first column needs to be a list of all accessions that occur in the edges file as either query or target. This must not be a sequence database file.
- --edge-format (triplet)
  Enable triplet edge format: query target edge-weight. The semantics are a unidirectional representation of the query by the target.
- --centroid-out
  Output file for the list of representatives.
- --out/-o
  The output clustering in 2-column TSV format. This file does not group clusters together.
These parameters of the cluster workflow apply accordingly: --header, --member-cover.
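The general idea of a greedy vertex cover clustering can be sketched as follows: repeatedly pick the not yet covered sequence that can represent the largest number of uncovered sequences, make it a representative, and assign those sequences to it. This is a generic textbook-style greedy procedure under simplifying assumptions, not DIAMOND's implementation: edge weights and coverage cutoffs are ignored, ties are broken alphabetically, and the function name and sample data are hypothetical. Edges follow the unidirectional semantics described above, i.e. a pair (query, target) means the target can represent the query.

```python
def greedy_vertex_cover(nodes, edges):
    """Return a dict mapping every accession to its representative.

    nodes: iterable of accessions.
    edges: iterable of (query, target) pairs; the target can represent the query.
    """
    covers = {n: {n} for n in nodes}              # every node can represent itself
    for query, target in edges:
        covers.setdefault(query, {query})
        covers.setdefault(target, {target}).add(query)

    uncovered = set(covers)
    assignment = {}
    while uncovered:
        # Pick the uncovered node that represents the most uncovered nodes;
        # iterating in sorted order makes max() break ties alphabetically.
        best = max(sorted(uncovered), key=lambda n: len(covers[n] & uncovered))
        for member in covers[best] & uncovered:
            assignment[member] = best
        uncovered -= covers[best]
    return assignment


# Made-up example graph.
nodes = ["a", "b", "c", "d", "e"]
edges = [("b", "a"), ("c", "a"), ("d", "c"), ("e", "d")]
assignment = greedy_vertex_cover(nodes, edges)
for member in sorted(assignment):
    print(assignment[member], member, sep="\t")   # representative, member
```

The quadratic selection loop keeps the sketch short; a large graph would call for a priority queue or bucket structure instead.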
Many (but not all) options of the alignment workflow can also be used for the clustering workflows, e.g. --threads/-p, --evalue/-e.