The Recursive Binary Divisor clusters sequences using a density-based algorithm. For this, the script uses the mnhn-tree-tools (Haschka T. 2021) in a specific way. It first operates the count of kmers, then applies a pca on these counts. The projection is used as a support for dbscan: distance between each point is calculated; if enough points are within distance epsilon of each-others, the density threshold is reached and a cluster is found. The density parameter epsilon is reduced until exactly 2 clusters are found. The sequences forming each cluster are separated, the others are left out. The steps count/pca/projection/density-based-clustering are repeated on each precendently found clusters. The final result is a tree. Its leefs are the clusters that are visualised.
- Dependencies:
Make sure you have a GCC compiler version >= 4.9.2, pip and the R language installed.
For MNHN-Tree-Tools:
sudo apt-get install git build-essential libpng-dev libsdl2-dev liblapack-dev libopenmpi-dev libpocl-dev ocl-icd-opencl-dev pocl-opencl-icd
git clone https://github.com/Louis-MG/MNHN-Tree-Tools.git
cd MNHN-Tree-Tools
mkdir bin
make all
cd bin
# make MNHN-Tree-Tools available from any folder temporarily
export PATH=$PATH:$PWD
For RBD:
#if you dont have pip installed already (UNIX):
pip install regex shutils numpy subprocess argparse textwrap
For rbd_pca.r, open R in the console and paste:
install.packages('ggplot2', 'seqinr')
For rdb.py
:
usage: script.py [-h] --fasta_file FASTAFILE --epsilon EPSILON [--delta_epsilon DELTA] [--growth] [--min_points MINPOINTS] [--dim_pca DIMPCA] [--kmer_len KMER] [--threads THREADS] [--verbose] --output OUTPUT [--minsize MINSIZE]
GENERAL:
--fasta_file FASTAFILE, -f FASTAFILE Name of the fasta file containing the sequences to cluster during the different steps.
--threads THREADS, -n THREADS Number of threads used by mnhn-tree-tools for the different calculation steps (default: 4).
--verbose, -v Shows progression messages in standard output.
--output OUTPUT, -o OUTPUT Specifies output file name if needed.
--help, -h displays this help message and exits
PCA:
--kmer_len KMER, -k KMER Length of the kmers processed by mnhn-tree-tools from the fastafile for the different calculation steps. Default: 5
--dim_pca DIMPCA, -p DIMPCA Number of dimensions to use from the pca during the calculation steps (default: 7).
CLUSTERING:
--epsilon EPSILON, -e EPSILON Minium value of epsilon used for the different calculation steps. Float.
--delta_epsilon DELTA, -d DELTA Minium difference between each epsilon of the dichotomy: when difference is inferior to delta, loop stops (default: 0.001).
--growth, -g Gives directionnality of epsilon: increase or decrease. Default: 0.
--min_points MINPOINTS, -m MINPOINTS Minimun number of sequences to find to form a cluster during the clustering steps (default: 3).
--minsize MINSIZE, -s MINSIZE Specifies minimum size of a cluster to try to subcluster it. Default: 50.
For seq_hilight.py
:
Options:
-h, --help show this help message and exit
--seq SEQ, -s SEQ Input file with one sequence per line
--fasta FASTA, -f FASTA Input reference fasta file
--output OUTPUT, -o OUTPUT Name of the output file
--rbd RBD, -r RBD Input file which is the sequence_parameters.txt file from rbd output
e.g. :
python3 rbd.py -f test.fasta -e 0.5 -d 0.01 -o output_rbd -g
Rscript rbd_pca.r output_rbd
python3 seq_highlight.py -s seq_of_interest.txt -f test.fst -o seq_highlight.txt -r sequence_parameters.txt
The rbd program outputs 3 tabulated files and several folders. Each folder corresponds to a cluster; it contains: the fasta file containing the sequences, the kmer counts of the fasta, the eigen values file and the pca file obtained from the count file.
- The first tabulated text file is named cluster_parameters.txt, where parameters is the chain of carachters of the main arguments (kmer length, epsilon, delta epsilon, minpoints, dimpca). Its header and content is the following:
cluster_name | epsilon | father_size | son1_size | son2_size |
---|---|---|---|---|
cluster | 1.2 | 1000 | 400 | 500 |
cluster.1 | 1.5 | 400 | 200 | 100 |
cluster.1.1 | -1 | 200 | NONE | NONE |
cluster.1.2 | -5 | 100 | NONE | NONE |
Epsilon is either a positive float (the real epsilon) or a negative integer that indicates an error code: -1 signifies no or just one sub-cluster was found.
- The second tabulated file is sequence_parameters, where parameters is written as described above. It contains the names of the sequences and the respective name of the last cluster it belonged to. Exemple given:
sequences_names | cluster_name |
---|---|
sequenceXXXXXXX | cluster |
sequenceYYYYYYY | cluster.1.2.1 |
sequenceZZZZZZZ | cluster.2.1 |
sequenceAAAAAAA | cluster.1.0 |
- The third file is a summary of sequence_parameters, conveniently named sequence_summary.txt . It contains the number of sequences remaining in a cluster. In the case of a leaf, the number corresponds to the number of sequence in the leaf. In case of a branch, the number corresponds to the sequences fromn the custer that were not assigned to a sub-cluster; these are orphans and are denominated with a .0 added ot the cluster name.
Sequences left | cluster_name |
---|---|
240 | cluster.0 |
45 | cluster.1.0 |
30 | cluster.1.2 |
12 | cluster.2 |
- The R script yields 3 pca plots of the sequences for each parent cluster (clusters that are divided by rbd). Dimensions 1 and 2, 2 and 3, 1 and 3 are used. 3 additionnal plots are produced, showing final clusters in colors only. They are placed at the output folder's root.
- The seq_highlight.py script yields a tabulated file that contains the ids of the sequences of interest and their last cluster belonging assigned by rbd.
sequence_id | cluster_name |
---|---|
sequenceXXXXXXX | cluster |
sequenceUUUUUUU | cluster.1.1.2 |
sequenceOOOOOOO | cluster.1.1.1.1 |