RFSC is a Reference-Free Sequence Classification Tool that uses machine learning classifiers and relies on an ensemble of experts to provide efficient classification in metagenomic contexts.
A laptop computer running Ubuntu Linux (for example, 18.04 LTS or higher) with GCC (https://gcc.gnu.org), Conda (https://docs.conda.io) and CMake (https://cmake.org) installed. The hardware must have at least 32 GB of RAM and an 800 GB disk. If the database is not rebuilt, only about 10 GB of disk space is needed. Furthermore, for the installation to work correctly, Docker and Docker Compose must be installed on the system (https://docs.docker.com/engine/install/ubuntu/).
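The requirements above can be checked with a short script before installing. This is a minimal sketch and not part of RFSC: the function names (`missing_tools`, `ram_gb`, `free_disk_gb`, `preflight`) are hypothetical, it assumes a Linux system that exposes `/proc/meminfo`, and it assumes Docker Compose is available as `docker-compose`, as in the commands used in this README.

```shell
# Hypothetical pre-flight check; not shipped with RFSC.

# Print the name of every command in "$@" that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Total RAM in whole GB, read from /proc/meminfo (Linux only).
# MemTotal is usually slightly below the installed RAM, so compare loosely.
ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in whole GB on the filesystem that holds the given directory.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Report what was found next to what the README asks for; never aborts.
preflight() {
  local dir missing
  dir=${1:-.}
  missing=$(missing_tools gcc conda cmake docker docker-compose)
  if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
  else
    echo "All required tools found."
  fi
  if [ -r /proc/meminfo ]; then
    echo "RAM: $(ram_gb) GB (at least 32 GB required)"
  fi
  echo "Free disk in $dir: $(free_disk_gb "$dir") GB (800 GB to rebuild the databases, about 10 GB otherwise)"
  return 0
}

preflight .
```

The script only reports values and leaves the decision to the user, since the disk requirement depends on whether the databases will be rebuilt.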
git clone https://github.com/cobilab/RFSC
cd RFSC
chmod +x RFSC.sh
chmod +777 src/*.sh
docker-compose build
docker-compose up -d && docker exec -it rfsc bash && docker-compose down
./RFSC.sh --install #install tools
./RFSC.sh --build-ref-virus --build-ref-bacteria --build-ref-archaea --build-ref-protozoa \
  --build-ref-fungi --build-ref-plant --build-ref-mitochondrial --build-ref-plastid
or
./RFSC.sh -dviral -dbact -darch -dprot -dfung -dplan -dmito -dplas
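Because rebuilding all reference databases takes a lot of time and disk space, it can help to build them one at a time so that a failure is easy to locate. The following is a minimal sketch under two assumptions: that each `--build-ref-*` flag can also be passed to `RFSC.sh` on its own (the README only shows them combined), and that the script is run from the repository root. `DRY_RUN` and `build_refs` are hypothetical names; by default the commands are only printed.

```shell
# Hypothetical wrapper; DRY_RUN and build_refs are illustrative names.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

build_refs() {
  local flag
  for flag in --build-ref-virus --build-ref-bacteria --build-ref-archaea \
              --build-ref-protozoa --build-ref-fungi --build-ref-plant \
              --build-ref-mitochondrial --build-ref-plastid; do
    if [ "$DRY_RUN" = 1 ]; then
      echo "./RFSC.sh $flag"
    else
      echo "Building with $flag ..."
      # Stop at the first database that fails to build.
      ./RFSC.sh "$flag" || { echo "Failed: $flag" >&2; return 1; }
    fi
  done
}

build_refs
```

Setting `DRY_RUN=0` before running the script executes the builds instead of printing them.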
Obtain the classification reports for KNN, GNB and XGBoost:
./RFSC.sh -runAll #classification report table
./RFSC.sh -runAll F1Score # Weighted-averaged F1-score
./RFSC.sh -runAll Accuracy # Average Accuracy
./RFSC.sh -mget
./RFSC.sh -cfem
./RFSC.sh -cclm
./RFSC.sh -sget
./RFSC.sh -cfes
To classify the synthetic hybrid sequences and obtain the classification reports for KNN, GNB and XGBoost, run:
./RFSC.sh -ccls
Download the Kraken2 database from: https://benlangmead.github.io/aws-indexes/k2
To reproduce the same results, use the Standard database containing "archaea, bacteria, viral, plasmid, human1, UniVec_Core", created on 5/17/2021 (38.6 GB).
./RFSC.sh -ckra
✨ Generate a synthetic sequence and then perform a Reference-Free Reconstruction of it:
./RFSC.sh --clean y
./RFSC.sh --threads 8 --gen-adapters
./RFSC.sh --efetch-fasta 155971 Input_Data/EntrezGenomes
./RFSC.sh --efetch-fasta EF491856.1 Input_Data/EntrezGenomes
./RFSC.sh --efetch-fasta MT682520 Input_Data/EntrezGenomes
./RFSC.sh -synt Input_Data/EntrezGenomes/155971.fna Input_Data/EntrezGenomes/EF491856.1.fna Input_Data/EntrezGenomes/MT682520.fna
./RFSC.sh -trim TT PE --run-de-novo
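The commands above can be chained in a small script so that the accession list is written only once. This is a minimal sketch, not part of RFSC: `run`, `synthetic_pipeline`, `DRY_RUN`, `OUT_DIR` and `ACCESSIONS` are hypothetical names, and the sketch assumes, based on the commands above, that `--efetch-fasta` writes each genome to `<directory>/<accession>.fna`. By default the commands are only printed.

```shell
# Hypothetical helper; all names below are illustrative.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}
OUT_DIR=Input_Data/EntrezGenomes
ACCESSIONS="155971 EF491856.1 MT682520"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

synthetic_pipeline() {
  local acc fasta_files=""
  run ./RFSC.sh --clean y || return 1
  run ./RFSC.sh --threads 8 --gen-adapters || return 1
  for acc in $ACCESSIONS; do
    run ./RFSC.sh --efetch-fasta "$acc" "$OUT_DIR" || return 1
    # Assumed output path: <OUT_DIR>/<accession>.fna
    fasta_files="$fasta_files $OUT_DIR/$acc.fna"
  done
  # Left unquoted on purpose so each path becomes a separate argument.
  run ./RFSC.sh -synt $fasta_files || return 1
  run ./RFSC.sh -trim TT PE --run-de-novo
}

synthetic_pipeline
```

Each step returns early on failure, so a failed download does not lead to a synthetic sequence built from incomplete input.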
(If the reference databases have already been built and the Reference-Free Reconstruction stage has finished)
./RFSC.sh --threads 8 --set-len-cov 100 3 --set-threshold-max-min 70 1 --run-falcon SO Viral
./RFSC.sh --threads 8 --efetch-fasta 155971 RefFree
./RFSC.sh --run-xgboost
Tool | URL |
---|---|
AC | https://github.com/cobilab/ac |
Blastn | https://blast.ncbi.nlm.nih.gov/Blast.cgi |
Cryfa | https://github.com/cobilab/cryfa |
Entrez | https://www.ncbi.nlm.nih.gov/genome |
FALCON-meta | https://github.com/cobilab/falcon |
FASTP | https://github.com/OpenGene/fastp |
GeCo3 | https://github.com/cobilab/geco3 |
GTO | https://cobilab.github.io/gto/ |
metaSPAdes | https://cab.spbu.ru/software/meta-spades/ |
ORFfinder | https://www.ncbi.nlm.nih.gov/orffinder/ |
ORFM | https://github.com/wwood/OrfM |
SoD | https://github.com/pratas/SoD.git |
Trimmomatic | http://www.usadellab.org/cms/?page=trimmomatic |
GNU GPL
✨Developed to make a change!✨