CasCollect is a pipeline for the detection of Cas genes and CRISPR arrays from unassembled raw illumina sequencing reads. This pipeline relies on python and perl as well as requiring the installation of several pieces of software within the users path:
- BBTools
- Seqtk
- FragGeneScan
- HMMER
- VSEARCH
- SPAdes
- CRISPRCasFinder
Before running, CasCollect will check for the necessary software in the users path and denote any missing software. Alternatively, the Check.py script can be run to check for the necessary software in the users path with any missing software automatically downloaded and extracted. The user is then responsible for installing each piece of software and ensuring it is in their path.
If useful, please cite: Podlevsky JD, Hudson CM, Jerilyn A. Timlin JA, Williams KP. 2020. CasCollect: Targeted Assembly of CRISPR-Associated Operons from High-throughput Sequencing Data. NAR Genom Bioinform 2: lqaa063.
-h, --help show this help message and exit
--trim runs quality trim, remove adapters, merge overlapping
reads
--clean runs quality trim, remove adapters and reference
matching reads [use: -ref file.fasta for user defined
reference file], merge overlapping reads
-ref file.fasta user defined contaminant reference genome/sequences
[default: human genome]
-fwd file.fastq fastq file of forward reads [use with -rev]
-rev file.fastq fastq file of reverse reads [use with -fwd]
-single file.fastq fastq file of unpaired reads [use in place of -fwd and
-rev]
--noprot disable search for Cas or other protein genes
withinunassembled reads for assembly [use with -hmm for
user-defined hmm file]
-hmm hmm dir directory containing user defined protein hmm file(s)
for protein search within unassembled reads [default:
Cas protein hmm]
--nucl search for CRISPR or other nucleotide sequences
withinunassembled reads for assembly [must use with
-query for user-defined fasta file]
-query file.fasta user defined fasta DNA file of sequences
--seed user-defined set of seeds for read subset expansion
-define file.fasta user-defined fasta DNA file of seed
-cycle number number of cycles for seed expansion with vsearch
-match number percent match for vsearch
--noassembly disable assembly of reads extracted from protein and/or
nucleotide searches [disables downstream annotation]
--meta run assembly for metagenomic data
--noannotate disable CRISPR arrays and Cas protein gene annotation of
assembled reads from protein and/or nucleotide searches
-cpu threads #CPUs for all tasks
-mem RAM Gb of RAM for all tasks
-out folder path to and folder name for all outputs
--version, -v show program's version number and exit
The default run for CasCollect is in protein mode.
python CasCollect.py -fwd file.fastq -rev file.fastq -out folder
python CasCollect.py -single file.fastq -out folder
This will run CasCollect for the specified fastq files to search the default 120 Cas protein HMMs with 5 cycles of seed expansion on 1 cpu and 20 Gb RAM outputting all resulting into the specified folder.
python CasCollect.py -fwd file.fastq -rev file.fastq -out folder -hmm hmm_dir
This will run CasCollect using all the HMM files in the specified directory.
python CasCollect.py -fwd file.fastq -rev file.fastq -out folder --noprot --nucl -query file.fasta
python CasCollect.py -single file.fastq -out folder --noprot --nucl -query file.fasta
This will run CasCollect only for the DNA mode, searching with the query file. No protein search.
python CasCollect.py -fwd file.fastq -rev file.fastq -out folder --noprot --seed -define file.fasta
python CasCollect.py -single file.fastq -out folder --noprot --seed -define file.fasta
This will run CasCollect only for the User-defined mode, using the define file as the seeds for seed exapnsion and assembly. No protein or DNA search.
python CasCollect.py -fwd file.fastq -rev file.fastq -out folder --nucl -query file.fasta --seed -define file.fasta
python CasCollect.py -single file.fastq -out folder --nucl -query file.fasta --seed -define file.fasta
This will run CasCollect in Protein, DNA, and User-defined mode. All the seeds generated from this mode will be combined and used for the downstream seed expansion and assembly.