Symbiont-Screener is a reference-free approach that uses a trio-based screening model to identify high-confidence host long reads among reads from symbionts and contaminants, despite the low accuracy of long-read sequencing.
Symbiont-Screener: A reference-free tool to separate host sequences from symbionts for error-prone long reads
If you find Symbiont-Screener useful in your research, please cite:
System requirement:
- make
- gcc/g++ (version>4.8.5 to support c++11)
- python3 in your environment with packages:
  - numpy
  - pandas
  - sklearn
  - plotly
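A quick way to see which of the requirements above are missing is sketched below. The helper names `missing_tools` and `missing_modules` are made up for this example, and the script does not check the gcc/g++ version, so that still needs a manual look.

```shell
#!/usr/bin/env bash
# Sketch: report which of the listed requirements are missing.

# Print every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print every Python module from the argument list that python3 cannot import.
missing_modules() {
    for module in "$@"; do
        python3 -c "import $module" >/dev/null 2>&1 || echo "$module"
    done
}

echo "missing tools:   $(missing_tools make gcc g++ python3 | tr '\n' ' ')"
echo "missing modules: $(missing_modules numpy pandas sklearn plotly | tr '\n' ' ')"
```

Both lines end empty when everything is installed.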
Third-party software:
conda install -c intel scikit-learn
conda install -c bioconda symbiontscreener
Notice: the easy-to-use_pipelines scripts are not included in the conda version; please use sysc directly.
git clone https://github.com/BGI-Qingdao/Symbiont-Screener ./Symbiont-Screener
cd ./Symbiont-Screener/sources
make
To run Symbiont-Screener, the minimum input dataset contains one paternal NGS read file, one maternal NGS read file, and one child's TGS read file.
The simulated PacBio CLR reads of the child with parental NGS reads can be downloaded from https://zenodo.org/record/7257538#.Y1vuaHZByUk :
maternal_mix_simngs.v2.fasta.gz
paternal_mix_simngs.v2.fasta.gz
mix_simpb.fasta.gz
- Usage example I:
- run Symbiont-Screener in strobemer mode with clustering. (Recommended ⚠️)
./Symbiont-Screener/easy-to-use_pipelines/sysc_strobmercluster_mode.sh \
--maternal maternal_mix_simngs.v2.fasta.gz \
--paternal paternal_mix_simngs.v2.fasta.gz \
--offspring mix_simpb.fasta.gz
- Usage example II:
- run Symbiont-Screener in strobemer mode without further clustering.
./Symbiont-Screener/easy-to-use_pipelines/sysc_strobmer_mode.sh \
--maternal maternal_mix_simngs.v2.fasta.gz \
--paternal paternal_mix_simngs.v2.fasta.gz \
--offspring mix_simpb.fasta.gz
- Usage example III:
- run Symbiont-Screener in k-mer mode with clustering.
./Symbiont-Screener/easy-to-use_pipelines/sysc_kmercluster_mode.sh \
--maternal maternal_mix_simngs.v2.fasta.gz \
--paternal paternal_mix_simngs.v2.fasta.gz \
--offspring mix_simpb.fasta.gz
- Usage example IV:
- run Symbiont-Screener in k-mer mode without further clustering.
./Symbiont-Screener/easy-to-use_pipelines/sysc_kmer_mode.sh \
--maternal maternal_mix_simngs.v2.fasta.gz \
--paternal paternal_mix_simngs.v2.fasta.gz \
--offspring mix_simpb.fasta.gz
sysc usage:
- type
./Symbiont-Screener/sysc -h
and get:
Usage : sysc <action> [options]
Actions:
+---------+-----------------------+-----------------------+
|stage | strobemer mode (s40) | kmer mode (k21) |
+---------+-----------------------+-----------------------+
|step01 | build_s40 | build_k21 |
+---------+-----------------------+-----------------------+
|step02.1 | density_s40 | density_k21 |
|step02.2 | trio_result_s40 | trio_result_k21 |
+---------+-----------------------+-----------------------+
|step03.1 | cluster_s40 | cluster_k21 |
|step03.2 | consensus_cluster_s40 | consensus_cluster_k21 |
+---------+-----------------------+-----------------------+
- module usage:
- Each action has its own usage, for example
./Symbiont-Screener/sysc build_s40 -h
get:
Usage :
./sysc build_s40 [OPTION]
Build randstrobe(2,10,30,30) based on paternal and maternal NGS reads by jellyfish.
Options :
basic input:
--paternal paternal NGS read file in FASTA/FASTQ format.
file in gzip format can be accepted, but filename must end by ".gz".
--maternal maternal NGS read file in FASTA/FASTQ format.
file in gzip format can be accepted, but filename must end by ".gz".
resources:
--thread thread number.
[ optional, default 8 threads. ]
--size initial hash table by jellyfish.
[ optional, default 1GB. ]
detail configuration:
--auto_bounds (0/1) automatically calculate lower and upper bounds based on k-mer analysis.
[ optional, default 1; ]
--m-lower assigned lower bound for maternal k-mer frequencies, which will ignore k-mers with count < m-lower.
[ optional, default 0. ]
--m-upper assigned upper bound for maternal k-mer frequencies, which will ignore k-mers with count > m-upper.
[ optional, default 0. ]
--p-lower assigned lower bound for paternal k-mer frequencies, which will ignore k-mers with count < p-lower.
[ optional, default 0. ]
--p-upper assigned upper bound for paternal k-mer frequencies, which will ignore k-mers with count > p-upper.
[ optional, default 0. ]
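The usage text above says that gzip-compressed input must have a filename ending in ".gz". A minimal pre-flight check for the two parental files could look like the sketch below. `check_gz_suffix` is a hypothetical helper and not part of sysc. It detects gzip content by testing the stream with `gzip -t`.

```shell
#!/usr/bin/env bash
# Sketch: warn when a gzip-compressed input file is not named *.gz.

# check_gz_suffix FILE : returns 0 if FILE is plain text or a properly named
# gzip file, 1 if it is gzip data without a .gz suffix, 2 if unreadable.
check_gz_suffix() {
    f=$1
    if [ ! -r "$f" ]; then
        echo "cannot read: $f" >&2
        return 2
    fi
    # Read from stdin so gzip does not skip files with an unknown suffix.
    if gzip -t < "$f" 2>/dev/null; then
        case "$f" in
            *.gz) return 0 ;;
            *) echo "gzip-compressed but not named *.gz: $f" >&2
               return 1 ;;
        esac
    fi
    return 0
}

for f in paternal_mix_simngs.v2.fasta.gz maternal_mix_simngs.v2.fasta.gz; do
    check_gz_suffix "$f" || echo "please fix: $f"
done
```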
Please see details of all other actions by
sysc <action> -h
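To read the help of every action in one go, the loop below saves each help text to a file. It is only a sketch: the action names come from the table printed by sysc -h, while the `actions_for` helper, the `SYSC` variable, and the `help_<action>.txt` file names are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: save the help text of every action of one mode.
SYSC=${SYSC:-./Symbiont-Screener/sysc}

# Print the action names of a mode ("s40" or "k21") in workflow order.
actions_for() {
    for stage in build density trio_result cluster consensus_cluster; do
        echo "${stage}_$1"
    done
}

# Write each action's help text to help_<action>.txt if sysc is available.
if [ -x "$SYSC" ]; then
    for action in $(actions_for s40); do
        "$SYSC" "$action" -h > "help_${action}.txt" 2>&1
    done
fi
```

Replace s40 with k21 to collect the k-mer mode help instead.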
The workflows in example pipelines:
Each pipeline in the easy-to-use_pipelines folder uses the sysc command but follows a different workflow.
The four available workflows of sysc:
+--------------------------------------------------------------------------------+-------------------------------+
| Workflows | Example pipeline |
+---+------------------------------------------------------------------------+---+-------------------------------+
| | -> build_s40 -> density_s40 -> trio_result_s40 ----------------------> | | sysc_strobmer_mode.sh |
| S | | | | |
| T | +--------> cluster_s40 -> consensus_cluster_s40 -> | E | sysc_strobmercluster_mode.sh |
| A | | N | |
| R | -> build_k21 -> density_k21 -> trio_result_k21 ----------------------> | D | sysc_kmer_mode.sh |
| T | | | | |
| | +--------> cluster_k21 -> consensus_cluster_k21 -> | | sysc_kmercluster_mode.sh |
+---+------------------------------------------------------------------------+---+-------------------------------+
To avoid redundant work, the pipelines in the easy-to-use_pipelines folder expose only the basic parameters. However, sysc, as the main entry point, accepts more detailed parameters that you might need for a specific case. We therefore recommend creating your own pipeline based on those in the easy-to-use_pipelines folder and modifying these parameters.
If a run stops unexpectedly, first read the log and err files to find out why the program exited; invalid or incomplete parameters are the most common cause. The sysc command creates tag files to record finished steps, so re-running the same command in the same path automatically skips those steps and continues.
After you modify your pipeline, please move or rename previous result files so that the program will create a new working folder in the current path and restart without any influence on the previous results.
For example, if you only change the parameters of cluster_k21, please rename, move, or delete both step03.1.k21 and step03.2.k21 so that the program re-runs both cluster_k21 and consensus_cluster_k21. In this case, please keep step01.k21 and step02.1.k21 unchanged so that the program re-uses those results.
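The cluster_k21 case can be sketched as a small shell helper that moves the stale folders into a backup folder instead of deleting them. `archive_steps` and the `old_results_<timestamp>` folder name are hypothetical and not something sysc provides.

```shell
#!/usr/bin/env bash
# Sketch: move stale step folders aside so sysc re-runs those steps.

# archive_steps DEST FOLDER... : move each existing FOLDER into DEST.
archive_steps() {
    dest=$1
    shift
    for folder in "$@"; do
        if [ -d "$folder" ]; then
            mkdir -p "$dest"
            mv "$folder" "$dest"/
            echo "archived $folder"
        fi
    done
}

# After changing cluster_k21 parameters: archive both step03 folders and
# leave step01.k21 and step02.1.k21 in place so they are re-used.
archive_steps "old_results_$(date +%Y%m%d_%H%M%S)" step03.1.k21 step03.2.k21
```

Folders that do not exist are skipped, so the call is harmless in a fresh working directory.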
- The NGS depth parameters
Here are the most important parameters that you should check.
These five parameters exist in both the build_s40 and build_k21 actions.
--auto_bounds
--m-lower
--m-upper
--p-lower
--p-upper
By default, --auto_bounds=1, so the program infers the bounds automatically.
An interactive HTML plot named kmer_frequency.html is created in the step02.1.k21 or step02.1.s40 folder.
If the host reads are not the majority of the mixed reads, you should set --auto_bounds=0 and customize your depth thresholds based on this HTML plot.
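A call with manual bounds might look like the sketch below. The option names are taken from the build_s40 usage text. The threshold values 5 and 60 are purely hypothetical placeholders, the `build_with_bounds` wrapper and `SYSC` variable exist only in this example, and passing each value as a separate argument after the option name is an assumption, so check sysc build_s40 -h before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: run build_s40 with manual depth bounds instead of automatic ones.
SYSC=${SYSC:-./Symbiont-Screener/sysc}

# build_with_bounds M_LOWER M_UPPER P_LOWER P_UPPER
build_with_bounds() {
    "$SYSC" build_s40 \
        --paternal paternal_mix_simngs.v2.fasta.gz \
        --maternal maternal_mix_simngs.v2.fasta.gz \
        --auto_bounds 0 \
        --m-lower "$1" --m-upper "$2" \
        --p-lower "$3" --p-upper "$4"
}

# Hypothetical thresholds; read your own off kmer_frequency.html.
if [ -x "$SYSC" ]; then
    build_with_bounds 5 60 5 60
fi
```

Because kmer_frequency.html is written to a step02.1 folder, one plausible routine is to run once with the default bounds, inspect the plot, move the earlier result folders aside, and then re-run with the chosen bounds.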
- The minimum number of best hits for consensus clustering.
In clustering mode, an interactive HTML plot named cluster.html is created in the step03.1.s40 or step03.1.k21 folder.
Please set the --min_hit parameter based on this plot.
- Other parameters
Please check our detailed usage for other parameters such as threads and memory control.
See https://github.com/ksahlin/strobemers and https://github.com/BGI-Qingdao/strobemer_cpptest
Raising an issue is always recommended.
Email to xumengyang@genomics.cn, guolidong@genomics.cn, and qiyanwei1@genomics.cn
:)