This is a pipeline for binning long reads using read overlap information.
Try the program on Google Colab.
- Install seq2vec from https://github.com/anuradhawick/seq2vec. Please add the binary to your PATH variable.
- Install wtdbg2 (which hosts kbm2) from https://github.com/ruanjue/wtdbg2. Please add the binary to your PATH variable.
- Install seqtk from https://github.com/lh3/seqtk. Please add the binary to your PATH variable (a sketch of updating PATH follows this list).
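For reference, a hedged sketch of the PATH update, assuming the binaries were built under ~/tools (these paths are assumptions; adjust to wherever you built the tools):

# hypothetical install locations; adjust to your setup
export PATH="$HOME/tools/seq2vec/build:$HOME/tools/wtdbg2:$HOME/tools/seqtk:$PATH"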
You can run OBLR on CPU only; however, this can be extremely slow. The recommended approach is to have two Python environments: 1) for PyTorch and 2) for RAPIDS.AI. Currently these are pre-built against different CUDA toolkits; in the future, a single environment should suffice. The two environments require the following packages.

PyTorch environment:
- biopython
- pytorch
- pytorch geometric
- numpy
- tqdm
- matplotlib (only for notebooks with plots)
- seaborn (only for notebooks with plots)

RAPIDS environment:
- CuML from https://rapids.ai/start.html
- tqdm
- numpy
- pandas
- matplotlib (only for notebooks with plots)
- seaborn (only for notebooks with plots)
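As a hedged example, the two environments could be created roughly as follows. The channel and version choices here are assumptions based on the RAPIDS 21.10 and PyTorch installers; check https://rapids.ai/start.html and https://pytorch.org for the exact command matching your CUDA toolkit.

# hypothetical PyTorch environment (package sources and versions are assumptions)
conda create -n pytorch -c pytorch -c pyg python=3.8 pytorch pyg
conda activate pytorch
pip install biopython numpy tqdm matplotlib seaborn

# hypothetical RAPIDS environment (versions are assumptions)
conda create -n rapids-21.10 -c rapidsai -c nvidia -c conda-forge rapids=21.10 python=3.8 cudatoolkit=11.2
conda activate rapids-21.10
pip install tqdm pandas matplotlib seaborn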
Note: If you plan to skip the RAPIDS.AI installation, make sure you have umap-learn and hdbscan installed in the PyTorch environment. Please install hdbscan from GitHub, as the conda bundle may have bugs.
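For example, a minimal install sketch (pip-based; the hdbscan repository is, to the best of our knowledge, scikit-learn-contrib/hdbscan):

conda activate pytorch
pip install umap-learn
# install hdbscan from GitHub rather than conda
pip install git+https://github.com/scikit-learn-contrib/hdbscan.git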
In one terminal tab, run the following.
conda activate rapids-21.10
jupyter notebook --port 8888
In another terminal tab, run the following.
conda activate pytorch
jupyter notebook --port 8889
You can also run these with nohup in the background.
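For example, a sketch for the RAPIDS tab (the log file name is arbitrary):

conda activate rapids-21.10
# keep the notebook server alive after the terminal closes
nohup jupyter notebook --port 8888 > notebook-8888.log 2>&1 &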
Now open the step-1-4.ipynb notebook from the rapids-21.10 environment and follow the instructions inside. Once finished, open step-5.ipynb from the pytorch environment and follow the instructions there. At the end of that notebook you can separate the reads into bins for the assembly task.
One might consider running the program on a server. The pipeline is also available as scripts.
Step-1: Preprocess reads
# path to your experiment directory containing the reads
exp="./test_data/"
# rename reads with a uniform prefix and convert FASTQ to FASTA
seqtk rename $exp/reads.fastq read_ | seqtk seq -A > $exp/reads.fasta
# obtaining read ids
grep ">" $exp/reads.fasta > $exp/read_ids
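As an optional sanity check (a sketch, not part of the pipeline), the number of read ids should equal the number of sequences in the renamed FASTA:

# the two counts below should match
grep -c ">" $exp/reads.fasta
wc -l < $exp/read_ids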
Step-2: Build the graph
exp="./test_data/"
# compute 4mer vectors (-t for threads)
seq2vec -k 4 -o $exp/4mers -f $exp/reads.fasta -t 32
# build the graph using chunked reads
./buildgraph_with_chunks.sh -r $exp/reads.fasta -c <CHUNK_SIZE> -o $exp/
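For example, with a hypothetical chunk size of 250000 reads (an illustrative value only, not a recommendation from the pipeline; choose a chunk size that fits your RAM):

./buildgraph_with_chunks.sh -r $exp/reads.fasta -c 250000 -o $exp/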
Step-3: Detect clusters
exp="./test_data/"
# activate the rapids environment (or the pytorch environment if you skipped RAPIDS, as noted above)
conda activate rapids-21.10
# reads.alns and degree are created by the kbm2 pipeline command in Step-2
python ./build_graph.py \
-a $exp/reads.alns \
-d $exp/degree \
-f $exp/4mers \
-i $exp/read_ids \
-o $exp/
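If this step succeeds, a data.npz file should appear under $exp/. A quick way to peek at its contents (a sketch; the array names inside are not documented here):

# list the arrays stored in the npz archive
python -c "import numpy as np; print(np.load('$exp/data.npz').files)"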
Step-4: Bin reads via label propagation
exp="./test_data/"
# activate pytorch environment
conda activate pytorch
# data.npz is produced by build_graph.py in Step-3
python sage_label_prop.py \
-d $exp/data.npz \
-o $exp/
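The binning output can then be used to split reads per bin. As a purely hypothetical sketch using seqtk, assuming you have a text file of read ids for one bin (the file name bin_0_ids.txt is an assumption, not a guaranteed pipeline output):

# extract the reads of one bin into a separate FASTA for assembly
seqtk subseq $exp/reads.fasta bin_0_ids.txt > $exp/bin_0.fasta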
Note: if you choose to run everything in a single script, refer to the file oblr_runner.sh to see how to switch conda environments within a bash script.
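For reference, a minimal sketch of switching conda environments inside a bash script (evaluating the conda shell hook is the standard mechanism for non-interactive shells; the environment names follow the ones above):

#!/bin/bash
# make `conda activate` usable in a non-interactive shell
eval "$(conda shell.bash hook)"
conda activate rapids-21.10
# ... run the RAPIDS steps ...
conda activate pytorch
# ... run the PyTorch steps ...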
kbm2 is the most resource-demanding step; ~32 GB of RAM or more is recommended. If you have fast storage, consider increasing swap space (a sketch follows): the run may be slower, but kbm2 will finish reliably. Alternatively, use the chunked version of graph building as shown in Step-2.
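A minimal sketch of adding swap on Linux (the size and path are assumptions; requires root):

# create and enable a 64 GB swap file
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile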
RAPIDS.AI is advised: in the worst case, GPU time was 1:30.83 while CPU time was 34:15.96 on a 32-thread machine (almost a 1000x speedup compared to single-threaded mode).
@inproceedings{wickramarachchi2022metagenomics,
title={Metagenomics Binning of Long Reads Using Read-Overlap Graphs},
author={Wickramarachchi, Anuradha and Lin, Yu},
booktitle={Comparative Genomics: 19th International Conference, RECOMB-CG 2022, La Jolla, CA, USA, May 20--21, 2022, Proceedings},
pages={260--278},
year={2022},
organization={Springer}
}