A C++ library for indexing genome sequencing datasets using a colored de Bruijn graph, hash functions, and a Bloom filter. The implementation is based on this library by Diego Diaz Dominguez et al.
This tool requires:
- Ubuntu 18.04
First, download the library and move to the library's root directory.
git clone git@github.com:cBioLab/hash_cdbg.git
cd hash_cdbg
Then, prepare for compilation.
mkdir build && cd build
cmake ..
If you want to specify the directory in which to install this library, you can use:
cmake .. -DCMAKE_INSTALL_PREFIX={your_install_path}/hash_cdbg
Finally, compile and install the library.
make && make install
To get started quickly, look in the util directory. build_cdbg.cpp builds an index; its contents are as follows:
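#include <iostream>
#include <string>
#include <hash_cdbg/boss.hpp>
int main(int argc, char* argv[]) {
    std::string input_file = "data/example.fastq";  // input reads in FASTQ format
    size_t kmer_size = 30;                          // k-mer length used by the graph
    size_t n_threads = 1;                           // number of worker threads
    // Build the index from the input file.
    dbg_boss dbg_index(input_file, kmer_size, n_threads);
    // Write the index to disk.
    store_to_file(dbg_index, "example.cdbg");
    return 0;
}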
To compile and execute this code, do the following:
cd hash_cdbg
g++ -o build_cdbg.out ./util/build_cdbg.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz -std=c++17 -O3
./build_cdbg.out
The resulting example.cdbg is the index file.
To rebuild the original sequences from this index, do the following using build_fm_index.cpp and rebuild_seqs.cpp:
g++ -o build_fm_index.out ./util/build_fm_index.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz
./build_fm_index.out data/example.fastq example
g++ -o rebuild_seqs.out ./util/rebuild_seqs.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz -std=c++17 -O3
./rebuild_seqs.out example.cdbg example.fm_index 1 example.re
The resulting example.re.fasta is a FASTA file that contains the rebuilt sequences of example.fastq and their reverse complements.
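For readers unfamiliar with the term, the reverse complement of a read is the read written backwards with each base swapped for its pairing partner (A with T, C with G). The sketch below is only an illustration: `complement_base` and `reverse_complement` are hypothetical helper names and are not part of hash_cdbg, and the sketch assumes uppercase reads over A, C, G, T only.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of hash_cdbg): complement of a single base.
// Assumes uppercase A/C/G/T; anything else is rejected.
char complement_base(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default: throw std::invalid_argument("unsupported base");
    }
}

// Hypothetical helper: reverse the read, then complement every base.
std::string reverse_complement(const std::string& read) {
    std::string rc(read.rbegin(), read.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement_base);
    return rc;
}
```

For example, under these assumptions `reverse_complement("AACG")` yields `"CGTT"`.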
If you want to reproduce our experiments, see the experiments README.
This tool does not support reads containing N bases. As a preprocessing step, compile and run remove_n_read.cpp to remove such reads:
g++ -o remove_n_read.out ./util/remove_n_read.cpp -lpthread -std=c++17 -O3
./remove_n_read.out {your_fastq_file} {output_fastq_file}