Using the Leon compressor integrated into GATB-Core

Genscale Team edited this page Jul 25, 2017 · 7 revisions

Introduction

As of GATB-Core 1.4.0, the Leon compressor is integrated into the GATB-Core library.

This means that the Leon file format can now be handled natively by any software that relies on GATB-Core. In other words, you can process reads directly from a Leon file, without decompressing it first.

How to compress raw sequence files?

The Leon compressor/decompressor is available as a binary tool once you have compiled the GATB-Core library. The binary is called 'leon' and is located in the 'build/bin' directory, next to the other GATB-Core tools (dbgh5, dbginfo, ...).

To compress raw DNA sequence files (Fastq and Fasta in plain text or gzipped), use Leon as follows:

leon -c -lossless -file <your-file>

To compress raw DNA sequence files in lossy mode (which applies only to Fastq files), use:

leon -c -file <your-file>

Once Leon has finished compressing your data file, you'll see a '.h5' file next to your DNA sequence file: this is the Leon compressed file.

How to read Leon compressed files in C++ code?

You can programmatically open and read the content (i.e. sequences) of a Leon '.h5' file in a very straightforward way as follows:

IBank* leonBank = Bank::open ("/path/to/leon-file.h5");

Quite simple, isn't it? You then use the IBank pointer (the 'leonBank' variable) as you would for any other kind of sequence bank (Fasta and Fastq). For instance, here is how to iterate over sequences:

// request a single iterator from the bank
Iterator<Sequence>* itLeon = leonBank->iterator();
LOCAL(itLeon);
for (itLeon->first(); !itLeon->isDone(); itLeon->next()){
    Sequence& seq = itLeon->item();
    //to get sequence definition line, use: seq.getComment()
    //to get sequence itself (nucleotides), use: seq.toString()
    //to get sequence quality (Fastq only), use: seq.getQuality()
}
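The same first/isDone/next/item loop can be wrapped in a small reusable function. The block below is only a sketch of that idea: 'FakeSequence' and 'VectorIterator' are hypothetical stand-ins written for this example (they are not GATB-Core classes) so that the code compiles without the library. It assumes that the iterator type exposes first(), isDone(), next() and item(), and that each item offers toString(), as in the loop above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a sequence: only mimics toString().
struct FakeSequence {
    std::string data;
    std::string toString() const { return data; }
};

// Hypothetical stand-in for a bank iterator, backed by a plain vector.
class VectorIterator {
public:
    explicit VectorIterator(std::vector<FakeSequence> items)
        : items_(std::move(items)), pos_(0) {}
    void first() { pos_ = 0; }
    bool isDone() const { return pos_ >= items_.size(); }
    void next() { ++pos_; }
    FakeSequence& item() { return items_[pos_]; }
private:
    std::vector<FakeSequence> items_;
    std::size_t pos_;
};

// Totals gathered while walking an iterator.
struct BankStats {
    std::size_t sequences = 0;
    std::size_t nucleotides = 0;
};

// Walks any iterator exposing first/isDone/next/item and counts
// the sequences seen and the total number of nucleotides.
template <typename It>
BankStats collectStats(It& it) {
    BankStats stats;
    for (it.first(); !it.isDone(); it.next()) {
        stats.sequences += 1;
        stats.nucleotides += it.item().toString().size();
    }
    return stats;
}
```

With a real bank, the same template would presumably be called as collectStats(*itLeon), since it only relies on the member functions used in the loop above.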

How to call the Leon compressor in C++ code?

Instead of using Leon through its command-line tool, you can also use it programmatically as follows:

// prepare the Leon command line as a list of strings
std::vector<std::string> data = {
    "-",                 // placeholder for argv[0] (program name)
    "-c",                // compress
    "-file", fastqFile,
    "-lossless",         // <-- LOSSLESS (remove for lossy mode)
    "-verbose", "0",
    "-kmer-size", "31",
    "-abundance", "4"
};
// build an argv-style array; 'data' must stay alive while 'leon_args' is used
std::vector<char*> leon_args;
for (std::string& arg : data) {
    leon_args.push_back(&arg[0]);
}
// start the Leon compressor
Leon().run(leon_args.size(), &leon_args[0]);
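The argument-building step above can be factored into two small helpers, which makes it easy to switch between lossless and lossy mode. This is a minimal sketch using only the standard library: 'makeLeonArgs' and 'toArgv' are hypothetical names introduced here, not part of GATB-Core, and only the options already shown on this page are used.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: assembles the compression options shown above.
// The leading "-" fills the argv[0] slot, as in the snippet above.
std::vector<std::string> makeLeonArgs(const std::string& file, bool lossless) {
    std::vector<std::string> args = {"-", "-c", "-file", file};
    if (lossless) {
        args.push_back("-lossless");
    }
    return args;
}

// Hypothetical helper: builds an argv-style array over 'args'.
// The pointers refer to the strings' own buffers, so 'args' must
// outlive the returned vector and must not be modified meanwhile.
std::vector<char*> toArgv(std::vector<std::string>& args) {
    std::vector<char*> argv;
    argv.reserve(args.size());
    for (std::string& a : args) {
        argv.push_back(&a[0]);
    }
    return argv;
}
```

Assuming the same Leon class as above, the call would then read: build 'args' with makeLeonArgs(fastqFile, true), build 'argv' with toArgv(args), and pass argv.size() and &argv[0] to Leon().run().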

To review the various ways of using Leon programmatically, please refer to the TestLeon.cpp source file.