Using the Leon compressor integrated into GATB-Core
As of GATB-Core 1.4.0, the Leon compressor is integrated into the GATB-Core library.
This means that the Leon file format can now be handled natively by all software relying on GATB-Core. In other words, you can process reads without first decompressing the Leon file.
The Leon compressor/decompressor is available as a binary tool once you have compiled the GATB-Core library. That binary is called 'leon' and is located next to the other GATB-Core tools (dbgh5, dbginfo, ...) in the 'build/bin' directory.
To compress raw DNA sequence files (Fastq and Fasta in plain text or gzipped), use Leon as follows:
leon -c -lossless -file <your-file>
To compress raw DNA sequence files using lossy mode (applies only to Fastq files), use:
leon -c -file <your-file>
Once Leon has finished compressing your data file, you'll see a '.leon' file next to your DNA sequence file: this is the Leon compressed file.
Please refer to the Leon user manual for more information on how to use this tool.
You can programmatically open and read the content (i.e. sequences) of a Leon '.leon' file in a very straightforward way as follows:
IBank* leonBank = Bank::open ("/path/to/sequence-file.leon");
You then use the IBank pointer (the 'leonBank' variable) as you would for any other kind of sequence bank (Fasta or Fastq). For instance, here is how to iterate over sequences:
Iterator<Sequence>* itLeon = leonBank->iterator();
LOCAL(itLeon);
for (itLeon->first(); !itLeon->isDone(); itLeon->next()) {
    Sequence& seq = itLeon->item();
    // to get the sequence definition line, use: seq.getComment()
    // to get the sequence itself (nucleotides), use: seq.toString()
    // to get the sequence quality (Fastq only), use: seq.getQuality()
}
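The same loop can be wrapped in a small helper that summarizes a bank. The sketch below is an illustration only, not part of GATB-Core: `BankStats`, `collectStats`, `FakeSequence` and `FakeIterator` are hypothetical names. The helper is templated on the iterator type so that it compiles without the library. It assumes only the `first()`, `isDone()`, `next()` and `item()` calls shown above, plus an item exposing `toString()`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical summary of a sequence bank (not a GATB-Core type).
struct BankStats {
    std::size_t sequences = 0;    // number of sequences visited
    std::size_t nucleotides = 0;  // sum of the sequence lengths
};

// Walks any iterator exposing first()/isDone()/next()/item(),
// where item() returns an object with a toString() method.
template <typename SequenceIterator>
BankStats collectStats(SequenceIterator& it) {
    BankStats stats;
    for (it.first(); !it.isDone(); it.next()) {
        stats.sequences += 1;
        stats.nucleotides += it.item().toString().size();
    }
    return stats;
}

// Stand-ins used only to exercise the helper without GATB-Core.
struct FakeSequence {
    std::string data;
    std::string toString() const { return data; }
};

class FakeIterator {
public:
    explicit FakeIterator(std::vector<FakeSequence> items)
        : items_(std::move(items)), pos_(0) {}
    void first() { pos_ = 0; }
    bool isDone() const { return pos_ >= items_.size(); }
    void next() { ++pos_; }
    FakeSequence& item() { return items_[pos_]; }

private:
    std::vector<FakeSequence> items_;
    std::size_t pos_;
};
```

With GATB-Core, the call would presumably be `collectStats(*itLeon)` on the iterator obtained above. This assumes that `Sequence::toString()` returns a `std::string`.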
Instead of using Leon through its command-line tool, you can also use it programmatically as follows:
// we prepare the Leon command-line
std::vector<char*> leon_args;
std::vector<std::string> data = {
    "-",
    "-c",
    "-file", fastqFile,
    "-lossless", // <-- LOSSLESS
    "-verbose", "0",
    "-kmer-size", "31",
    "-abundance", "4"
};
for (std::vector<std::string>::iterator loop = data.begin(); loop != data.end(); ++loop) {
    leon_args.push_back(&(*loop)[0]);
}
// we start Leon compressor
Leon().run(leon_args.size(), &leon_args[0]);
To review the various ways of using Leon programmatically, please refer to the TestLeon.cpp source file.
Leon stores all the compressed data (nucleotides and quality) using the HDF5 format. This means you can use the GATB-Core tool 'gatb-h5dump' to get information about the content of a '.leon' file.
Originally, the Leon compressor was developed as a separate tool outside the GATB-Core library. That version is called GATB-Tool-Leon, or standalone Leon for short. It is still maintained and available here.
With the integration of the Leon compression/decompression algorithm into the GATB-Core library, the source code of standalone Leon has been dramatically simplified, as you can see here.
As a result, standalone Leon and the Leon compressor shipped with GATB-Core are exactly the same software. You can use whichever release suits your needs. The '.leon' files generated by these two flavours of Leon are fully compatible.
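The argument vector built above holds raw pointers into the `std::string` objects, so those strings must outlive the call and must not be modified in between. The following is a minimal sketch of a helper that keeps the strings and the pointers together. `ArgList` is a hypothetical name, not a GATB-Core type. The final call to `Leon().run` appears only as a comment and assumes the same signature as the snippet above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of GATB-Core): owns the argument strings and
// exposes them as the argc/argv pair expected by a main()-style entry point.
class ArgList {
public:
    explicit ArgList(std::vector<std::string> args) : storage_(std::move(args)) {
        pointers_.reserve(storage_.size() + 1);
        for (std::string& arg : storage_) {
            pointers_.push_back(&arg[0]);  // valid while storage_ is untouched
        }
        pointers_.push_back(nullptr);      // conventional argv terminator
    }

    // Copying would leave pointers_ aimed at the other object's strings.
    ArgList(const ArgList&) = delete;
    ArgList& operator=(const ArgList&) = delete;

    int argc() const { return static_cast<int>(storage_.size()); }
    char** argv() { return pointers_.data(); }

private:
    std::vector<std::string> storage_;
    std::vector<char*> pointers_;
};

// Assumed usage with the lossless settings shown above:
//   std::vector<std::string> data = {"-", "-c", "-file", fastqFile, "-lossless"};
//   ArgList args(data);
//   Leon().run(args.argc(), args.argv());
```

Bundling the storage and the pointer array in one non-copyable object makes the lifetime requirement explicit. The pointers stay valid for as long as the `ArgList` itself is alive.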