A tool to compress a set of k-mers represented in FASTA/FASTQ/KFF file(s).
There are 2 ways to install ESS-Compress: either from source or from pre-compiled binaries.
- Linux operating system (64 bit)
- Git
- GCC >= 4.8 or a C++11 capable compiler
- CMake 3.1+
Download source and install:
git clone https://github.com/medvedevgroup/ESSCompress
cd ESSCompress
./INSTALL
Upon successful execution of this script, you will see Linux binaries for kff-tools (essAuxKffTools), Blight (essAuxBlight), BCALM (essAuxBcalm), DSK (essAuxDsk and essAuxDsk2ascii) and MFCompress (essAuxMFCompressC and essAuxMFCompressD) in the aux folder, along with essAuxValidate, essAuxCompress, essAuxDecompress and getMaxLen.
- Linux operating system (64 bit)
- Download the latest Linux 64-bit binaries:
  wget https://github.com/medvedevgroup/ESSCompress/releases/download/v3.1/essCompress-v3.1-linux-64.tar.gz
- Extract the .tar.gz file and change into the uncompressed directory:
  tar xvzf essCompress-v3.1-linux-64.tar.gz
  cd essCompress-v3.1/
- You will see two executables in the directory named essCompress and essDecompress.
- You can either refer to these two executables directly when compressing/decompressing (using the commands ./essCompress and ./essDecompress),
- Or you can move/copy ALL the executables in essCompress-v3.1/bin to a bin directory that is already in your PATH. For instance, if /usr/bin is already in PATH, run the command mv ess* /usr/bin to move all executables for the ESS-Compress software. An alternative to moving/copying the executables is adding the location of essCompress-v3.1/bin to your PATH.
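The PATH alternative can be sketched as follows. This is a minimal example that assumes the release was extracted into the current working directory; ESS_BIN is a variable name chosen only for illustration.

```shell
# Assumption: essCompress-v3.1/ was extracted into the current working directory.
ESS_BIN="$PWD/essCompress-v3.1/bin"

# Prepend the directory to PATH for the current shell session only.
export PATH="$ESS_BIN:$PATH"

# With the binaries in place, `command -v essCompress` should now resolve into $ESS_BIN.
```

To make the change persist across sessions, the same export line (with an absolute path) can be appended to a shell startup file such as ~/.bashrc.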
This example assumes that you are currently inside the base directory essCompress-v3.1 after you have completed installing the tool as per the instructions.
Let's say you have a small FASTA file of sequences, e.g. examples/smallExample.fa, and cat examples/smallExample.fa returns
>
AAAAAAACCCCCCCCCC
>
CCCCCCCCCCA
We can compress it using k=11 as follows
./bin/essCompress -k 11 -i examples/smallExample.fa
Now ls examples will show both the original input file and the compressed file in the same directory:
smallExample.fa
smallExample.fa.essc
...
smallExample.fa.essc is a compressed binary file generated by MFCompress, so it is not in a readable format.
To decompress into a readable format, you can run
./bin/essDecompress examples/smallExample.fa.essc
You'll now see the decompressed file smallExample.fa.essd in the same directory. cat examples/smallExample.fa.essd will return:
>
AAAAAAACCCCCCCCCCA
Notice that the decompressed fasta file is not the same as the original file, but it contains the same k-mers as smallExample.fa. You can double check this using the command
./bin/essAuxValidate 11 examples/smallExample.fa examples/smallExample.fa.essd
If they contain the same k-mers (i.e. 11-mers), you will see an output like this:
### SUCCESS: The two files contain same k-mers! ###
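To see why the two files are equivalent, the 11-mers of each can be listed and compared by hand. The sketch below is not how essAuxValidate works internally; it is a simplified illustration that assumes one sequence per line and does not merge a k-mer with its reverse complement. The helper name kmers and the temporary file names are hypothetical.

```shell
# List the distinct k-mers of a FASTA file.
# Assumptions: sequences are not wrapped across lines; reverse complements are NOT merged.
kmers() {
  awk -v k="$1" '!/^>/ { for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }' "$2" | LC_ALL=C sort -u
}

# Recreate the two files from the example above in a scratch directory.
tmp=$(mktemp -d)
printf '>\nAAAAAAACCCCCCCCCC\n>\nCCCCCCCCCCA\n' > "$tmp/original.fa"
printf '>\nAAAAAAACCCCCCCCCCA\n' > "$tmp/decompressed.fa"

kmers 11 "$tmp/original.fa" > "$tmp/original.kmers"
kmers 11 "$tmp/decompressed.fa" > "$tmp/decompressed.kmers"

# Both files yield the same 8 distinct 11-mers.
if cmp -s "$tmp/original.kmers" "$tmp/decompressed.kmers"; then
  echo "same 11-mers"
else
  echo "different 11-mers"
fi
```

The original file spends 28 bases on those 8 k-mers, while the decompressed representation needs only 18, which is the redundancy the compression removes.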
Syntax: ./essCompress [parameters]
mandatory arguments:
-k [int] k-mer size (must be >=4)
-i [input-file] Path to input file. Input file can be either of these 3 formats:
1. a single fasta/fastq file (either gzipped or not)
2. a single text file containing the list of multiple fasta/fastq files (one file per line)
3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.
optional arguments:
-a [int] Default=1. Sets a threshold X, such that k-mers that appear less than X times in the input dataset are filtered out.
-o [output-dir] Specify output directory
-t [int] Default=1. Number of threads (used by bcalm, dsk and blight).
-x [int] Default=1. Bytes allocated for associated abundance data per k-mer in kff. For highest compression with kff, by default the program limits 1 byte per k-mer (max value 255).
-f Fast compression mode: uses less memory, but achieves smaller compression ratio.
-u UST mode (output an SPSS, which does not contain any duplicate k-mers and the k-mers it contains are exactly the distinct k-mers in the input. A k-mer and its reverse complement are treated as equal.)
-d DEBUG mode. If debug mode is enabled, no intermediate files are removed.
-v Enable verbose mode: print more useful information.
-c Verify correctness: check that all the distinct k-mers in the input file appears exactly once in compressed file.
-h Print this Help
-V Print version number
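The -u description treats a k-mer and its reverse complement as equal. One common way to express that equivalence is to map each k-mer to a canonical form. The sketch below assumes the convention "lexicographically smaller of the two"; the help text does not say which convention ESS-Compress uses, and the function names revcomp and canonical are hypothetical.

```shell
# Reverse complement of a DNA string: reverse it, then swap A<->T and C<->G.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# Assumed convention: the canonical form is the lexicographically smaller
# of the k-mer and its reverse complement.
canonical() {
  rc=$(revcomp "$1")
  printf '%s\n%s\n' "$1" "$rc" | LC_ALL=C sort | head -n 1
}

canonical ACGTT   # prints AACGT
canonical AACGT   # prints AACGT
```

Because both calls print the same string, ACGTT and AACGT count as one k-mer under this convention.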
Two important input parameters are
- input [-i]
- k-mer size [-k]
If input is a .kff file, [-k] parameter is disregarded.
File input format can be
1. a single fasta or fastq file (either gzipped or not)
2. a single text file containing the list of multiple fasta/fastq files (one file per line)
3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.
To pass a single FASTA file as input and compress: ./bin/essCompress -i examples/11mers.fa -k 11
To pass a single KFF file as input and compress: ./bin/essCompress -i examples/kmc_k15.kff
To pass several files as input, generate the list of files (one file per line) as follows:
ls -1 examples/*.fa > list_reads
./bin/essCompress -i list_reads -k 5
ESS-Compress uses BCALM 2 under the hood, which ignores paired-end information: all given reads contribute k-mers to the graph, as long as those k-mers pass the abundance threshold.
If using fast mode/normal mode: the compressed output is in a file with the .essc extension.
If using UST mode without kff: the compressed output is in a file with the .fa.essd extension.
If compressing a kff file: the compressed output is in a file with the .compressed.kff extension.
Syntax: ./essDecompress [file_to_decompress]
Input: a .essc file generated by essCompress
Output: a fasta file with .essd extension, where all the distinct k-mers represented by the input .essc file appear exactly once. In other words, output is a spectrum-preserving string set.
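The compress, decompress and validate steps from the quick-start example can be chained into one helper. This is a sketch only: the function name roundtrip is hypothetical, it assumes you are in the base directory so that the ./bin/ paths resolve, and it assumes normal mode, where the outputs are named <input>.essc and <input>.essd as in the smallExample.fa walkthrough.

```shell
# Hypothetical helper: compress a FASTA file, decompress it again, and
# check that both files contain the same k-mers.
# Usage: roundtrip <k> <fasta-file>
roundtrip() {
  k="$1"
  fa="$2"
  ./bin/essCompress -k "$k" -i "$fa" &&
    ./bin/essDecompress "$fa.essc" &&
    ./bin/essAuxValidate "$k" "$fa" "$fa.essd"
}

# Example call (commented out because it needs the installed binaries):
# roundtrip 11 examples/smallExample.fa
```

Each step runs only if the previous one succeeded, so a failed compression does not leave you validating a stale .essd file.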
If using ESS-Compress in your research, please cite
- Amatur Rahman, Rayan Chikhi and Paul Medvedev, Disk compression of k-mer sets, WABI 2020.