This software is the implementation of eGSA [1, 2], an external memory algorithm to construct generalized enhanced suffix arrays.
Given a collection of K strings, eGSA outputs the:
- Generalized suffix array
- LCP-array
- Burrows-Wheeler transform (optional)
git clone https://github.com/felipelouza/egsa.git
cd egsa
make
Given a collection of K strings concatenated in a single file FILE.
./egsa [options] FILE K
Available options:
-h this help message
-b output BWT (ext: .bwt)
-c check SA and LCP
-v verbose output
Notes
- K=0 gives all strings as input.
- Supported extensions are: .txt, .fasta and .fastq.
- Strings are separated per '\0' (new line) in .txt.
The generalized enhanced suffix array is stored in a binary file in the same directory of FILE:
FILE.K.gesa
The generalized BWT (option -b) is stored in a textual file in the same directory of FILE:
FILE.K.bwt
Notes
-
The output binary FILE.K.gesa can be read as a struct type according to the typedef in: lib/utils.h
-
Temporary files are stored in subfolders:
partition/
tmp/
Output the BWT inside GESA structure, that is GESA[i] = (text, suffix, lcp, bwt):
make clean
make compile BWT=1
One can inform the maximum available internal memory to be used (in MB):
make clean
make compile MEMLIMIT=10
To see a more detailed execution output use:
make clean
make compile DEBUG=1
We have included the source codes of the following algorithms:
- gSACA-K: generalized suffix array construction algorithm [3]
- lcp-PHI: LCP-array construction algorithm given the suffix array [4]
Please, if you use egsa tool in an academic setting cite the following paper [1]:
@article{Louza2017d,
author = {Louza, Felipe A. and Telles, Guilherme P. and Hoffmann, Steve and Ciferri, Cristina D. A.},
title={Generalized enhanced suffix array construction in external memory},
journal={Algorithms for Molecular Biology},
year = {2017},
pages = {26:1--26:16},
doi = {10.1186/s13015-017-0117-9}
}
[1] Louza, F. A., Telles, G. P., Hoffmann, S., Ciferri, C. D. A. (2017). Generalized enhanced suffix array construction in external memory. Algorithms for Molecular Biology 12(1): 26:1-26:16. link
[2] Louza, F. A., Telles, G. P., Ciferri, C. D. A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In Proc. CPM (pp. 201-210). link
[3] Louza, F. A., Gog, S., Telles, G. P. (2016). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github
[4] Kärkkäinen, J., Manzini, G., & Puglisi, S. J. (2009). Permuted Longest-Common-Prefix Array. In G. Kucherov & E. Ukkonen (Eds.), Proc. CPM (Vol. 5577, pp. 181–192).
Thanks to Fabio Garofalo, Giovanna Rosone, Giovanni Manzini, Nicola Prezza and Guilherme P. Telles by helpful suggestions and debugging.