- Efficient Data Structure: Stochastic tile hashing for rapid sequence comparison
- Memory Scalable: Configurable memory usage based on dataset size
- Multi-threaded: OpenMP support for parallel processing
- Two-stage Pipeline: Separate fill and cut operations for optimal performance
- Robust Detection: Handles sequence errors and gaps in genomic data
- Installation
- Quick Start
- Usage
- Algorithm Overview
- API Documentation
- Demo
- Publications
- Contributing
- License
- Compiler: C++ compiler with OpenMP support (GCC 4.9+ or Clang 3.5+)
- Build System: CMake 3.10 or higher
- Dependencies: btllib v1.7.1-0
Install dependencies using Conda:
conda install -c anaconda cmake
conda install -c bioconda btllib
Clone the repository and build Stash:
git clone <repository-url>
cd Stochastic_Tile_Hashing
cmake -S . -B Build
cd Build
make
After successful compilation, you will find:
- Executable:
Stochastic_Tile_Hashing/Build/Stash/Stash
- Static Library:
Stochastic_Tile_Hashing/Build/Stash/libStash.a
- Headers:
Stochastic_Tile_Hashing/Stash/Include/
-
Fill a Stash with your sequencing reads:
./Stash fill -r reads.fa -o stash.bin -l 30 -t 8
-
Detect and correct misassemblies in your assembly:
./Stash cut -a assembly.fa -o corrected_assembly.fa -s stash.bin -t 8
The Stash executable operates in two distinct modes:
Populates a Stash data structure with sequencing reads for subsequent analysis.
Memory Requirements: The Stash size is 2^(logRows + 3)
bytes. For example, with logRows=30
, Stash requires 8 GB of memory.
Parameter | Short | Description | Default |
---|---|---|---|
--reads |
-r |
Input reads in FASTA format | Required |
--output |
-o |
Output Stash file path | Required |
--logRows |
-l |
Log₂ of number of Stash rows | 30 |
--threads |
-t |
Number of processing threads | 8 |
./Stash fill -r reads.fa -o stash.bin -l 30 -t 8
Analyzes an assembly against a populated Stash to detect and correct misassemblies.
Parameter | Short | Description | Default |
---|---|---|---|
--assembly |
-a |
Input assembly in FASTA format | Required |
--stash |
-s |
Input Stash file path | Required |
--output |
-o |
Corrected assembly output path | Required |
--threads |
-t |
Number of processing threads | 8 |
--number_of_frames |
-n |
Number of frames for analysis | 1 |
--stride |
-r |
Stride between frames | 13 |
--delta |
-l |
Delta parameter | 751 |
--threshold |
-x |
Cut threshold for misassembly detection | 11 |
--max_pooling_radius |
-m |
Maximum pooling radius | 1 |
--min_cut_distance |
-d |
Minimum distance between cuts | 1000 |
./Stash cut -a assembly.fa -o corrected_assembly.fa -s stash.bin -t 8
The Stash data structure implements stochastic tile hashing for efficient sequence comparison. A new Stash can be created using:
Stash(uint32_t logRows, const std::vector<std::string>& spacedSeeds)
Constructor Parameters:
logRows
: Determines the number of rows as 2^logRowsspacedSeeds
: Vector of strings representing spaced seed frames
Persistence:
void save(const char* path)
: Saves Stash data and parameters to binary fileStash(const char* path)
: Restores Stash from saved binary file
The following figure illustrates the Stash data structure algorithm and sequence data population process with 4 spaced seed patterns (h1-h4). The algorithm combines spaced seed output values with sequence ID hashes, where for the i-th spaced seed pattern, the i-th most significant bits of other spaced seed pattern outputs are concatenated to address an index in the sequence ID hash tiles, specifying the column and stored tile value in Stash.
A frame represents the set of Stash rows accessed for a given spaced seed frame. It is implemented as a two-dimensional array of tiles with:
- Width: Number of spaced seeds
- Height: Number of Stash columns
Number of Matches Metric: Defined between two Stash frames, this metric quantifies sequence similarity. For related frames, this value is significantly larger than for unrelated frames. In genome misassembly detection, this metric represents the number of overlapping sequencing reads covering a genomic region (read set coverage).
Windows of frames provide a robust approach to genomic region analysis. Each window consists of frames separated by a stride distance. Windows are compared by counting the maximum number of matches between all frame pairs, providing:
- General genomic region view: Comprehensive analysis beyond individual frames
- Error tolerance: Robust handling of sequence errors and gaps
- Improved accuracy: Statistical significance through multiple frame comparisons
The Stash library provides a C++ API for integration into custom applications:
#include "Stash/Include/Stash.h"
// Create a new Stash
Stash stash(30, spacedSeeds);
// Save to file
stash.save("my_stash.bin");
// Load from file
Stash loadedStash("my_stash.bin");
Link against the static library:
g++ -o my_program my_program.cpp -L./Stash -lStash -fopenmp
The TestData
directory contains example datasets to demonstrate Stash's capabilities:
test_reads.fa
: 14 duplicate reads for Stash populationtest_assembly.fa
: Assembly with misassembled patterns
-
Populate Stash with test reads:
./Stash fill -r test_reads.fa -o stash.bin
-
Correct misassemblies in test assembly:
./Stash cut -a test_assembly.fa -o corrected_assembly.fa -s stash.bin
The corrected assembly will have misassembled patterns identified and appropriately segmented.
-
Sarvar, A., Coombe, L., Warren, R., & Birol, I. (2023, April 14–19). Stash: A data structure based on stochastic tile hashing [Conference presentation]. RECOMB-Seq Satellite Conference on Biological Sequence Analysis 2023, Istanbul, Turkey.
-
Sarvar, A., Coombe, L., Warren, R., & Birol, I. (2022, July 10–14). Genome misassembly detection using Stash: A data structure based on stochastic tile hashing [Conference presentation]. Intelligent Systems for Molecular Biology 2022, Madison, WI, United States.
We welcome contributions to improve Stash! Please see our Contributing Guidelines for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
- Research concept and design: Inanc Birol and Armaghan Sarvar
- Implementation and experimentation: Armaghan Sarvar
- Feedback and guidance: Lauren Coombe, René Warren
For questions, issues, or feature requests, please visit our GitHub Issues page.