Stash: A Data Structure for Genome Misassembly Detection and Correction

Overview

Key Features

Efficient Data Structure: Stochastic tile hashing for rapid sequence comparison
Memory Scalable: Configurable memory usage based on dataset size
Multi-threaded: OpenMP support for parallel processing
Two-stage Pipeline: Separate fill and cut operations for optimal performance
Robust Detection: Handles sequence errors and gaps in genomic data

Installation

Prerequisites

Compiler: C++ compiler with OpenMP support (GCC 4.9+ or Clang 3.5+)
Build System: CMake 3.10 or higher
Dependencies: btllib v1.7.1-0

Dependency Installation

Install dependencies using Conda:

conda install -c anaconda cmake
conda install -c bioconda btllib

Building from Source

Clone the repository and build Stash:

git clone <repository-url>
cd Stochastic_Tile_Hashing
cmake -S . -B Build
cd Build
make

Build Output

After successful compilation, you will find:

Executable: Stochastic_Tile_Hashing/Build/Stash/Stash
Static Library: Stochastic_Tile_Hashing/Build/Stash/libStash.a
Headers: Stochastic_Tile_Hashing/Stash/Include/

Quick Start

Fill a Stash with your sequencing reads:

./Stash fill -r reads.fa -o stash.bin -l 30 -t 8

Detect and correct misassemblies in your assembly:

./Stash cut -a assembly.fa -o corrected_assembly.fa -s stash.bin -t 8

Usage

The Stash executable operates in two distinct modes:

Fill Mode

Populates a Stash data structure with sequencing reads for subsequent analysis.

Memory Requirements: The Stash size is 2^(logRows + 3) bytes. For example, with logRows=30, Stash requires 8 GB of memory.

Parameters

Parameter	Short	Description	Default
`--reads`	`-r`	Input reads in FASTA format	Required
`--output`	`-o`	Output Stash file path	Required
`--logRows`	`-l`	Log₂ of number of Stash rows	30
`--threads`	`-t`	Number of processing threads	8

Example

./Stash fill -r reads.fa -o stash.bin -l 30 -t 8

Cut Mode

Analyzes an assembly against a populated Stash to detect and correct misassemblies.

Parameters

Parameter	Short	Description	Default
`--assembly`	`-a`	Input assembly in FASTA format	Required
`--stash`	`-s`	Input Stash file path	Required
`--output`	`-o`	Corrected assembly output path	Required
`--threads`	`-t`	Number of processing threads	8
`--number_of_frames`	`-n`	Number of frames for analysis	1
`--stride`	`-r`	Stride between frames	13
`--delta`	`-l`	Delta parameter	751
`--threshold`	`-x`	Cut threshold for misassembly detection	11
`--max_pooling_radius`	`-m`	Maximum pooling radius	1
`--min_cut_distance`	`-d`	Minimum distance between cuts	1000

Example

./Stash cut -a assembly.fa -o corrected_assembly.fa -s stash.bin -t 8

Algorithm Overview

Stash Data Structure

The Stash data structure implements stochastic tile hashing for efficient sequence comparison. A new Stash can be created using:

Stash(uint32_t logRows, const std::vector<std::string>& spacedSeeds)

Constructor Parameters:

logRows: Determines the number of rows as 2^logRows
spacedSeeds: Vector of strings representing spaced seed frames

Persistence:

void save(const char* path): Saves Stash data and parameters to binary file
Stash(const char* path): Restores Stash from saved binary file

The following figure illustrates the Stash data structure algorithm and sequence data population process with 4 spaced seed patterns (h1-h4). The algorithm combines spaced seed output values with sequence ID hashes, where for the i-th spaced seed pattern, the i-th most significant bits of other spaced seed pattern outputs are concatenated to address an index in the sequence ID hash tiles, specifying the column and stored tile value in Stash.

Frames

A frame represents the set of Stash rows accessed for a given spaced seed frame. It is implemented as a two-dimensional array of tiles with:

Width: Number of spaced seeds
Height: Number of Stash columns

Number of Matches Metric: Defined between two Stash frames, this metric quantifies sequence similarity. For related frames, this value is significantly larger than for unrelated frames. In genome misassembly detection, this metric represents the number of overlapping sequencing reads covering a genomic region (read set coverage).

Windows of Frames

Windows of frames provide a robust approach to genomic region analysis. Each window consists of frames separated by a stride distance. Windows are compared by counting the maximum number of matches between all frame pairs, providing:

General genomic region view: Comprehensive analysis beyond individual frames
Error tolerance: Robust handling of sequence errors and gaps
Improved accuracy: Statistical significance through multiple frame comparisons

API Documentation

C++ API

The Stash library provides a C++ API for integration into custom applications:

#include "Stash/Include/Stash.h"

// Create a new Stash
Stash stash(30, spacedSeeds);

// Save to file
stash.save("my_stash.bin");

// Load from file
Stash loadedStash("my_stash.bin");

Linking

Link against the static library:

g++ -o my_program my_program.cpp -L./Stash -lStash -fopenmp

Demo

The TestData directory contains example datasets to demonstrate Stash's capabilities:

Test Dataset

test_reads.fa: 14 duplicate reads for Stash population
test_assembly.fa: Assembly with misassembled patterns

Running the Demo

Populate Stash with test reads:

./Stash fill -r test_reads.fa -o stash.bin

Correct misassemblies in test assembly:

./Stash cut -a test_assembly.fa -o corrected_assembly.fa -s stash.bin

The corrected assembly will have misassembled patterns identified and appropriately segmented.

Publications

Conference Presentations

Sarvar, A., Coombe, L., Warren, R., & Birol, I. (2023, April 14–19). Stash: A data structure based on stochastic tile hashing [Conference presentation]. RECOMB-Seq Satellite Conference on Biological Sequence Analysis 2023, Istanbul, Turkey.
Sarvar, A., Coombe, L., Warren, R., & Birol, I. (2022, July 10–14). Genome misassembly detection using Stash: A data structure based on stochastic tile hashing [Conference presentation]. Intelligent Systems for Molecular Biology 2022, Madison, WI, United States.

Contributing

We welcome contributions to improve Stash! Please see our Contributing Guidelines for details on:

Code style and standards
Testing requirements
Pull request process
Issue reporting

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

Research concept and design: Inanc Birol and Armaghan Sarvar
Implementation and experimentation: Armaghan Sarvar
Feedback and guidance: Lauren Coombe, René Warren

For questions, issues, or feature requests, please visit our GitHub Issues page.

Name		Name	Last commit message	Last commit date
Latest commit History 98 Commits
Stash		Stash
TestData		TestData
figures		figures
.gitignore		.gitignore
CLI11.hpp		CLI11.hpp
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
main.cpp		main.cpp

License

bcgsc/Stash

Folders and files

Latest commit

History

Repository files navigation

Stash: A Data Structure for Genome Misassembly Detection and Correction

Overview

Key Features

Table of Contents

Installation

Prerequisites

Dependency Installation

Building from Source

Build Output

Quick Start

Usage

Fill Mode

Parameters

Example

Cut Mode

Parameters

Example

Algorithm Overview

Stash Data Structure

Frames

Windows of Frames

API Documentation

C++ API

Linking

Demo

Test Dataset

Running the Demo

Publications

Conference Presentations

Contributing

License

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages