INDELseek Algorithm Reimplementation and Optimization

This repository contains the implementation and optimization of the INDELseek algorithm, an open-source tool designed for detecting complex insertions and deletions (indels) in Next-Generation Sequencing (NGS) data, as described the "INDELseek: detection of complex insertions and deletions from next-generation sequencing data" by Au C.H. et. al. Originally implemented in Perl, INDELseek is a crucial tool for identifying genetic variations that may play significant roles in diseases, including cancer. Our project focuses on re-implementing the algorithm in Python and C++ to enhance performance and usability, as well as comparing it to existing tools like GATK.

This analysis was performed as part of the final project for the "Algorithms in Molecular Biology" graduate course of the MSc Data Science & Information Technologies Master's programme (Bioinformatics - Biomedical Data Science Specialization) of the Department of Informatics and Telecommunications department of the National and Kapodistrian University of Athens (NKUA), under the supervision of professors Theodore Dalamagas and George Georgakilas, in the academic year 2022-2023.

Contributors

Workflow and Tools

Key Components

Original INDELseek Algorithm: Perl script enhanced with execution time tracking and memory logging.
Python Implementation: Uses dictionaries for caching and simplifies debugging with modular design.
C++ Implementation: Incorporates hash maps for caching, custom structs for memory efficiency, and a Makefile for easy compilation.
Memory Usage Visualization: graph.sh generates graphs for RAM usage over time during execution.

Tools and Datasets

Datasets: NGS datasets including NA12878 (1.5 billion reads) and HG00148.chr11 (4 million reads) were used.
Computing Environment: Testing was performed locally and on a Hypatia VM (7 cores, 60GB RAM).
Dependencies: Scripts rely on samtools for sequence processing and alignment.

Results

Execution Time: The C++ implementation demonstrated significant speed improvements (40 minutes for ~4M reads) compared to Python (5 hours 42 minutes) and Perl (5 hours 49 minutes).
Memory Usage: Resource footprint was consistently low (~600MB RAM for a 4M reads dataset).
Insights: Compiled languages like C++ show clear performance advantages, albeit at increased coding complexity.

Installation

Cloning the Repository

git clone https://github.com/GiatrasKon/INDELseek-Optimized-Implementations.git

Required Tools

samtools
- Processes BAM/SAM files for read alignments.
- Fetches reference sequences for specific regions using faidx.
- Computes depth of coverage using depth.
- Installation:

sudo apt install samtools

gnuplot
- Used in graph.sh to generate memory usage graphs for the running process.
- Installation:

sudo apt install gnuplot

perl
- The original INDELseek implementation is written in Perl. If you plan to compare outputs, ensure Perl is installed with the required modules like Getopt::Long.
- Installation:

sudo apt install perl

Dependencies

Python:

Ensure you have Python 3.x installed. Use the following commands to install required packages:

pip install argparse psutil subprocess datetime time collections pprint

C++ and Perl:

Ensure gcc (C++ compiler) and make are installed.
Install samtools (v1.3 or higher).
Verify Perl is installed with required modules like Getopt::Long.
Compile the C++ Implementation: Navigate to the scripts directory and run:

make

Datasets

NA12878 BAM File:
- This is a 113GB BAM file containing approximately 1.5 billion reads. It is aligned to the hg38 reference genome.
- Used to attempt reproduction of the original INDELseek results.
- Extracted subsets (chromosomes 20 and 15) for faster testing and analysis.
- How to Obtain:
  - Visit Illumina BaseSpace (requires an account).
  - Look for the NA12878 sample in publicly available datasets or the Genome in a Bottle (GIAB) project.
HG00148.chr11 BAM File:
- A BAM file containing approximately 4 million reads focused on chromosome 11.
- Used for benchmarking the Python and C++ implementations.
- Provides a smaller, manageable dataset for performance comparisons.
- How to Obtain:
  - Visit the 1000 Genomes Project website.
  - Download the HG00148 dataset.
Reference Genome (hg38):
- A FASTA file of the human genome (hg38 version) used for sequence alignment and comparison.
- Required for samtools operations like faidx and depth.
- Essential for running all INDELseek implementations.
- How to Obtain:
  - Broad Institute Public Repository.
  - UCSC Genome Browser.

How to Download and Prepare the Datasets:

NA12878 and HG00148: - BAM files can be downloaded from their respective sources using web interfaces or wget/curl commands (if direct URLs are provided).
Reference Genome: - Download the hg38.fa file and index it using samtools:

samtools faidx hg38.fa

Notes for Users:

Ensure sufficient storage space for large datasets.
Validate downloaded files (e.g., by comparing checksums).
Ensure compatibility between the reference genome version and BAM file alignments.
Follow ethical guidelines and licensing agreements when using public datasets.

Usage

Python:

python indelseek.py --ref_genome path/to/reference.fa --input_bam path/to/input.bam

C++:

./indelseek --reference path/to/reference.fa --input_bam path/to/input.bam --output_vcf output.vcf

Documentation

Refer to the documents directory for the project presentation, report, and the original authors' paper.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
documents		documents
images		images
scripts		scripts
C++.gitignore		C++.gitignore
LICENSE		LICENSE
Perl.gitignore		Perl.gitignore
Python.gitignore		Python.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

INDELseek Algorithm Reimplementation and Optimization

Contributors

Workflow and Tools

Key Components

Tools and Datasets

Results

Installation

Cloning the Repository

Required Tools

Dependencies

Datasets

Usage

Documentation

About

Releases

Packages

Languages

License

GiatrasKon/INDELseek-Optimized-Implementations

Folders and files

Latest commit

History

Repository files navigation

INDELseek Algorithm Reimplementation and Optimization

Contributors

Workflow and Tools

Key Components

Tools and Datasets

Results

Installation

Cloning the Repository

Required Tools

Dependencies

Datasets

Usage

Documentation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages