Saradomin: Hi-C Architect for Neural Networks

Introduction

The Saradomin project processes Hi-C FASTQ data, transforming raw sequencing reads into structured datasets suitable for neural network models.

Project goal

Objective: Transform Hi-C data for use in neural network (NN) models, ensuring the data is optimally formatted for training and validation purposes.

Key Tasks:

Data Transformation: Convert raw Hi-C data into a format suitable for neural network processing. This includes normalizing interaction frequencies and structuring the data into input features that a NN can efficiently process.
Dataset Division: Split the transformed Hi-C data into training and testing subsets. Ensure that each subset is representative of the overall data characteristics to maintain model reliability across different data points.
Customization of Read-Pair Disruption: Introduce a configurable disruption ratio for paired-end reads within the Hi-C data. This involves selectively altering the linkage of read pairs to simulate varying degrees of disruption, which is crucial for training the model to handle real-world variations and anomalies in Hi-C data.

Outcome: A well-prepared dataset that allows a neural network to learn from both undisturbed and artificially disrupted Hi-C interactions, thereby enhancing the model's robustness and applicability to real-world biological data analysis.
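
The disruption and dataset division tasks above can be sketched as follows. This is a minimal illustration, not Saradomin's actual implementation: the function names `disrupt_pairs` and `split_train_test`, the `(mate1, mate2)` tuple representation, and the rotation strategy for re-linking mates are all assumptions made for the example.

```python
import random


def disrupt_pairs(pairs, ratio, seed=0):
    """Re-link the second mate of roughly `ratio` of the read pairs.

    Hypothetical helper: `pairs` is a list of (mate1, mate2) tuples.
    Returns (new_pairs, labels), where labels[i] is 1 if pair i was disrupted.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    n = len(pairs)
    k = round(n * ratio)
    chosen = sorted(rng.sample(range(n), k))
    new_pairs = list(pairs)
    labels = [0] * n
    # At least two pairs are needed to swap mates between them.
    if k >= 2:
        mates = [pairs[i][1] for i in chosen]
        # Rotate by one so every chosen pair receives another pair's mate.
        rotated = mates[1:] + mates[:1]
        for i, mate in zip(chosen, rotated):
            new_pairs[i] = (pairs[i][0], mate)
            labels[i] = 1
    return new_pairs, labels


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle `items` and split them into (train, test) lists."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    n_test = round(len(items) * test_fraction)
    test = [items[i] for i in order[:n_test]]
    train = [items[i] for i in order[n_test:]]
    return train, test


if __name__ == "__main__":
    pairs = [(f"r{i}/1", f"r{i}/2") for i in range(10)]
    disrupted, labels = disrupt_pairs(pairs, ratio=0.4, seed=1)
    train, test = split_train_test(list(zip(disrupted, labels)), 0.2, seed=1)
    print(len(train), len(test), sum(labels))
```

Keeping a label per pair lets a model be trained to distinguish intact from disrupted pairs; a random shuffle before splitting is one simple way to keep both subsets representative.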

Installation

Prerequisites

Python 3.10 or higher

Environment Setup

We recommend using a Python virtual environment for dependency management:

Create a Virtual Environment:
```
python3 -m venv venv
```
Activate the Virtual Environment:
```
source venv/bin/activate
```
Install Dependencies: Navigate to the project's root directory and run:
```
pip3 install -r requirements.txt
```

Getting Started

Preparing Your Data

You can use the snippet of Hi-C data in test_data, or download a complete Hi-C dataset.

Configuration

Tailor Saradomin to your project needs by adjusting its configuration:

Custom Configuration: Pass settings as environment variables, for example by creating a .env file. Variables in .env override those in config.py.
Local Configuration: For quick adjustments, modify the config.py file in the root directory. This approach is recommended for temporary changes or small-scale projects.
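
The override order described above can be illustrated with a small sketch. It is not the project's actual loader: the setting names `DISRUPTION_RATIO` and `TEST_FRACTION`, the helper names, and the choice to let real process environment variables take precedence over the .env file are all assumptions for the example.

```python
import os

# Hypothetical defaults standing in for values defined in config.py.
DEFAULTS = {"DISRUPTION_RATIO": "0.1", "TEST_FRACTION": "0.2"}


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file, skipping blanks and comments."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(defaults, env_file=".env", environ=None):
    """Merge settings: defaults, then the .env file, then process variables."""
    environ = os.environ if environ is None else environ
    settings = dict(defaults)
    settings.update(load_env_file(env_file))
    # Assumption: a variable set in the process environment wins over .env.
    for key in defaults:
        if key in environ:
            settings[key] = environ[key]
    return settings


if __name__ == "__main__":
    print(resolve_settings(DEFAULTS))
```

All values are kept as strings here; a real loader would convert them to the types that config.py expects.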

Structure of Output File

The data file is structured into two distinct sections: the header and the data content.

Header

The header section encapsulates critical metadata about the file, including the date of creation and the schema of the read.

#HEADER#
#DATE=2024-04-22T12:47:59.038200
#pre_processing_version=[0, 1, 0]
#mapping: {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}
#schema=1.row UID	2.row NUCLEOTIDE	3.row SCORE
####END####
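
A header of this shape could be read back with a sketch like the one below. The function name `parse_header` is hypothetical, and the parsing rules are inferred only from the example header shown above (keys separated from values by `=` or `:`, with `mapping` and `pre_processing_version` written as Python literals); the real format may allow more than this.

```python
import ast
import re

# A header line is '#', a key, then '=' or ':', then the value.
HEADER_LINE = re.compile(r"^#(\w+)\s*[=:]\s*(.*)$")


def parse_header(lines):
    """Collect metadata found between '#HEADER#' and '####END####'."""
    meta = {}
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "#HEADER#":
            inside = True
            continue
        if line == "####END####":
            break
        if not inside:
            continue
        match = HEADER_LINE.match(line)
        if match:
            key, value = match.groups()
            meta[key] = value
    # In the example header these two values are Python literals.
    for key in ("mapping", "pre_processing_version"):
        if key in meta:
            meta[key] = ast.literal_eval(meta[key])
    return meta


if __name__ == "__main__":
    example = [
        "#HEADER#",
        "#DATE=2024-04-22T12:47:59.038200",
        "#pre_processing_version=[0, 1, 0]",
        "#mapping: {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}",
        "#schema=1.row UID\t2.row NUCLEOTIDE\t3.row SCORE",
        "####END####",
    ]
    print(parse_header(example))
```

`ast.literal_eval` is used instead of `eval` so that only plain literals (lists, dicts, numbers, strings) are accepted from the file.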

Data Content

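Each read always occupies three lines in the file:

1. UID
2. Sequence: each nucleotide is mapped to a number according to the mapping in the header.
3. Score: each quality score character is represented by its ASCII code.

For example, the two-base read GT with quality characters BB (ASCII 66) is stored as: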

1
[2, 3]
[66, 66]
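
The encoding of a single read can be sketched as follows. The function names `encode_read` and `decode_read` are hypothetical and the code is only an illustration of the three-line layout, using the mapping from the example header; it is not taken from the Saradomin source.

```python
import ast

# Mapping copied from the example header.
MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode_read(uid, sequence, quality, mapping=MAPPING):
    """Return the three-line record: UID, mapped nucleotides, ASCII scores."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    # A base missing from the mapping raises KeyError.
    nucleotides = [mapping[base] for base in sequence.upper()]
    scores = [ord(ch) for ch in quality]
    return f"{uid}\n{nucleotides}\n{scores}\n"


def decode_read(record, mapping=MAPPING):
    """Invert encode_read: return (uid, sequence, quality) as strings."""
    uid_line, nucleotide_line, score_line = record.splitlines()
    inverse = {number: base for base, number in mapping.items()}
    sequence = "".join(inverse[n] for n in ast.literal_eval(nucleotide_line))
    quality = "".join(chr(s) for s in ast.literal_eval(score_line))
    return uid_line, sequence, quality


if __name__ == "__main__":
    record = encode_read(1, "GT", "BB")
    print(record, end="")
    print(decode_read(record))
```

With the inputs above, `encode_read(1, "GT", "BB")` reproduces the example record: `G` and `T` map to 2 and 3, and the character `B` has ASCII code 66.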

Repository: ProfH2SO4/saradomin