README.md
dataset.py
dataset_fold_split.py
dataset_generator.py
finetune_promoter.sh
finetune_promoter_16000.sh
finetune_promoter_16000_nar_rmt_large.sh
finetune_promoter_16000_nar_rmt_large_pretrained.sh
finetune_promoter_2000.sh
finetune_promoter_2000_large.sh
finetune_promoter_2000_nar_large.sh
finetune_promoter_2000_nar_large_rmt.sh
finetune_promoter_large.sh
finetune_promoter_rmt.sh
finetune_promoter_rmt_pretrained.sh
hg38_len_2000.fa.txt
hg38_len_300.fa.txt
hg38_promoters_len_300_dataset.csv
run_promoter_finetuning.py
run_promoter_finetuning_rmt.py


Promoter prediction

We compared performance of GENA-LM models to

Dataset Preparation

Step 1. Download data

The original data comes from EPDnew: https://epd.epfl.ch/EPDnew_select.php (note that the EPDnew database was recently moved). We used the EPDnew selection tool to fetch human promoter sequences (hg38). Four sequence lengths are used:

Length 300: positions -249 to 50 relative to the TSS. Saved as hg38_len_300.fa.txt
Length 2000: positions -1000 to 999 relative to the TSS. Saved as hg38_len_2000.fa.txt
Length 8000 (as in the BigBird paper): positions -5000 to 2999 relative to the TSS. Saved as hg38_len_8000.fa.txt
Length 16000: positions -8000 to 7999 relative to the TSS. Saved as hg38_len_16000.fa.txt
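
Before moving on, it can help to confirm that each downloaded file really contains sequences of the expected length. The following is a minimal sketch, not part of this repository: `window_length`, `read_fasta` and `wrong_length_records` are hypothetical helper names, and the length arithmetic assumes that both bounds are inclusive and that position 0 is counted.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# Window bounds copied from the list above (assumed inclusive on both ends).
WINDOWS = {
    300: (-249, 50),
    2000: (-1000, 999),
    8000: (-5000, 2999),
    16000: (-8000, 7999),
}


def window_length(start: int, end: int) -> int:
    """Number of positions in the inclusive window [start, end]."""
    return end - start + 1


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wrong_length_records(handle: TextIO, expected: int) -> List[str]:
    """Return headers of records whose sequence length differs from expected."""
    return [h for h, seq in read_fasta(handle) if len(seq) != expected]


if __name__ == "__main__":
    # Tiny in-memory example; for real data, open e.g. hg38_len_300.fa.txt.
    demo = io.StringIO(">p1\nACGT\nAC\n>p2\nACG\n")
    print(wrong_length_records(demo, expected=6))
```

For a real file, pass an open file handle and the nominal length, for example `wrong_length_records(open("hg38_len_300.fa.txt"), 300)`; an empty list means every record matches.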

Step 2. Create a dataset

Run the script `dataset_generator.py` with the FASTA files obtained in the previous step.

>> python dataset_generator.py hg38_len_300.fa.txt

The script treats promoter sequences as positive examples and generates negative samples following the procedure described in the DeePromoter paper. It produces:

hg38_promoters_len_300_dataset.csv
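
For orientation, the negative-sampling idea from the DeePromoter paper can be sketched as follows: each positive sequence is cut into 20 subsequences, 12 randomly chosen ones are replaced with random nucleotides, and the remaining 8 are kept unchanged. This is only an illustrative sketch under those assumptions; `make_negative` is a hypothetical name, and the actual logic in `dataset_generator.py` may differ in its details.

```python
import random
from typing import Optional


def make_negative(
    seq: str,
    n_chunks: int = 20,      # assumed from the DeePromoter description
    n_replaced: int = 12,    # assumed from the DeePromoter description
    rng: Optional[random.Random] = None,
) -> str:
    """Build a negative sample by randomizing some chunks of a positive one."""
    rng = rng or random.Random()
    size = len(seq) // n_chunks
    # Split into n_chunks pieces; the last piece absorbs any remainder.
    chunks = [seq[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(seq[(n_chunks - 1) * size:])
    # Replace the chosen chunks with random nucleotides of the same length.
    for idx in rng.sample(range(n_chunks), n_replaced):
        chunks[idx] = "".join(rng.choice("ACGT") for _ in chunks[idx])
    return "".join(chunks)


if __name__ == "__main__":
    positive = "ACGT" * 75  # toy 300 bp sequence
    negative = make_negative(positive, rng=random.Random(0))
    print(len(positive), len(negative))
```

The negative keeps the length of the positive, so both classes can share one input size during fine-tuning.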

Step 3. Split to 5 folds

Run the `dataset_fold_split.py` script with the CSV files obtained from the dataset generator.

>> python dataset_fold_split.py hg38_promoters_len_300_dataset.csv

This produces five CSV files, named fold_1.csv through fold_5.csv, together with the corresponding train/valid/test splits.
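
One way such a split can be organized is sketched below. It is not the code from `dataset_fold_split.py`: the function names are hypothetical, and the rule for choosing the validation fold (the fold following the test fold) is an assumption made for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def split_into_folds(rows: Sequence, n_folds: int = 5, seed: int = 42) -> List[list]:
    """Shuffle rows reproducibly and deal them into n_folds near-equal folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def train_valid_test(folds: List[list], test_idx: int) -> Tuple[list, list, list]:
    """Use one fold for testing, the next one for validation, the rest for training."""
    n = len(folds)
    valid_idx = (test_idx + 1) % n  # assumption: validation fold follows the test fold
    train = [
        row
        for i, fold in enumerate(folds)
        if i not in (test_idx, valid_idx)
        for row in fold
    ]
    return train, folds[valid_idx], folds[test_idx]


if __name__ == "__main__":
    folds = split_into_folds(range(10))
    train, valid, test = train_valid_test(folds, test_idx=0)
    print(len(train), len(valid), len(test))
```

With five folds, each test fold holds about 20% of the data, and no row appears in more than one of the three subsets for a given split.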
