We compared the performance of GENA-LM models to:
- BigBird https://papers.nips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
- DNABERT https://pubmed.ncbi.nlm.nih.gov/33538820/
The original data comes from EPDnew: https://epd.epfl.ch/EPDnew_select.php (note that the EPDnew database has since been moved to a new location). We used the EPDnew select tool to fetch human promoter sequences (hg38). Four sequence lengths are used, with coordinates given relative to the transcription start site:
- Length 300. From -249 to 50. Results in a file hg38_len_300.fa.txt
- Length 2000. From -1000 to 999. Results in a file hg38_len_2000.fa.txt
- Length 8000 (like in BigBird paper). From -5000 to 2999. Results in a file hg38_len_8000.fa.txt
- Length 16000. From -8000 to 7999. Results in a file hg38_len_16000.fa.txt
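As a quick sanity check on the coordinates above, the following sketch verifies that each window, read as inclusive on both ends with position 0 counted, contains the stated number of bases. The inclusive reading is an assumption inferred from the listed lengths, and `WINDOWS`, `window_length`, and `fasta_name` are hypothetical helpers, not part of the repository.

```python
# Windows relative to the TSS as (start, end), taken from the list above.
# Assumption: both ends are inclusive and position 0 is counted.
WINDOWS = {
    300: (-249, 50),
    2000: (-1000, 999),
    8000: (-5000, 2999),
    16000: (-8000, 7999),
}


def window_length(start: int, end: int) -> int:
    """Number of positions in an inclusive [start, end] window."""
    return end - start + 1


def fasta_name(length: int) -> str:
    """File name used in this README for a given sequence length."""
    return f"hg38_len_{length}.fa.txt"


for length, (start, end) in WINDOWS.items():
    # Each window should contain exactly `length` positions.
    assert window_length(start, end) == length
    print(fasta_name(length), start, end)
```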
Run the script `dataset_generator.py` with the FASTA files obtained in the previous step.
>> python dataset_generator.py hg38_len_300.fa.txt
The script treats promoter sequences as positive targets and generates negative samples, following the same procedure as in the DeePromoter paper. It produces:
hg38_promoters_len_300_dataset.csv
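For readers unfamiliar with the DeePromoter negative-sampling idea, here is a minimal sketch. It assumes the commonly described scheme: split each promoter into 20 chunks, replace 12 randomly chosen chunks with random nucleotides, and keep the remaining 8 unchanged. The function `make_negative` and its parameters are hypothetical; the actual logic in `dataset_generator.py` may differ.

```python
import random


def make_negative(seq, n_chunks=20, n_replace=12, rng=None):
    """Build a negative sample by randomizing some chunks of a promoter.

    Assumption: 20 chunks with 12 replaced, as commonly described for
    DeePromoter; this is an illustration, not the repository's code.
    """
    rng = rng or random.Random()
    chunk_len = len(seq) // n_chunks
    # Split into n_chunks pieces; the last piece absorbs any remainder.
    chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks - 1)]
    chunks.append(seq[(n_chunks - 1) * chunk_len:])
    # Overwrite n_replace randomly chosen chunks with random nucleotides.
    for idx in rng.sample(range(n_chunks), n_replace):
        chunks[idx] = "".join(rng.choice("ACGT") for _ in range(len(chunks[idx])))
    return "".join(chunks)


promoter = "ACGT" * 75  # toy 300 bp sequence, not real data
negative = make_negative(promoter, rng=random.Random(0))
print(len(negative))
```

The negative sample keeps the length of the original sequence, so positive and negative examples can share one CSV with a label column.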
Run the `dataset_fold_split.py` script with the CSV files obtained from the dataset generator.
>> python dataset_fold_split.py hg38_promoters_len_300_dataset.csv
This produces five CSV files, named fold_1.csv to fold_5.csv, and the corresponding train/valid/test splits.
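One way such a split could be organized is sketched below: rows are shuffled into five disjoint folds, and each train/valid/test split uses one fold for testing, the next for validation, and the rest for training. This rotation scheme is an assumption for illustration only. `split_into_folds` and `train_valid_test` are hypothetical helpers, and `dataset_fold_split.py` may assign folds differently.

```python
import random


def split_into_folds(n_rows, n_folds=5, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_valid_test(folds, test_idx):
    """Use one fold for test, the next one for validation, the rest for train.

    Assumption: this rotation is illustrative, not the repository's scheme.
    """
    valid_idx = (test_idx + 1) % len(folds)
    test = folds[test_idx]
    valid = folds[valid_idx]
    train = [
        row
        for k, fold in enumerate(folds)
        if k not in (test_idx, valid_idx)
        for row in fold
    ]
    return train, valid, test


folds = split_into_folds(n_rows=1000)
for i in range(len(folds)):
    train, valid, test = train_valid_test(folds, i)
    print(f"fold_{i + 1}: train={len(train)} valid={len(valid)} test={len(test)}")
```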