ESPRESSO test not working with 112 threads #56

Open
pclavell opened this issue Jun 10, 2024 · 4 comments
@pclavell

I've installed ESPRESSO and was running the test when I found that it cannot run with 112 threads, but 100, 99, and 48 threads do work for some reason. I'm running with 100 threads at the moment (and it works), so it isn't a huge issue, but for your information I'll attach the logs.
I'm running this on a cluster with Slurm on a 112-thread node.

Script:

echo "step1"
perl ESPRESSO_S.pl -A testdata/test_data_espresso_sirv/SIRV_C.gtf -L test.tsv -F testdata/test_data_espresso_sirv/SIRV2.fasta -O test_sirv -T 112

echo "step2"
perl ESPRESSO_C.pl -I test_sirv -F testdata/test_data_espresso_sirv/SIRV2.fasta -X 0 -T 112

echo "step3"
perl ESPRESSO_Q.pl -A testdata/test_data_espresso_sirv/SIRV_C.gtf -L test_sirv/test.tsv.updated -V test_sirv/samples_N2_R0_compatible_isoform.tsv -T 112
Fail to calculate divided file size, total file size 6210095, thread 112:  at ESPRESSO_C.pl line 1732.
No valid read_final.list can be found in test_sirv/0.
Perl exited with active threads:
	112 running and unjoined
	0 finished and unjoined
	0 running and detached

step1
testdata/test_data_espresso_sirv/SIRV2_3.sort.sam	0
[Mon Jun 10 16:27:42 2024] Loading reference
Worker 0 begins to scan: 
 testdata/test_data_espresso_sirv/SIRV2_3.sort.sam
Worker 0 finished reporting.
[Mon Jun 10 16:27:47 2024] Re-cluster all reads
[Mon Jun 10 16:27:47 2024] Loading annotation
[Mon Jun 10 16:27:47 2024] Summarizing annotated splice junctions for each read group
testdata/test_data_espresso_sirv/SIRV2_3.sort.sam(0)
0_1(0)
Worker 0 begins to scan: 
 testdata/test_data_espresso_sirv/SIRV2_3.sort.sam
Worker 0 finished reporting.
[Mon Jun 10 16:27:53 2024] ESPRESSO_S finished its work.
step2
[Mon Jun 10 16:27:54 2024] Loading splice junction info
[Mon Jun 10 16:27:54 2024] Requesting system to split SAMLIST into 112 pieces
Fail to calculate divided file size, total file size 6210095, thread 112.
Fatal error. Aborted.
step3
[Mon Jun 10 16:27:54 2024] Loading annotation
[Mon Jun 10 16:27:54 2024] Summarizing annotated isoforms
[Mon Jun 10 16:27:54 2024] Loading corrected splice junctions and alignment information by ESPRESSO

@EricKutschera
Contributor

Here's the line for that error: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L1732

The code checks that each split is at least 0.01 of the total file size, and that check effectively limits the number of threads to 100. I'm not sure why that check was put in the code; I think it could probably be removed in a future version.

@saberiato

Hi @EricKutschera
I'm using ESPRESSO_C with 100 threads (--num_thread 100) and a sort buffer size of 8GB (--sort_buffer_size 8G) on a sample with more than 3 million FLNC reads, and after ~24 hours it still hasn't finished.
When I check the running processes with htop, I can see that ESPRESSO_C is only using 2 threads.
How can I actually parallelize ESPRESSO_C to speed up the process?

@EricKutschera
Contributor

By default ESPRESSO creates 1 C step job per input file, and within each C step job the threads work on different read groups. ESPRESSO defines the read groups by looking for alignments with overlapping coordinates. It sounds like you have 1 input file with 3 million reads for the same gene (FLNC). In that case all 3 million reads are being worked on by the same Perl thread. If you're seeing 2 threads being used, it could be because ESPRESSO runs nhmmer --cpu 2: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L1186

The snakemake workflow for ESPRESSO includes a parameter target_reads_per_espresso_c_job which limits the number of reads for each C step job and can split up a large read group: https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/snakemake_config.yaml#L28

You could use the snakemake, or manually use the scripts that the snakemake does to split up the C step into more jobs:
https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/split_espresso_s_output_for_c.py
https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/combine_espresso_c_output_for_q.py

Another option is to split your reads into multiple input files so that each input file becomes a separate C step job. You can give each split file the same sample name in your -L samples.tsv
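For example, a samples file for -L listing two split alignment files under one sample name might look like the following (tab-separated; the file names here are hypothetical, and the two-column layout is inferred from the `SIRV2_3.sort.sam	0` line in the step 1 log above):

```
reads_part1.sort.sam	sampleA
reads_part2.sort.sam	sampleA
```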

@saberiato

Thanks Eric,
I'll look into these.
