ESPRESSO test not working with 112 threads #56

Open
pclavell opened this issue Jun 10, 2024 · 4 comments
@pclavell

I've installed ESPRESSO and was running the test when I found that it cannot run with 112 threads, but 100, 99, and 48 threads do work for some reason. I'm running with 100 threads at the moment (and it works), so it isn't a huge issue, but for your information I'll attach the logs.
I'm running this on a cluster with Slurm on a 112-thread node.

Script:

echo "step1"
perl ESPRESSO_S.pl -A testdata/test_data_espresso_sirv/SIRV_C.gtf -L test.tsv -F testdata/test_data_espresso_sirv/SIRV2.fasta -O test_sirv -T 112

echo "step2"
perl ESPRESSO_C.pl -I test_sirv -F testdata/test_data_espresso_sirv/SIRV2.fasta -X 0 -T 112

echo "step3"
perl ESPRESSO_Q.pl -A testdata/test_data_espresso_sirv/SIRV_C.gtf -L test_sirv/test.tsv.updated -V test_sirv/samples_N2_R0_compatible_isoform.tsv -T 112
Fail to calculate divided file size, total file size 6210095, thread 112:  at ESPRESSO_C.pl line 1732.
No valid read_final.list can be found in test_sirv/0.
Perl exited with active threads:
	112 running and unjoined
	0 finished and unjoined
	0 running and detached

step1
testdata/test_data_espresso_sirv/SIRV2_3.sort.sam	0
[Mon Jun 10 16:27:42 2024] Loading reference
Worker 0 begins to scan: 
 testdata/test_data_espresso_sirv/SIRV2_3.sort.sam
Worker 0 finished reporting.
[Mon Jun 10 16:27:47 2024] Re-cluster all reads
[Mon Jun 10 16:27:47 2024] Loading annotation
[Mon Jun 10 16:27:47 2024] Summarizing annotated splice junctions for each read group
testdata/test_data_espresso_sirv/SIRV2_3.sort.sam(0)
0_1(0)
Worker 0 begins to scan: 
 testdata/test_data_espresso_sirv/SIRV2_3.sort.sam
Worker 0 finished reporting.
[Mon Jun 10 16:27:53 2024] ESPRESSO_S finished its work.
step2
[Mon Jun 10 16:27:54 2024] Loading splice junction info
[Mon Jun 10 16:27:54 2024] Requesting system to split SAMLIST into 112 pieces
Fail to calculate divided file size, total file size 6210095, thread 112.
Fatal error. Aborted.
step3
[Mon Jun 10 16:27:54 2024] Loading annotation
[Mon Jun 10 16:27:54 2024] Summarizing annotated isoforms
[Mon Jun 10 16:27:54 2024] Loading corrected splice junctions and alignment information by ESPRESSO

@EricKutschera
Contributor

Here's the line for that error: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L1732

The code checks that each split is at least 0.01 of the total file size, and that check effectively limits the number of threads to 100. I'm not sure why that check was put in the code; I think it could probably be removed in a future version.

@saberiato

Hi @EricKutschera
I'm using ESPRESSO_C with 100 threads (--num_thread 100) and a sort buffer size of 8GB (--sort_buffer_size 8G) on a sample with more than 3 million FLNC reads, and after ~24 hours it still hasn't finished.
When I check the running processes with htop, I can see that ESPRESSO_C is only using 2 threads.
How can I actually parallelize ESPRESSO_C to speed up the process?

@EricKutschera
Contributor

By default ESPRESSO creates 1 C step job per input file, and within each C step job the threads work on different read groups. ESPRESSO defines the read groups by looking for alignments with overlapping coordinates. It sounds like you have 1 input file with 3 million reads for the same gene (FLNC). In that case all 3 million reads are being worked on by the same Perl thread. If you're seeing 2 threads being used, it could be because ESPRESSO runs nhmmer --cpu 2: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_C.pl#L1186

The snakemake workflow for ESPRESSO includes a parameter target_reads_per_espresso_c_job which limits the number of reads for each C step job and can split up a large read group: https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/snakemake_config.yaml#L28

You could use the snakemake, or manually use the scripts that the snakemake does to split up the C step into more jobs:
https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/split_espresso_s_output_for_c.py
https://github.com/Xinglab/espresso/blob/v1.4.0/snakemake/scripts/combine_espresso_c_output_for_q.py

Another option is to split your reads into multiple input files so that each input file becomes a separate C step job. You can give each split file the same sample name in your -L samples.tsv
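For example, a samples file for -L listing two split alignment files under one sample name might look like the following (tab-separated; the file names here are hypothetical, and the two-column layout is inferred from the `SIRV2_3.sort.sam	0` line in the step 1 log above):

```
reads_part1.sort.sam	sampleA
reads_part2.sort.sam	sampleA
```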

@saberiato

Thanks Eric,
I'll look into these.
