Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsl.check_first(): Second attempt at refinement #2609

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Lestropie
Copy link
Member

Response to suggestion by @glasserm in #2597.

fsl_sub seemingly has a built-in functionality for halting execution until all asynchronous jobs have been completed. Hopefully grabbing the job ID of the final executed job in run_first_all and waiting on that will be sufficient.

What I don't know is the point in time at which this functionality was added to fsl_sub, and therefore whether this change will only work on some minimum version of FSL. But failures of this test should be caught and check_first() will revert to prior behaviour.

Requires testing on a functioning SGE cluster.

If possible, use fsl_sub command to halt execution until all jobs have completed.
Result of discussion in #2597.
Addresses #2595.
Replicates some contents of bd3f19e.
@glasserm
Copy link

glasserm commented Mar 22, 2023

This didn't work and I don't know enought python to fix it:

Command:  run_first_all -m none -s L_Accu,R_Accu,L_Caud,R_Caud,L_Pall,R_Pall,L_Puta,R_Puta,L_Thal,R_Thal,L_Amyg,R_Amyg,L_Hipp,R_Hipp -i T1.nii -o first -b -d

5ttgen: [ERROR] Unhandled Python exception:
5ttgen: [ERROR]   TypeError: bufsize must be an integer
5ttgen: [ERROR] Traceback:
5ttgen: [ERROR]   /home/brainmappers/miniconda3/bin/5ttgen:58 (in execute())
5ttgen: [ERROR]     alg.execute()
5ttgen: [ERROR]   /home/brainmappers/miniconda3/lib/mrtrix3/_5ttgen/fsl.py:192 (in execute())
5ttgen: [ERROR]     fsl.check_first('first', structures=sgm_structures, first_stdout=first_stdout)
5ttgen: [ERROR]   /home/brainmappers/miniconda3/lib/mrtrix3/fsl.py:53 (in check_first())
5ttgen: [ERROR]     return_code = subprocess.call('fsl_sub', '-j', str(job_id))
5ttgen: [ERROR]   /home/brainmappers/miniconda3/lib/python3.8/subprocess.py:340 (in call())
5ttgen: [ERROR]     with Popen(*popenargs, **kwargs) as p:
5ttgen: [ERROR]   /home/brainmappers/miniconda3/lib/python3.8/subprocess.py:757 (in __init__())
5ttgen: [ERROR]     raise TypeError("bufsize must be an integer")
5ttgen: Scratch directory retained; location: /media/myelin/brainmappers/Connectome_Project/YA_HCP_Final/100307/T1w/Diffusion/5TTfsl

@Lestropie
Copy link
Member Author

Whoops sorry, that was me ruining correspondence between code tested and code pushed. 18e5910 should at least resolve that.

On my own system fsl_sub -j runs, but says "Either supply a command to run or a parallel task file". I'm wondering if it's not possible for the -j option to be used in isolation, it only allows for a submitted job to wait for some other job to complete before itself commencing. In which case perhaps the qstat solution is necessary?

@glasserm
Copy link

glasserm commented Mar 22, 2023

I get the same. I wasn't aware of any fsl_sub halting functionality.

Command:  run_first_all -m none -s L_Accu,R_Accu,L_Caud,R_Caud,L_Pall,R_Pall,L_Puta,R_Puta,L_Thal,R_Thal,L_Amyg,R_Amyg,L_Hipp,R_Hipp -i T1.nii -o first -b -d
usage: fsl_sub [-h] [-a ARCH] [-c COPROCESSOR] [--coprocessor_multi COPROCESSOR_MULTI] [--coprocessor_class COPROCESSOR_CLASS] [--coprocessor_class_strict] [--coprocessor_toolkit COPROCESSOR_TOOLKIT] [-F] [-j JOBHOLD]
               [--not_requeueable] [--array_hold ARRAY_HOLD] [-l LOGDIR] [-m MAILOPTIONS] [-M MAILTO] [-n] [-N NAME] [-p -1023-0] [-q QUEUE] [-r RESOURCE] [--delete_job DELETE_JOB] [--extra EXTRA] [-R GB]
               [-s [threads,]SLOTS] [-t ARRAY_TASK] [--array_native ARRAY_NATIVE] [-x NUMBER] [--keep_jobscript] [--has_coprocessor COPROCESSOR_NAME] [--has_queues] [--export EXPORT] [--project PROJECT] [-S] [-T MINUTES]
               [--show_config] [-v] [-V] [-z file]
               ...
fsl_sub: error: No command or array task file provided

5ttgen: [ERROR] FSL FIRST has failed; 0 of 14 structures were segmented successfully (check /media/myelin/brainmappers/Connectome_Project/YA_HCP_Final/100307/T1w/Diffusion/5TTfsl/first.logs)

You could submit a job with a hold on the last FIRST JobID that, for example, creates a file indicating that everything has completed and then check for the presence of that file if you don't want to interact with qstat directly.

@Lestropie
Copy link
Member Author

You could submit a job with a hold on the last FIRST JobID

The thought had crossed my mind. Just wary of the extent to which this is becoming increasingly clumsy. Decided to avoid qstat as that would require an alternative usage of squeue in case fsl_sub uses SLURM. Let's see if f8477aa does the job; it has no issue running in the absence of SGE, and I've tested what I could.

@glasserm
Copy link

That worked. Thanks for your efforts here.

@Lestropie Lestropie marked this pull request as ready for review March 24, 2023 04:17
@Lestropie Lestropie requested a review from a team March 24, 2023 04:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants