5ttgen bug with recent version fsl_sub and Debian/Ubuntu Gridengine #2595

Open
glasserm opened this issue Mar 4, 2023 · 4 comments

Comments

@glasserm

glasserm commented Mar 4, 2023

In fsl.py, check_first() uses SGE_ROOT to check for the SGE queuing system; however, not all SGE queuing systems set SGE_ROOT (e.g., Gridengine on Debian/Ubuntu). If either SGE_ROOT or FSLPARALLEL were checked for, this would work on more people's systems. I am not a Python coder, but on my system simply replacing SGE_ROOT with FSLPARALLEL resolved the issue. I don't know whether SGE_ROOT is still used by others, however.
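
For illustration, a minimal sketch of what checking either variable could look like. The helper name `queue_possibly_in_use` and the tuple `QUEUE_HINT_VARS` are hypothetical and are not taken from the MRtrix3 code; only the two environment variable names come from this report. It treats a set, non-empty variable as a hint that a queue may be in use, which is an assumption rather than a guarantee.

```python
import os

# Hypothetical names for illustration; not part of MRtrix3 or FSL.
# Only the two environment variable names come from the discussion.
QUEUE_HINT_VARS = ("SGE_ROOT", "FSLPARALLEL")


def queue_possibly_in_use(env=None):
    """Return True if any variable hinting at a job queue is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in QUEUE_HINT_VARS)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment, e.g. `queue_possibly_in_use({"FSLPARALLEL": "1"})`.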

@Lestropie
Member

Thanks for the report, Matt. Testing FSLPARALLEL in fsl.check_first() might resolve non-completion in some use cases, but where one or more FIRST jobs fail and SGE is not being used, it could cause the script to hang rather than exit with an error.

It's not clear exactly what issue has been encountered and how:

  • fsl_sub itself checks for SGE_ROOT, so I don't think an SGE environment could be in use without that variable being set;
  • fsl_sub could use FSLPARALLEL > 1 without using SGE, and when that happens it calls wait, which means that all subprocesses should have completed by the time run_first_all completes. Absence of expected VTK files in such a scenario should therefore ideally result in an error message, rather than waiting indefinitely for files that will never appear.
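
The distinction between the two cases could be sketched as below. This is only a sketch under stated assumptions: `wait_for_outputs`, its parameters, and the default timeout and polling interval are all hypothetical, and it is not the logic of fsl.check_first(). If no queue is in use, missing files are reported immediately as an error; if a queue may be in use, the function polls until the files appear or a timeout expires.

```python
import os
import time


def wait_for_outputs(paths, queue_in_use, timeout=3600.0, poll=5.0):
    """Hypothetical helper: check that every expected output file exists.

    Without a queue, all jobs have already finished, so a missing file is an
    immediate error. With a queue, poll until the files appear or time runs out.
    """
    missing = [p for p in paths if not os.path.isfile(p)]
    if not missing:
        return
    if not queue_in_use:
        # Local execution has completed: these files will never appear.
        raise FileNotFoundError("Missing outputs: " + ", ".join(missing))
    deadline = time.monotonic() + timeout
    while missing:
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for: " + ", ".join(missing))
        time.sleep(poll)
        missing = [p for p in missing if not os.path.isfile(p)]
```

The timeout is there so that a failed queued job cannot cause an indefinite wait; the appropriate value depends on the system and is left to the caller.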

Is there a discussion that's happened elsewhere that you can link me to?

@glasserm
Author

glasserm commented Mar 5, 2023

Debian/Ubuntu gridengine and FSL launch jobs on SGE without SGE_ROOT being set. On my system, swapping in FSLPARALLEL allowed the MRtrix3 code to wait for the jobs to complete; otherwise it failed immediately. I don't see any other environment variables that would work. There is a lot of ongoing modification of fsl_sub in recent versions of FSL, though, and it is certainly possible that FSLPARALLEL is a holdover from older versions of FSL: it is being set in my .bashrc, and unsetting it does not seem to prevent jobs from going to SGE through fsl_sub either. Perhaps the FSL developers could suggest a better solution for wrapper scripts that call fsl_sub in recent FSL versions? fsl_sub (and first) still returns a job ID, so perhaps monitoring the completion of that job would be a solution.

@Lestropie
Member

Lestropie commented Mar 5, 2023

I was looking at bash fsl_sub in 6.0.5.2; looks like there's a whole Python module now... SGE_ROOT doesn't even appear in that repository... I wasn't aware of those changes.

Capturing the hold job ID from run_first_all might be an option now. I avoided this in the past as it precludes the complete separation between execution and verification. I've also never had an SGE setup on which to test data myself where FIRST does and does not succeed; I've just iteratively revised the code based on user-reported issues.
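
As a rough sketch of the capturing step only: the function below assumes, without confirmation from the FSL source, that the job ID is printed as a bare integer on the last non-empty line of the captured standard output. The name `parse_job_id` is hypothetical. It returns `None` when no ID can be recovered, so the caller can fall back to file-based verification.

```python
def parse_job_id(stdout_text):
    """Return the job ID from the last non-empty line of output, or None.

    Assumption: the ID is a bare integer on the final non-empty line.
    """
    lines = [line.strip() for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        return int(lines[-1])
    except ValueError:
        # Last line is not an integer: treat the job ID as unavailable.
        return None
```

How the returned ID is then used to wait for completion depends on the fsl_sub version in use and is not shown here.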

Manually setting SGE_ROOT in your environment to trick fsl.check_first() is an alternative hack fix that doesn't require modification of code.
(Edit: Obviously not a universal solution, but will get anyone by until I figure out how I want to change the code)

@glasserm
Author

glasserm commented Mar 5, 2023

I tried that, but setting it to something random broke the new fsl_sub. I couldn't figure out what it should actually be, because Debian/Ubuntu gridengine is scattered across multiple folders.

Lestropie added a commit that referenced this issue Mar 7, 2023
As per discussion in #2595.
Function will utilise the PID reported by the run_first_all script, as this seems to be intended for use in determining when all processing tasks have been completed.
Lestropie added a commit that referenced this issue Mar 22, 2023
If possible, use fsl_sub command to halt execution until all jobs have completed.
Result of discussion in #2597.
Addresses #2595.
Lestropie added a commit that referenced this issue Mar 22, 2023
If possible, use fsl_sub command to halt execution until all jobs have completed.
Result of discussion in #2597.
Addresses #2595.
Replicates some contents of bd3f19e.

2 participants