Latency in job deployment is related to number of tasks in an unexpected way #1146
Continuing the discussion from #1107: I think 2 above is unnecessarily costly, and avoiding it would speed things up a lot.
2 is fixed. The problem with your tasks is that some of them are "aborted" for whatever reason: maybe the scheduler killed them, maybe something is wrong with SoS, but whereas they are removed from Slurm ... . I have submitted a patch to collect .err files for aborted jobs, but I am not sure if there are any.
Oh, is this the reason for #1149? Sorry, I'm not sure which issue you are referring to. Are the tasks likely aborted due to #1147 (something wrong with SoS)? After running my pipeline there will be a number of …
Still checking. How do I check the details of failed jobs on Slurm?
I only know …
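Slurm's accounting command `sacct` is one way to inspect failed or aborted jobs. It reports the state and exit code of each job. The sketch below is an illustrative helper and is not part of SoS. Note the following assumptions:
- `query_sacct` and `failed_jobs` are hypothetical names.
- The sample output is made up.
- The set of "failed" states is an assumption you may want to adjust.

The helper asks `sacct` for pipe-delimited output (`--parsable2`). It then keeps only the rows whose state shows that the job did not finish normally.

```python
import subprocess

# Assumed set of Slurm job states meaning "did not finish normally".
FAILED_STATES = {
    "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY",
    "NODE_FAIL", "PREEMPTED", "BOOT_FAIL", "DEADLINE",
}
FIELDS = "JobID,JobName,State,ExitCode"


def query_sacct(job_ids):
    """Hypothetical helper: run sacct for the given job IDs and return its raw text."""
    result = subprocess.run(
        ["sacct", "-j", ",".join(job_ids), "--parsable2", f"--format={FIELDS}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def failed_jobs(sacct_output):
    """Parse pipe-delimited sacct output and return the rows in a failed state."""
    lines = sacct_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split("|")
    failed = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        # sacct may print e.g. "CANCELLED by 5001"; keep only the state word.
        state = row["State"].split()[0] if row.get("State") else ""
        if state in FAILED_STATES:
            failed.append(row)
    return failed


# Made-up sample in the shape produced by --parsable2 (not real output).
SAMPLE = """JobID|JobName|State|ExitCode
101|t1|COMPLETED|0:0
102|t2|FAILED|1:0
103|t3|CANCELLED by 5001|0:15
104|t4|TIMEOUT|0:0"""

for row in failed_jobs(SAMPLE):
    print(row["JobID"], row["State"], row["ExitCode"])
```

For a single job, you can run `sacct -j <jobid> --format=JobID,JobName,State,ExitCode` directly in the shell to get the same information. Job steps such as `<jobid>.batch` appear as separate rows.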
I believe that, at least for my example on the cluster, problems 2 and 3 are still not resolved, particularly 2: other steps are still analyzed.
(I specified …)
2 is needed because we need to know what resources the auxiliary steps …
Yes, I think so, because none of my other steps are even auxiliary steps. They are just other steps in the script; one feature of SoS is consolidating many scripts into one.
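The exchange above is about which steps actually need analysis when only one step is requested. Below is a minimal sketch of one possible approach. It uses a made-up workflow model: the `STEPS` dictionary, the step names other than `dap_1`, and `steps_to_analyze` are all hypothetical and are not SoS internals. The approach works as follows:
1. Start from the requested step.
2. Follow the targets that step needs to the steps that provide them.
3. Analyze only the steps reached this way.

The sketch assumes that what each step provides is known up front. The reply above explains that auxiliary steps are analyzed precisely to learn this. So the sketch only shows how unrelated, non-auxiliary steps could be skipped once that information is available.

```python
from collections import deque

# Hypothetical workflow model: each step lists the targets it needs and provides.
# Only "dap_1" comes from the issue; the other steps are made up for illustration.
STEPS = {
    "dap_1": {"needs": ["ref.fa"], "provides": ["dap.out"]},
    "index": {"needs": [], "provides": ["ref.fa"]},        # auxiliary-style step
    "plot_1": {"needs": ["dap.out"], "provides": ["plot.png"]},
    "qc_1": {"needs": [], "provides": ["qc.html"]},         # unrelated step
}


def steps_to_analyze(target_step, steps):
    """Return the steps reachable from target_step through the targets it needs."""
    # Map each provided target to the step that produces it.
    provider = {t: name for name, s in steps.items() for t in s["provides"]}
    selected, queue = set(), deque([target_step])
    while queue:  # breadth-first walk over "needs" edges
        name = queue.popleft()
        if name in selected:
            continue
        selected.add(name)
        for need in steps[name]["needs"]:
            if need in provider:
                queue.append(provider[need])
    return selected


print(sorted(steps_to_analyze("dap_1", STEPS)))  # prints ['dap_1', 'index']
```

In this example, `qc_1` and `plot_1` are never visited when only `dap_1` is requested, so they would not need to be analyzed.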
Along the lines of the MWE in #1139, where I have 34K jobs, it takes a few minutes to prepare the jobs. I turned on the `-v4` option to see where it spends its time, and compared it with using only the first 30 of the 34K jobs. The 30 jobs deploy fast. The 34K, however, seem to spend a lot of time on these `Analyzing` steps: it takes a while to analyze each one. My questions:
- … (`dap_1`) but all workflow steps get analyzed. Seems a waste.
- `cannot be determined: name '_input' is not defined` keeps showing up. Is it necessary?