Latency in job deployment is related to number of tasks in an unexpected way #1146

Closed
gaow opened this issue Jan 2, 2019 · 9 comments

Comments

@gaow
Member

gaow commented Jan 2, 2019

Following the MWE in #1139, where I have 34K jobs, it takes a few minutes to prepare the jobs. I turned on the -v4 option to see where the time is spent, and compared the run with one that uses only the first 30 of the 34K jobs. The 30 jobs deploy fast. The 34K run, however, seems to spend a lot of time on these Analyzing steps:

TRACE: Analyzing susie_bhat_2
DEBUG: Args f'{_input:n}.pdf'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'susie_bhat_1' of type str
TRACE: Analyzing susie_bhat_1
DEBUG: Args f'{_input:nn}.{annotation}.{suffix}.rds'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'susie_z_1' of type str
TRACE: Analyzing susie_z_1
DEBUG: Args f'{_input:nn}.{annotation}.{suffix}.rds'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'dap_1' of type str
TRACE: Analyzing dap_1
DEBUG: Args f'{_input:nn}.{annotation}.{suffix}.pkl'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'caviar_1' of type str
TRACE: Analyzing caviar_1
DEBUG: Args f'{_input:nn}.{suffix}.rds'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'caviar_2' of type str
TRACE: Analyzing caviar_2
DEBUG: Args f'{_input:n}.pdf'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'caviar_2' of type str
...

It takes a while to analyze each step. My questions:

  1. Is this because a global variable holds a list of length 34K, which slows down the other steps?
  2. I was in fact running only one workflow step (e.g. dap_1), but all workflow steps get analyzed. That seems wasteful.
  3. This has bothered me for a while: cannot be determined: name '_input' is not defined keeps showing up. Is it necessary?
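
For context on question 3, the message reads like the error Python itself raises when an f-string that references _input is evaluated in a namespace where that name does not exist yet. The sketch below is not SoS code: try_evaluate is a hypothetical helper, and it assumes only that the analysis stage tries to evaluate the expression ahead of time and reports the failure.

```python
def try_evaluate(expr, namespace):
    """Evaluate expr in a copy of namespace.

    Returns (value, None) on success, or (None, message) if a name is missing.
    """
    try:
        return eval(expr, {}, dict(namespace)), None
    except NameError as e:
        # Mirrors the shape of the DEBUG line in the log above (illustrative only).
        return None, f"Args {expr} cannot be determined: {e}"


# Before any input is known, the expression cannot be evaluated.
value, err = try_evaluate("f'{_input}.pdf'", {})
print(err)

# Once _input is defined, the same expression evaluates normally.
value, err = try_evaluate("f'{_input}.pdf'", {"_input": "result"})
print(value)
```

Under this reading the message is expected during static analysis, because _input only exists once the step actually runs.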
@BoPeng
Contributor

BoPeng commented Jan 3, 2019

  1. Yes, as I said before, right now the global section is executed multiple times, so processing a large file there will have some bad consequences. This is related to #1107 ("More efficient reset of global dictionary").
  2. Yes, the other steps are analyzed as auxiliary steps... I agree this can be improved.
  3. That is a debug message... I can remove it.
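
To illustrate point 1, here is a minimal sketch of why re-running an expensive global section once per analyzed step scales badly, and how running it once and copying the resulting namespace avoids that. It is not how SoS implements this; run_global_section, analyze_steps_naive and analyze_steps_cached are hypothetical names used only for illustration.

```python
def run_global_section(n_items, counter):
    """Stand-in for an expensive global section (e.g. building a 34K-item list)."""
    counter["runs"] += 1
    return {"jobs": list(range(n_items))}


def analyze_steps_naive(step_names, n_items, counter):
    """Re-execute the global section for every step that is analyzed."""
    results = {}
    for name in step_names:
        env = run_global_section(n_items, counter)
        results[name] = len(env["jobs"])
    return results


def analyze_steps_cached(step_names, n_items, counter):
    """Execute the global section once and hand each step a shallow copy."""
    base = run_global_section(n_items, counter)
    results = {}
    for name in step_names:
        env = dict(base)  # fresh namespace per step, without re-running the section
        results[name] = len(env["jobs"])
    return results


steps = ["susie_bhat_1", "susie_z_1", "dap_1", "caviar_1"]
naive_counter = {"runs": 0}
cached_counter = {"runs": 0}
analyze_steps_naive(steps, 34000, naive_counter)
analyze_steps_cached(steps, 34000, cached_counter)
print(naive_counter["runs"], cached_counter["runs"])
```

With the naive variant the cost of the global section is paid once per step, so a script that consolidates many steps pays it many times even when only one step is requested.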

@gaow
Member Author

gaow commented Jan 3, 2019

Continuing the discussion in #1107: I think 2 above is unnecessarily costly, and avoiding it would speed things up a lot.

BoPeng pushed a commit that referenced this issue Jan 3, 2019
@BoPeng
Contributor

BoPeng commented Jan 3, 2019

2 is fixed... The problem with your tasks is that some tasks are "aborted" for whatever reason. Maybe the scheduler killed them, maybe something is wrong with sos, but whereas they are removed from slurm ... . I have submitted a patch to collect .err files for aborted jobs, but I am not sure whether there are any.

@gaow
Member Author

gaow commented Jan 3, 2019

but whereas they are removed from slurm

Oh, is this the reason for #1149? Sorry, I am not sure which issue you are referring to. Are the tasks aborted likely due to #1147 (something wrong with SoS)? After running my pipeline there are a number of err files under the current repository, with the error message reported in #1147. Perhaps the task behind one of those files has status aborted?

@BoPeng
Contributor

BoPeng commented Jan 3, 2019

Still checking. How do I check the details of failed jobs on slurm?

@gaow
Member Author

gaow commented Jan 3, 2019

I only know sacct -j <job_id>, or otherwise I rely on SoS: sos status <task_id> -v4
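
If it helps, the sacct call can be wrapped in a small script. This is only a sketch: sacct_command and job_details are hypothetical helper names, the field list is one reasonable choice rather than anything prescribed in this thread, and job_details assumes sacct is on the PATH of the machine where it runs.

```python
import subprocess


def sacct_command(job_id, fields=("JobID", "State", "ExitCode", "Elapsed")):
    """Build the sacct command line for one job; --parsable2 gives '|'-separated output."""
    return ["sacct", "-j", str(job_id), "--format=" + ",".join(fields), "--parsable2"]


def job_details(job_id):
    """Run sacct and return one dict per output row (requires a Slurm installation)."""
    out = subprocess.run(
        sacct_command(job_id), capture_output=True, text=True, check=True
    ).stdout
    lines = out.strip().splitlines()
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


# 12345 is a placeholder job id, not one from this issue.
print(" ".join(sacct_command(12345)))
```

The State and ExitCode columns are usually the first place to look when a job disappeared from the queue without producing output.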

@gaow
Member Author

gaow commented Jan 3, 2019

I believe at least for my example on the cluster, problems 2 and 3 are still not resolved. Particularly 2 -- other steps are still analyzed:

TRACE: Set step_name to 'susie_bhat_1' of type str
TRACE: Analyzing susie_bhat_1
DEBUG: Args f'{_input:nn}.{annotation}.{suffix}.rds'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'susie_bhat_2' of type str
TRACE: Analyzing susie_bhat_2
DEBUG: Args f'{_input:n}.pdf'
 cannot be determined: name '_input' is not defined
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'susie_bhat_1' of type str
TRACE: Analyzing susie_bhat_1
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'susie_z_1' of type str
TRACE: Analyzing susie_z_1
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'dap_1' of type str
TRACE: Analyzing dap_1
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'caviar_1' of type str
TRACE: Analyzing caviar_1
TRACE: Set SOS_VERSION to '0.17.8' of type str
TRACE: Set step_name to 'caviar_2' of type str
TRACE: Analyzing caviar_2

(I specified susie_bhat step only.)

@BoPeng
Contributor

BoPeng commented Jan 4, 2019

2 is needed because we need to know what resources the auxiliary steps provide (e.g. sos_variable). This step could be delayed at the target resolving phase though.
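
A rough sketch of what delaying the analysis could look like: steps are analyzed only when some target actually has to be resolved, and each step is analyzed at most once. StepRegistry and its methods are hypothetical and do not reflect the SoS code base; each step is modeled as a callable that returns the set of targets it provides.

```python
class StepRegistry:
    """Analyze steps lazily, only when a target has to be resolved."""

    def __init__(self, steps):
        # steps: mapping of step name -> callable returning the targets it provides
        self._analyzers = dict(steps)
        self._provided = {}   # cache of analysis results, keyed by step name
        self.analyzed = []    # order in which steps were actually analyzed

    def _analyze(self, name):
        if name not in self._provided:
            self._provided[name] = set(self._analyzers[name]())
            self.analyzed.append(name)
        return self._provided[name]

    def resolve(self, target):
        """Return the name of the first step that provides target, or None."""
        for name in self._analyzers:
            if target in self._analyze(name):
                return name
        return None


registry = StepRegistry({
    "susie_bhat_1": lambda: {"a.rds"},
    "dap_1": lambda: {"a.pkl"},
    "caviar_1": lambda: {"b.rds"},
})
print(registry.analyzed)            # nothing is analyzed up front
print(registry.resolve("a.pkl"))    # analysis happens on demand
print(registry.analyzed)
```

If the requested step needs no extra targets, resolve is never called and none of the other steps are analyzed.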

@gaow
Member Author

gaow commented Jan 4, 2019

This step could be delayed at the target resolving phase though.

Yes, I think so, because none of my other steps are even auxiliary steps. They are just other steps in the script; consolidating many scripts into one is a feature of SoS.

BoPeng pushed a commit that referenced this issue Jan 4, 2019
@BoPeng BoPeng closed this as completed Jan 10, 2019