Suggestion and Experiment stuck at Running when suggestionCount < requests #1494
Comments
Thank you for creating this @midhun1998! Did you change these params during the Experiment run or after you submitted the Experiment?
Thanks for replying @andreyvelich. No, nothing was changed once the experiment had started. These were configured before the run. Also for the same experiment when
@midhun1998 Can you share the YAML for your examples, please? And which version of Katib did you install?
@andreyvelich Unfortunately, I'm not allowed to share the image that was used in the trial. The Katib version installed was 0.10.0.
@midhun1998 Thank you for sharing the example. Can you please share the logs from the Grid suggestion pod and verify which Trials were not created?
@andreyvelich The Katib controller pod logs were attached in the first comment of the issue. Unfortunately, the suggestion pod couldn't be found as it was cleaned up by Kubernetes.
Can you please try to run your Experiment again and check the logs from the Suggestion and Katib controller?
Sure
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/kind bug
What steps did you take and what happened:
parallelTrialCount specified was 5, maxTrialCount specified was 8 and maxFailedTrialCount specified was 2.
maxTrialCount value) which causes the trial to get stuck at Running state which in this case was 8 and requested was 10. This causes the pipeline Katib component to never finish, which in turn fails the pipeline.
This is only seen with the grid search algorithm, whereas random search works fine.
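The mismatch described above can be checked on a Suggestion object exported as JSON. This is a minimal sketch, assuming the Suggestion resource exposes the `spec.requests` and `status.suggestionCount` fields named in the issue title. The helper name `suggestion_shortfall` and the sample dict are hypothetical; the values 8 and 10 are the ones reported here.

```python
def suggestion_shortfall(suggestion: dict) -> int:
    """Return how many requested suggestions have not been produced yet.

    Assumes (not confirmed by this issue) that the object carries
    spec.requests and status.suggestionCount; missing fields count as 0.
    """
    requests = suggestion.get("spec", {}).get("requests", 0)
    produced = suggestion.get("status", {}).get("suggestionCount", 0)
    # A positive result means the controller is still waiting for suggestions.
    return max(requests - produced, 0)


# Hypothetical sample mirroring the numbers in this report.
example = {"spec": {"requests": 10}, "status": {"suggestionCount": 8}}
shortfall = suggestion_shortfall(example)
print(f"missing suggestions: {shortfall}")
```

A shortfall that stays above zero while no Trials are running would match the stuck state reported in this issue.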
The following logs are seen from the Katib controller:
What did you expect to happen:
Expected the experiment to complete with all requested number of trials.
Anything else you would like to add:
Similar to issue: #1168
Environment:
kfctl version: 1.2 (v1beta1)
kubectl version: 1.17