Suggestion and Experiment stuck at Running when `suggestionCount` < `requests` #1494

midhun1998 · 2021-03-23T08:35:25Z

/kind bug

What steps did you take and what happened:

Run an experiment using the grid search algorithm.
parallelTrialCount was set to 5, maxTrialCount to 8 and maxFailedTrialCount to 2.
We ran this experiment multiple times. Sometimes the suggestion count returned by the suggestion controller is less than the number of requests (the maxTrialCount value), which causes the Experiment to get stuck in the Running state; in one run the suggestion count was 8 while 10 were requested. As a result, the Katib component in the pipeline never finishes, which in turn fails the pipeline.

This is only seen with the grid search algorithm whereas the random search works fine.

The following logs are seen from the Katib controller:

`{"level":"error","ts":1616487038.0000834,"logger":"suggestion-controller","msg":"Reconcile Suggestion error","Suggestion":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/suggestion.(*ReconcileSuggestion).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/suggestion/suggestion_controller.go:175\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"error","ts":1616487038.0001156,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"suggestion-controller","request":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"info","ts":1616487043.898402,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.8985455,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.8985615,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.902022,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9020426,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9020486,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9060924,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.90611,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9061158,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9130266,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9130516,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9130602,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
`
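
For reference, the counters in the `Statistics` lines can be pulled out of the JSON log programmatically. The snippet below is a minimal sketch, not Katib code: it parses one of the log lines above and recomputes the number of trials to add, assuming (this formula is an assumption, not taken from the controller source) that the controller wants `min(parallelCount, maxTrialCount - completedCount)` active trials. The helper name `expected_add_count` is made up for this example.

```python
import json

# One "Statistics" line copied from the controller log above.
LOG_LINE = (
    '{"level":"info","ts":1616487043.898402,"logger":"experiment-controller",'
    '"msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40",'
    '"requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}'
)

MAX_TRIAL_COUNT = 8  # maxTrialCount used for this run


def expected_add_count(stats: dict, max_trial_count: int) -> int:
    """Hypothetical helper: an assumed formula, not the controller's real code."""
    remaining = max_trial_count - stats["completedCount"]
    required_active = min(stats["parallelCount"], remaining)
    return max(required_active - stats["activeCount"], 0)


stats = json.loads(LOG_LINE)
add_count = expected_add_count(stats, MAX_TRIAL_COUNT)
print(add_count)  # 3
```

Under that assumption the result agrees with the `requiredActiveCount` of 3 and the `addCount` of 3 reported in the log, so the experiment controller itself appears to be asking for the remaining three trials on every reconcile.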

What did you expect to happen:
Expected the experiment to complete with the full number of requested trials.

Anything else you would like to add:
Similar to issue: #1168

Environment:

Kubeflow version (kfctl version): 1.2 (v1beta1)
Kubernetes version: (use kubectl version): 1.17


andreyvelich · 2021-03-23T12:52:47Z

Thank you for creating this @midhun1998!
Basically, Katib controller should delete extra Trials once you change parallelTrialCount: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller.go#L296-L305.

Did you change these parameters while the Experiment was running, or only when you submitted it?

midhun1998 · 2021-03-23T14:18:55Z

Thanks for replying @andreyvelich. No, nothing was changed once the experiment had started; these were configured before the run. Also, for the same experiment, when maxTrialCount was set to 10 and parallelTrialCount to 5, the suggestion had only 8 values, and since 10 were requested the experiment never finished. I see that "Delete Trials" never got printed in the controller logs. Can you confirm whether it's a bug?

andreyvelich · 2021-03-23T19:16:28Z

@midhun1998 Can you share the YAML for your examples, please? And which version of Katib did you install?

midhun1998 · 2021-03-25T06:07:42Z

@andreyvelich Unfortunately, I'm not allowed to share the image that was used in the trial. The Katib version installed was 0.10.0.

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: katib-d7c64e16-54fe-47ab-b744-121d62c49b40
  namespace: admin
spec:
  algorithm:
    algorithmName: grid
  maxFailedTrialCount: 2
  maxTrialCount: 8
  metricsCollectorSpec:
    collector:
      kind: StdOut
    source:
      filter:
        metricsFormat:
        - (loss)\s*:\s*((-?\d+)(\.\d+)?)
  objective:
    goal: 0.001
    metricStrategies:
    - name: loss
      value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 5
  parameters:
  - feasibleSpace:
      list:
      - "100"
    name: epochs
    parameterType: discrete
  - feasibleSpace:
      list:
      - "0.01"
    name: lr
    parameterType: discrete
  - feasibleSpace:
      list:
      - "3"
      - "5"
    name: num_layers
    parameterType: discrete
  - feasibleSpace:
      list:
      - "10"
      - "80"
    name: num_cells
    parameterType: discrete
  - feasibleSpace:
      list:
      - "32"
      - "64"
    name: batch_size
    parameterType: discrete
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: tensorflow
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Number of epoch
      name: epochs
      reference: epochs
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Batch size
      name: batchSize
      reference: batch_size
    - description: Number of cells in each layer
      name: numCells
      reference: num_cells
    - description: Number of layers
      name: numLayers
      reference: num_layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - python /opt/training.py --epochs=${trialParameters.epochs}
                --num_layers=${trialParameters.numLayers} --lr=${trialParameters.learningRate}
                --num_cells=${trialParameters.numCells} --batch_size=${trialParameters.batchSize}
              command:
              - sh
              - -c
              image: IMAGE
              name: tensorflow
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2021-03-19T06:27:36Z"
    lastUpdateTime: "2021-03-19T06:27:36Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-03-19T06:28:07Z"
    lastUpdateTime: "2021-03-19T06:28:07Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: katib-d7c64e16-54fe-47ab-b744-121d62c49b40-2w8qqh5s
    observation:
      metrics:
      - latest: "4.283613390922547"
        max: "4.283613390922547"
        min: "4.283613390922547"
        name: loss
    parameterAssignments:
    - name: epochs
      value: "100"
    - name: lr
      value: "0.01"
    - name: num_layers
      value: "3"
    - name: num_cells
      value: "100"
    - name: batch_size
      value: "32"
  startTime: "2021-03-19T06:27:36Z"
  succeededTrialList:
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-2w8qqh5s
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-cvsgbstl
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-f8t8wtss
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-r28rcjfq
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-vjvpgtcz
  trials: 5
  trialsSucceeded: 5
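
A grid over purely discrete parameters can supply at most as many distinct trials as the product of the list lengths. The sketch below is illustrative only: the helper names `grid_size` and `exceeds_grid` and the dictionary layout are made up for this example and are not a Katib API. It uses the five feasible spaces from the spec above and compares the grid size with a `maxTrialCount` of 8 and of 10.

```python
from math import prod

# Feasible spaces copied from the Experiment spec above (all discrete).
FEASIBLE_SPACE = {
    "epochs": ["100"],
    "lr": ["0.01"],
    "num_layers": ["3", "5"],
    "num_cells": ["10", "80"],
    "batch_size": ["32", "64"],
}


def grid_size(space: dict) -> int:
    """Number of distinct combinations in a full grid over `space`."""
    return prod(len(values) for values in space.values())


def exceeds_grid(space: dict, max_trial_count: int) -> bool:
    """True if more trials are requested than the grid can supply."""
    return max_trial_count > grid_size(space)


print(grid_size(FEASIBLE_SPACE))         # 8
print(exceeds_grid(FEASIBLE_SPACE, 8))   # False
print(exceeds_grid(FEASIBLE_SPACE, 10))  # True
```

With this spec the grid holds exactly 8 combinations, so a `maxTrialCount` of 8 uses the whole grid and a `maxTrialCount` of 10 asks for more suggestions than the grid contains.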

andreyvelich · 2021-03-26T17:06:33Z

@midhun1998 Thank you for sharing the example.
It might be that the Grid suggestion produces Trials that can't be reconciled by the controller, since the number of parameter combinations in your search space is exactly equal to maxTrialCount.

Please, can you share the logs from the Grid suggestion pod and verify which Trials were not created?
From the Suggestion pod you should be able to see all Trials which are produced by the Grid Suggestion.
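
One way to check which combinations never became Trials is to enumerate the full grid and compare it with the assignments of the Trials that exist. The sketch below is only an illustration: the `observed` list is a hypothetical placeholder (in practice it would be filled from the parameter assignments of the created Trials), the helper names are made up, and the order produced by `itertools.product` is not claimed to match the order used by the Grid suggestion service.

```python
from itertools import product

# Feasible spaces from the Experiment spec shared in this thread.
FEASIBLE_SPACE = {
    "epochs": ["100"],
    "lr": ["0.01"],
    "num_layers": ["3", "5"],
    "num_cells": ["10", "80"],
    "batch_size": ["32", "64"],
}


def all_combinations(space: dict) -> list:
    """Every parameter assignment in the full grid, as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def missing_combinations(space: dict, observed: list) -> list:
    """Grid assignments that do not appear among the observed trials."""
    seen = {tuple(sorted(a.items())) for a in observed}
    return [
        combo
        for combo in all_combinations(space)
        if tuple(sorted(combo.items())) not in seen
    ]


# Hypothetical placeholder: replace with the assignments of the real Trials.
observed = [
    {"epochs": "100", "lr": "0.01", "num_layers": "3",
     "num_cells": "10", "batch_size": "32"},
]

missing = missing_combinations(FEASIBLE_SPACE, observed)
print(len(missing))  # 7
for combo in missing:
    print(combo)
```

Comparing assignments as sorted tuples keeps the check independent of the order in which parameters are listed in each Trial.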

midhun1998 · 2021-03-30T14:18:00Z

@andreyvelich The Katib controller pod logs were attached in the first comment of the issue. Unfortunately, the suggestion pod could not be found, as it had been cleaned up by Kubernetes.

andreyvelich · 2021-04-01T16:31:03Z

Could you please run your Experiment again and check the logs from the Suggestion pod and the Katib controller?

midhun1998 · 2021-04-02T15:24:17Z

Sure

stale · 2021-07-08T02:19:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2021-07-29T04:54:34Z

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

google-oss-robot added the kind/bug label Mar 23, 2021

sidpalas mentioned this issue May 18, 2021

Grid Search stuck when parallelTrialCount < maxTrialCount #1534

Closed

stale bot added the lifecycle/stale label Jul 8, 2021

stale bot closed this as completed Jul 29, 2021
