Suggestion and Experiment stuck at Running when suggestionCount < requests #1494

Closed
midhun1998 opened this issue Mar 23, 2021 · 10 comments

Comments

@midhun1998
Member

/kind bug

What steps did you take and what happened:

  1. Run an experiment following the grid search algorithm.
  2. parallelTrialCount specified was 5, maxTrialCount specified was 8 and maxFailedTrialCount specified was 2.
  3. We ran this experiment multiple times. Sometimes the number of suggestions returned by the suggestion controller is less than the number requested (the maxTrialCount value), which leaves the Suggestion and Experiment stuck in the Running state. In one such run, 8 suggestions were returned while 10 were requested. As a result, the Katib component in the pipeline never finishes, which in turn fails the pipeline.

This is only seen with the grid search algorithm whereas the random search works fine.

The following logs are seen from the Katib controller:

`{"level":"error","ts":1616487038.0000834,"logger":"suggestion-controller","msg":"Reconcile Suggestion error","Suggestion":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/suggestion.(*ReconcileSuggestion).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/suggestion/suggestion_controller.go:175\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"error","ts":1616487038.0001156,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"suggestion-controller","request":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"info","ts":1616487043.898402,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.8985455,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.8985615,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.902022,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9020426,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9020486,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9060924,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.90611,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9061158,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9130266,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9130516,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9130602,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
`
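For readers triaging similar logs, here is a minimal sketch of pulling the relevant counters out of the experiment-controller lines above. The two sample lines are shortened copies of the log entries (only the fields used are kept), and the `summarize` helper and the `outstanding` calculation are illustrative assumptions, not part of Katib.

```python
import json

# Shortened copies of two experiment-controller log lines from the output above.
log_lines = [
    '{"level":"info","logger":"experiment-controller","msg":"Statistics",'
    '"requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}',
    '{"level":"info","logger":"experiment-controller","msg":"GetOrCreateSuggestion",'
    '"Suggestion Requests":8}',
]


def summarize(lines):
    """Collect trial counters from JSON log lines (hypothetical helper)."""
    summary = {}
    for line in lines:
        record = json.loads(line)
        if record.get("msg") == "Statistics":
            summary["completed"] = record["completedCount"]
            summary["active"] = record["activeCount"]
            summary["required_active"] = record["requiredActiveCount"]
        elif record.get("msg") == "GetOrCreateSuggestion":
            summary["suggestion_requests"] = record["Suggestion Requests"]
    return summary


summary = summarize(log_lines)
# Assumption: trials still owed = requested suggestions - completed - active.
summary["outstanding"] = (
    summary["suggestion_requests"] - summary["completed"] - summary["active"]
)
print(summary)
```

With the values logged here, the outstanding count matches the logged `addCount` of 3, which suggests the controller keeps asking for the same three trials on every reconcile.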

What did you expect to happen:
Expected the experiment to complete with the full requested number of trials.

Anything else you would like to add:
Similar to issue: #1168

Environment:

  • Kubeflow version (kfctl version): 1.2 (v1beta1)
  • Kubernetes version: (use kubectl version): 1.17
@andreyvelich
Member

Thank you for creating this @midhun1998!
Basically, Katib controller should delete extra Trials once you change parallelTrialCount: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller.go#L296-L305.

Did you change these params during the Experiment run or once you submitted the Experiment?

@midhun1998
Member Author

midhun1998 commented Mar 23, 2021

Thanks for replying, @andreyvelich. No, nothing was changed once the experiment had started. These values were configured before the run. Also, for the same experiment, when maxTrialCount was set to 10 and parallelTrialCount was set to 5, the suggestion had only 8 values, and since 10 were requested the experiment never finished. I see that "Delete Trials" never got printed in the controller logs. Can you confirm whether it's a bug?

@andreyvelich
Member

@midhun1998 Can you share the YAML for your examples, please? And which version of Katib did you install?

@midhun1998
Member Author

midhun1998 commented Mar 25, 2021

@andreyvelich Unfortunately, I'm not allowed to share the image that was used in the trial. The Katib version installed was 0.10.0.

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: katib-d7c64e16-54fe-47ab-b744-121d62c49b40
  namespace: admin
spec:
  algorithm:
    algorithmName: grid
  maxFailedTrialCount: 2
  maxTrialCount: 8
  metricsCollectorSpec:
    collector:
      kind: StdOut
    source:
      filter:
        metricsFormat:
        - (loss)\s*:\s*((-?\d+)(\.\d+)?)
  objective:
    goal: 0.001
    metricStrategies:
    - name: loss
      value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 5
  parameters:
  - feasibleSpace:
      list:
      - "100"
    name: epochs
    parameterType: discrete
  - feasibleSpace:
      list:
      - "0.01"
    name: lr
    parameterType: discrete
  - feasibleSpace:
      list:
      - "3"
      - "5"
    name: num_layers
    parameterType: discrete
  - feasibleSpace:
      list:
      - "10"
      - "80"
    name: num_cells
    parameterType: discrete
  - feasibleSpace:
      list:
      - "32"
      - "64"
    name: batch_size
    parameterType: discrete
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: tensorflow
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Number of epoch
      name: epochs
      reference: epochs
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Batch size
      name: batchSize
      reference: batch_size
    - description: Number of cells in each layer
      name: numCells
      reference: num_cells
    - description: Number of layers
      name: numLayers
      reference: num_layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - python /opt/training.py --epochs=${trialParameters.epochs}
                --num_layers=${trialParameters.numLayers} --lr=${trialParameters.learningRate}
                --num_cells=${trialParameters.numCells} --batch_size=${trialParameters.batchSize}
              command:
              - sh
              - -c
              image: IMAGE
              name: tensorflow
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2021-03-19T06:27:36Z"
    lastUpdateTime: "2021-03-19T06:27:36Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-03-19T06:28:07Z"
    lastUpdateTime: "2021-03-19T06:28:07Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: katib-d7c64e16-54fe-47ab-b744-121d62c49b40-2w8qqh5s
    observation:
      metrics:
      - latest: "4.283613390922547"
        max: "4.283613390922547"
        min: "4.283613390922547"
        name: loss
    parameterAssignments:
    - name: epochs
      value: "100"
    - name: lr
      value: "0.01"
    - name: num_layers
      value: "3"
    - name: num_cells
      value: "100"
    - name: batch_size
      value: "32"
  startTime: "2021-03-19T06:27:36Z"
  succeededTrialList:
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-2w8qqh5s
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-cvsgbstl
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-f8t8wtss
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-r28rcjfq
  - katib-d7c64e16-54fe-47ab-b744-121d62c49b40-vjvpgtcz
  trials: 5
  trialsSucceeded: 5

@andreyvelich
Member

@midhun1998 Thank you for sharing the example.
This might be a case where the Grid suggestion produces Trials that the controller can't reconcile, since the number of parameter combinations in your search space is exactly equal to maxTrialCount.

Please, can you share the logs from the Grid suggestion pod and verify which Trials were not created?
From the Suggestion pod you should be able to see all Trials which are produced by the Grid Suggestion.
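To make the size of the search space concrete, here is a small sketch that enumerates the combinations from the `parameters` section of the YAML above. It assumes a plain exhaustive Cartesian product and is not Katib's actual grid implementation; `grid_combinations` is a hypothetical helper.

```python
from itertools import product

# Feasible values copied from the Experiment's `parameters` section.
search_space = {
    "epochs": ["100"],
    "lr": ["0.01"],
    "num_layers": ["3", "5"],
    "num_cells": ["10", "80"],
    "batch_size": ["32", "64"],
}


def grid_combinations(space):
    """Return every combination of the discrete values as a list of dicts."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]


combos = grid_combinations(search_space)
grid_size = len(combos)  # 1 * 1 * 2 * 2 * 2

# Compare against the two maxTrialCount values mentioned in this thread.
for max_trial_count in (8, 10):
    shortfall = max(0, max_trial_count - grid_size)
    print(f"maxTrialCount={max_trial_count} grid_size={grid_size} shortfall={shortfall}")
```

Under this assumption the grid holds 8 distinct combinations, so a request for 8 trials exhausts it exactly and a request for 10 can never be satisfied, which is consistent with the behaviour reported above.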

@midhun1998
Member Author

@andreyvelich The Katib controller pod logs were attached in the first comment of this issue. Unfortunately, the suggestion pod couldn't be found, as it had been cleaned up by Kubernetes.

@andreyvelich
Member

Could you please run your Experiment again and check the logs from both the Suggestion and the Katib controller?

@midhun1998
Member Author

Sure

@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jul 8, 2021
@stale

stale bot commented Jul 29, 2021

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Jul 29, 2021