Suggestion stuck for HyperBand when len(response.ParameterAssignments) < requestNum #1168

Closed
czheng94 opened this issue Apr 24, 2020 · 5 comments

czheng94 commented Apr 24, 2020

/kind bug

Suggestion should support cases where len(response.ParameterAssignments) < requestNum

What steps did you take and what happened:

  1. Run the following hyperband example:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  name: hyperband-example
spec:
  parallelTrialCount: 9
  maxTrialCount: 81
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
      - name: "resource_name"
        value: "--num-epochs"
      - name: "eta"
        value: "3"
      - name: "r_l"
        value: "9"
  maxFailedTrialCount: 9
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
    - name: --num-epochs
      parameterType: int
      feasibleSpace:
        min: "20"
        max: "20"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                containers:
                - name: {{.Trial}}
                  image: docker.io/kubeflowkatib/mxnet-mnist
                  command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                  resources:
                    requests:
                      cpu: 1
                      memory: "4Gi"
                      nvidia.com/gpu: 0
                    limits:
                      cpu: 1
                      memory: "4Gi"
                      nvidia.com/gpu: 0
                restartPolicy: Never

After the first 9 trials succeed, spec.requests in the Suggestion is increased by 9 (to 18). However, the Suggestion then gets stuck in this state:

$ kubectl get suggestion
NAME                TYPE      STATUS   REQUESTED   ASSIGNED   AGE
hyperband-example   Running   True     18          9          142m

This is because the HyperBand algorithm can generate only 3 suggestions at this point: the current bracket has only 3 candidates, and it cannot move on to the next bracket before getting results from these 3 candidates. That is fewer than the number requested from the suggestion service, and according to the implementation below, the controller returns an error.

if len(response.ParameterAssignments) != requestNum {
	err := fmt.Errorf("The response contains unexpected trials")
	logger.Error(err, "The response contains unexpected trials", "requestNum", requestNum, "response", response)
	return err
}
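To make the 9 -> 3 arithmetic above concrete, here is a minimal sketch of textbook successive halving within one bracket. It is not Katib's HyperBand code; `rungSizes` and `available` are hypothetical names, and the starting size of 9 is taken from the first batch of trials in this issue.

```go
package main

import "fmt"

// rungSizes returns how many candidates are evaluated in each round of one
// successive-halving bracket: n, n/eta, n/eta^2, ... until fewer than 1 remain.
// Hypothetical helper for illustration; not part of Katib.
func rungSizes(n, eta int) []int {
	if n < 1 || eta < 2 {
		return nil
	}
	sizes := []int{}
	for n >= 1 {
		sizes = append(sizes, n)
		n /= eta // integer division keeps the floor of n/eta
	}
	return sizes
}

// available returns how many assignments the algorithm can actually hand out
// when the controller asks for requestNum but the current round has only
// rung candidates.
func available(requestNum, rung int) int {
	if rung < requestNum {
		return rung
	}
	return requestNum
}

func main() {
	sizes := rungSizes(9, 3)
	fmt.Println(sizes)                  // [9 3 1]
	fmt.Println(available(9, sizes[1])) // 3
}
```

With eta = 3, the second round holds 3 candidates, so a request for 9 more assignments can only be answered with 3.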

You will see the following logs from suggestion controller:

{"level":"error","ts":1587761510.246165,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"suggestion-controller","request":"user1/hyperband-example","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

I believe we should always allow len(response.ParameterAssignments) <= requestNum, especially for HyperBand and Grid search, where the number of parameter assignments is fixed by a certain heuristic.

What did you expect to happen:

The suggestion controller should write the 3 parameter assignments returned by the suggestion service into its status, instead of throwing an error.
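A rough sketch of what the relaxed check could look like, assuming the controller only needs to reject responses that contain more assignments than it asked for. `checkAssignmentCount` and `errTooManyAssignments` are made-up names for illustration, not the actual Katib code or the fix that was eventually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyAssignments is a hypothetical sentinel error for this sketch.
var errTooManyAssignments = errors.New("the response contains more trials than requested")

// checkAssignmentCount accepts any response with at most requestNum
// assignments, so an algorithm that can only produce a few candidates right
// now (e.g. 3 out of 9) is not treated as a failure.
func checkAssignmentCount(got, requestNum int) error {
	if got > requestNum {
		return fmt.Errorf("%w: got %d, requested %d", errTooManyAssignments, got, requestNum)
	}
	return nil
}

func main() {
	fmt.Println(checkAssignmentCount(3, 9))  // <nil>
	fmt.Println(checkAssignmentCount(10, 9)) // the response contains more trials than requested: got 10, requested 9
}
```

The caller would still need to record only the assignments it actually received, so that the assigned count in the status reflects the shorter response.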

Anything else you would like to add:

As addressed above

Environment:

  • Kubeflow version: v1alpha3
  • Minikube version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.58

@andreyvelich
Member

Thank you for your issue @czheng94.
Yes, the hyperband suggestion can't generate new Trials after the first run. @gaocegege Is it possible for hyperband to create the requested number of Trials in the second getSuggestion call?

@terrykong

Perhaps this should be a separate issue, but I'm curious if it's possible to decouple maxTrialCount, r_l, and parallelTrialCount? In particular, I have a use case where I don't want to make parallelTrialCount too high, but I still want a large initial bracket (r_l).

Part of my issue is that setting parallelTrialCount too high might not be possible, since I don't have enough GPUs.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 1.00

@andreyvelich
Member

Ref issue: #1389.
