It turned out that the created HPA was not configured with a correct object reference. The reference had only the Kind PyTorchJob but no APIVersion (which should be kubeflow.org/v1). The HPA reported:

```yaml
autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2022-06-30T11:55:04Z","reason":"FailedGetScale","message":"the HPA controller was unable to get the target''s current scale: no matches for kind \"PyTorchJob\" in group \"\""}]'
```
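For reference, a sketch of what the generated HPA's scaleTargetRef would need to look like for the controller to resolve the scale subresource. The job name here is illustrative, and the apiVersion assumes the v1 PyTorchJob API:

```yaml
# Hypothetical corrected scaleTargetRef (job name is an assumption)
scaleTargetRef:
  apiVersion: kubeflow.org/v1   # missing in the generated HPA, so the group resolves to ""
  kind: PyTorchJob
  name: elastic-example-imagenet
```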
Hi,
We were testing PyTorchJob with ElasticPolicy and HPA configuration on Kubeflow, but it seems that the training-operator cannot create the HPA for the job.
Environment: K8s 1.20.11, Kubeflow 1.5, training-operator 1.4
I used the ImageNet elastic training example in https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and added metrics configuration to enable HPA. The full yaml:
I took a look at the code and found that the operator does not set the APIVersion for the ScaleTargetRef: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/hpa.go#L76
The autoscalingv2beta2.CrossVersionObjectReference struct does have an APIVersion field. I'm not sure if this was due to my configuration errors or a bug.
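To make the diagnosis concrete, here is a minimal, self-contained sketch of constructing the reference with APIVersion populated. The struct below only mirrors the relevant fields of autoscalingv2beta2.CrossVersionObjectReference (it is not the real Kubernetes type), and the helper name buildRef is mine, not the operator's:

```go
package main

import "fmt"

// CrossVersionObjectReference mirrors the relevant fields of
// autoscaling/v2beta2.CrossVersionObjectReference (sketch only).
type CrossVersionObjectReference struct {
	Kind       string
	Name       string
	APIVersion string
}

// buildRef constructs a reference that the HPA controller can resolve.
// Leaving APIVersion empty reproduces the observed failure:
// `no matches for kind "PyTorchJob" in group ""`.
func buildRef(name string) CrossVersionObjectReference {
	return CrossVersionObjectReference{
		Kind:       "PyTorchJob",
		Name:       name,
		APIVersion: "kubeflow.org/v1", // the field hpa.go currently omits
	}
}

func main() {
	ref := buildRef("elastic-example-imagenet") // job name is illustrative
	fmt.Println(ref.APIVersion)                 // prints "kubeflow.org/v1"
}
```

The point of the sketch: because the field exists on the struct, the fix would presumably be a one-line change where the operator builds the ScaleTargetRef.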
Please let me know if you need further information.
Thanks,
Hanyu