In the file docker/paddle_k8s, the env variable TRAINERS is defined when the trainer's yaml is created, which means it is static throughout the Pod's life cycle.
This value is used to judge whether the job has reached the max error limit (line 29, check_failed_cnt). That will not work once auto-scaling is in place; the number has to be fetched or computed dynamically.
I think this check exists to handle the case in Paddle Cloud where a user submits a job with a wrong config that keeps failing. Since fault-tolerant mode allows pods to fail, a dynamic value, as @putcn said, could be one solution. Another option is a larger constant value, or something configurable from the yaml.
> I think this is used to fix the problem that in Paddle Cloud when the user submits a job with wrong config that keeps failing.
It's true. It becomes critical when a training job has over 50% of its workers dead, or maybe 30%. So probably the best solution is to get the total count of workers dynamically, and the fatal limit should be configured in the yaml.
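A minimal sketch of what that could look like, assuming the current trainer count is looked up at check time rather than read from the static TRAINERS env. The names `failure_limit`, `should_abort`, and `max_failed_ratio` are hypothetical and are not taken from paddle_k8s; how the live trainer count is fetched is left out on purpose.

```python
import math


def failure_limit(total_trainers: int, max_failed_ratio: float) -> int:
    """Return the largest number of failed trainers that is still tolerated.

    total_trainers is assumed to be the *current* trainer count, fetched
    when the check runs. max_failed_ratio is a hypothetical setting that
    would come from the job yaml (e.g. 0.5 for "50% of workers").
    """
    if total_trainers < 0:
        raise ValueError("total_trainers must be non-negative")
    if not 0.0 <= max_failed_ratio <= 1.0:
        raise ValueError("max_failed_ratio must be between 0 and 1")
    return math.floor(total_trainers * max_failed_ratio)


def should_abort(failed: int, total_trainers: int, max_failed_ratio: float = 0.5) -> bool:
    """Return True when the failed count exceeds the tolerated limit."""
    return failed > failure_limit(total_trainers, max_failed_ratio)
```

With 4 trainers and a ratio of 0.5, 3 failures would abort the job; after scaling up to 10 trainers, the same 3 failures would be tolerated, which is the behaviour a static count cannot give.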
Should we move the check_failed_cnt logic from paddle_k8s into the autoscale controller, since it is managing the total count of trainers?
The "checking fail count" part is kind of blocking the scaling. I see pods fail for reasons not related to trainer.py, such as failing to allocate memory:
OSError: [Errno 12] Cannot allocate memory: '/workspace/recommender_system'
job returned 1...setting pod return message...