Env var TRAINERS needs to be dynamically fetched for auto-scaling #411

Open
putcn opened this issue Oct 18, 2017 · 4 comments

putcn commented Oct 18, 2017

In docker/paddle_k8s, the env var TRAINERS is set when the trainer's yaml is created, which means it stays static through the Pod's life cycle.
This value is used to judge whether the job has reached the max error limit (line 29, check_failed_cnt). That will not work once auto-scaling is in place; the number has to be dynamically fetched/computed.
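For context, here is a minimal sketch of what a dynamic lookup could look like. It assumes kubectl is usable from inside the pod and that trainer pods carry a paddle-job=${PADDLE_JOB_NAME} label; both are assumptions for illustration, not what paddle_k8s does today.

```bash
# Hypothetical helpers: count trainer pods for this job at call time
# instead of relying on the static TRAINERS env var baked into the yaml.

# Total trainer pods that currently exist for this job.
get_trainer_count() {
  kubectl get pods -l paddle-job="${PADDLE_JOB_NAME}" --no-headers | wc -l
}

# Trainer pods that have already entered the Failed phase.
get_failed_count() {
  kubectl get pods -l paddle-job="${PADDLE_JOB_NAME}" \
    -o jsonpath='{.items[*].status.phase}' | tr ' ' '\n' | grep -c Failed
}
```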

putcn commented Oct 18, 2017

Currently, @helinwang and I don't have a clear idea of how or to what this number should be set. Maybe the max-instance count * some coefficient?
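Purely as an illustration of that idea (MAX_TRAINERS and the 1/2 coefficient are made-up names/values, not existing settings):

```bash
# Derive the failure limit from the autoscaler's upper bound on trainers
# rather than from the momentary trainer count; 1/2 is an arbitrary example.
FAIL_LIMIT=$(( MAX_TRAINERS / 2 ))
```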

@helinwang
Collaborator

I think this is used to handle the case in Paddle Cloud where the user submits a job with a wrong config that keeps failing. Since fault-tolerant mode allows pods to fail, a dynamic value, as @putcn said, could be one solution. Another option is to use a larger constant value, or something configurable from the yaml.

@typhoonzero
Collaborator

I think this is used to handle the case in Paddle Cloud where the user submits a job with a wrong config that keeps failing.

It's true. It becomes critical when the training job has over 50% of its workers dead, or maybe 30%. So probably the best solution is to dynamically get the total count of workers, and have the fatal limit configured in the yaml.
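As a sketch only: the yaml could expose a configurable threshold as an env var (FATAL_FAIL_RATIO below is a hypothetical name), and the check would compare it against counts fetched at call time instead of the static TRAINERS value, e.g. reusing the helpers sketched above:

```bash
# Hypothetical check: fail the job only when the live failure ratio exceeds
# a yaml-configured threshold (FATAL_FAIL_RATIO, in percent, default 50).
check_failed_cnt() {
  local total failed limit
  total=$(get_trainer_count)
  failed=$(get_failed_count)
  limit=$(( total * ${FATAL_FAIL_RATIO:-50} / 100 ))
  if [ "${failed}" -gt "${limit}" ]; then
    echo "${failed}/${total} trainers failed (limit ${limit}), marking job as fatal"
    exit 1
  fi
}
```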

putcn commented Oct 24, 2017

Should we move the logic of check_failed_cnt from paddle_k8s to the autoscale controller, since the controller is the one managing the total count of trainers?
The "checking fail count" part is kind of blocking the scaling. I've seen pods fail for reasons not related to trainer.py, like failing to allocate memory:

```
OSError: [Errno 12] Cannot allocate memory: '/workspace/recommender_system'
job returned 1...setting pod return message...
```
