Env var TRAINERS needs to be dynamically fetched for auto-scaling #411

Open
putcn opened this issue Oct 18, 2017 · 4 comments

putcn commented Oct 18, 2017

In docker/paddle_k8s, the env var TRAINERS is set when the trainer's yaml is created, which means it stays static through the Pod's life cycle.
This value is used to judge whether the job has reached the max error limit (line 29, check_failed_cnt). That will not work once auto-scaling is in place; the number has to be dynamically fetched/computed.
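For context, here is a minimal sketch of what a dynamic lookup could look like. It assumes kubectl is usable from inside the pod and that trainer pods carry a paddle-job=${PADDLE_JOB_NAME} label; both are assumptions for illustration, not what paddle_k8s does today.

```bash
# Hypothetical helpers: count trainer pods for this job at call time
# instead of relying on the static TRAINERS env var baked into the yaml.

# Total trainer pods that currently exist for this job.
get_trainer_count() {
  kubectl get pods -l paddle-job="${PADDLE_JOB_NAME}" --no-headers | wc -l
}

# Trainer pods that have already entered the Failed phase.
get_failed_count() {
  kubectl get pods -l paddle-job="${PADDLE_JOB_NAME}" \
    -o jsonpath='{.items[*].status.phase}' | tr ' ' '\n' | grep -c Failed
}
```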

putcn commented Oct 18, 2017

Currently, @helinwang and I don't have a clear idea of how or to what this number should be set. Maybe the max-instance count * some coefficient?
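Purely as an illustration of that idea (MAX_TRAINERS and the 1/2 coefficient are made-up names/values, not existing settings):

```bash
# Derive the failure limit from the autoscaler's upper bound on trainers
# rather than from the momentary trainer count; 1/2 is an arbitrary example.
FAIL_LIMIT=$(( MAX_TRAINERS / 2 ))
```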

@helinwang
Collaborator

I think this is used to handle the case in Paddle Cloud where the user submits a job with a wrong config that keeps failing. Since fault-tolerant mode allows pods to fail, a dynamic value, as @putcn said, could be one solution. Another option is to use a larger constant value, or something configurable from the yaml.

@typhoonzero
Collaborator

I think this is used to handle the case in Paddle Cloud where the user submits a job with a wrong config that keeps failing.

It's true. It becomes critical when the training job has over 50% of its workers dead, or maybe 30%. So probably the best solution is to dynamically get the total count of workers, and have the fatal limit configured in the yaml.
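As a sketch only: the yaml could expose a configurable threshold as an env var (FATAL_FAIL_RATIO below is a hypothetical name), and the check would compare it against counts fetched at call time instead of the static TRAINERS value, e.g. reusing the helpers sketched above:

```bash
# Hypothetical check: fail the job only when the live failure ratio exceeds
# a yaml-configured threshold (FATAL_FAIL_RATIO, in percent, default 50).
check_failed_cnt() {
  local total failed limit
  total=$(get_trainer_count)
  failed=$(get_failed_count)
  limit=$(( total * ${FATAL_FAIL_RATIO:-50} / 100 ))
  if [ "${failed}" -gt "${limit}" ]; then
    echo "${failed}/${total} trainers failed (limit ${limit}), marking job as fatal"
    exit 1
  fi
}
```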

putcn commented Oct 24, 2017

Should we move the logic of check_failed_cnt from paddle_k8s to the autoscale controller, since the controller is the one managing the total count of trainers?
The "checking fail count" part is kind of blocking the scaling. I've seen pods fail for reasons not related to trainer.py, like failing to allocate memory:

```
OSError: [Errno 12] Cannot allocate memory: '/workspace/recommender_system'
job returned 1...setting pod return message...
```
