[WIP]Autoscaling Controller #385

Merged 72 commits on Oct 21, 2017
Changes from 1 commit
5b95916
publish files
typhoonzero Sep 12, 2017
04ab2f5
publish files
typhoonzero Sep 12, 2017
63f1c95
fix travis error
typhoonzero Sep 12, 2017
f227e48
Merge branch 'develop' of https://github.com/PaddlePaddle/cloud into …
typhoonzero Sep 19, 2017
34bbe7a
Merge branch 'develop' of https://github.com/PaddlePaddle/cloud into …
typhoonzero Sep 30, 2017
bb23b77
Merge branch 'develop' of https://github.com/PaddlePaddle/cloud into …
typhoonzero Oct 10, 2017
8f602e7
WIP
typhoonzero Oct 10, 2017
453986e
WIP
helinwang Oct 10, 2017
f05985d
change folder structure: move controller/* to controller/k8s
helinwang Oct 10, 2017
1e99ba1
move operator/* to controller/
helinwang Oct 10, 2017
a72a6bb
add cluster abstraction
helinwang Oct 11, 2017
ed564fe
improve cluster interface
helinwang Oct 11, 2017
e10df61
rename k8s package name
helinwang Oct 11, 2017
5e5ceb9
rename Controller to Autoscaler
helinwang Oct 11, 2017
8efa7d8
refine naming and structure
typhoonzero Oct 11, 2017
0bd78c4
have crash bug
typhoonzero Oct 11, 2017
3199e78
fix glog flag duplicate
typhoonzero Oct 11, 2017
0b4deba
adjust comments
helinwang Oct 11, 2017
9af94cf
event fetch ok
typhoonzero Oct 12, 2017
50cfe7f
update
typhoonzero Oct 12, 2017
c4c7208
autoscale function
typhoonzero Oct 13, 2017
d65a60a
update
typhoonzero Oct 13, 2017
9bda7fd
not tested scaling
typhoonzero Oct 13, 2017
0253fc0
improvements
helinwang Oct 14, 2017
2461395
use Go idiomatic constants
helinwang Oct 14, 2017
4860c27
Merge branch 'develop' of https://github.com/PaddlePaddle/cloud into …
typhoonzero Oct 16, 2017
936053f
update
typhoonzero Oct 16, 2017
e75e8b7
Merge branch 'controller' of https://github.com/PaddlePaddle/cloud in…
typhoonzero Oct 16, 2017
7575bdd
polish and add TODO
helinwang Oct 16, 2017
241fad0
use channel for autoscaler event handling to avoid using mutex
helinwang Oct 16, 2017
22c161e
remove TODO comment that is done.
helinwang Oct 16, 2017
1c8fa5d
rename autoscaler event handler
helinwang Oct 16, 2017
ba1b17f
make all tests pass.
helinwang Oct 16, 2017
587a266
fix build
helinwang Oct 16, 2017
75407df
try fix travis build
helinwang Oct 17, 2017
1f76064
try fix travis build
helinwang Oct 17, 2017
28dd711
adding testcase, still need test
typhoonzero Oct 17, 2017
4640b59
fix test case
typhoonzero Oct 17, 2017
526ea80
scale up: consider both GPU and CPU constraint. And add comments
helinwang Oct 17, 2017
27f4866
Simply Cluster interface, update scaling algorithm.
helinwang Oct 17, 2017
11ec9dd
Restructure controller and autoscalar packages
helinwang Oct 17, 2017
d78b11e
fix typo in comment
helinwang Oct 17, 2017
9da79b2
Add unit test for scaleDryRun and scaleAllDryRun
helinwang Oct 18, 2017
69675b4
code refine
typhoonzero Oct 18, 2017
1cbca13
add mnist example
typhoonzero Oct 18, 2017
76265b3
Add missing go files, fix unit test
helinwang Oct 18, 2017
9b5da6f
move controller
typhoonzero Oct 19, 2017
6feb1ba
Fix SyncResource returning decreasing free resource over time.
helinwang Oct 19, 2017
3143d00
Improve autoscaling documentation, change k8s config alway pull image.
helinwang Oct 19, 2017
d67d37a
merge1
typhoonzero Oct 19, 2017
1523923
Fix crash by make map
helinwang Oct 19, 2017
65a24a4
merge2
typhoonzero Oct 19, 2017
f59c750
Merge branch 'controller' of https://github.com/PaddlePaddle/cloud in…
typhoonzero Oct 19, 2017
bbac308
refine cluster.go and update
typhoonzero Oct 19, 2017
6a6811f
add cfs and utils
typhoonzero Oct 19, 2017
22bdcd7
fix glide nested vendor
typhoonzero Oct 19, 2017
6053c2d
fix ci
typhoonzero Oct 19, 2017
45a3dcb
add scale down
typhoonzero Oct 19, 2017
d67ade4
add mnist ft demo
typhoonzero Oct 19, 2017
d0902d9
Rename method, avoid unnecessarily passing pointer, refactor unit test
helinwang Oct 19, 2017
e9f6339
Add InitContainers into cluster resource utilization calculation, pol…
helinwang Oct 19, 2017
01b3411
Get the lastest TrainerJob before updating it, with retry.
helinwang Oct 20, 2017
6b55869
Support TrainingJob update.
helinwang Oct 20, 2017
ee6a6fc
Add TODO for fixing incorrect training job pod count.
helinwang Oct 20, 2017
b11fab3
fix scale before running
typhoonzero Oct 20, 2017
b855f38
Rename JobRunning to JobPods
helinwang Oct 20, 2017
c7a6fa2
Change imagePullPolicy to Always
helinwang Oct 20, 2017
919d323
Update tutorial
helinwang Oct 20, 2017
b20826f
Update tutorial
helinwang Oct 20, 2017
e225843
Update tutorial
helinwang Oct 20, 2017
070af1e
Update autoscale.md
helinwang Oct 20, 2017
a2d1adc
Temporately change trainer docker image name
helinwang Oct 20, 2017
20 changes: 15 additions & 5 deletions go/autoscaler/autoscaler.go
@@ -59,8 +59,8 @@ type Cluster interface {
 	// UpdateTrainerJob updates the trainer job spec.
 	UpdateTrainerJob(job *batchv1.Job) error

-	// IsJobAllRunning check if all the pods are in "Running" status.
-	IsJobAllRunning(job *paddlejob.TrainingJob) bool
+	// JobRunning check if all the pods are in "Running" status.
+	JobRunning(job *paddlejob.TrainingJob) (bool, error)
 }

 type job struct {
@@ -362,9 +362,19 @@ func (a *Autoscaler) Monitor() {
 		continue
 	}
 	j.TrainerJob = tj
-	// scale jobs only when all pods are in running status.
-	// pods are pending/starting/terminating if the job is just submited or just scaled up/down.
-	if a.cluster.IsJobAllRunning(j.Config) {
+
+	// Scale jobs only when it's running (defined
+	// by all pods are in the "running"
+	// status). Pods are
+	// pending/starting/terminating if the job is
+	// just submited or just scaled up/down.
+	running, err := a.cluster.JobRunning(j.Config)
+	if err != nil {
+		log.Errorln("Get if job is running failed:", err)
+		continue
+	}
+
+	if running {
 		js = append(js, j)
 	}
 }