auto-scale experiment plan #419
doc/autoscale_experiment.md
Outdated
## Before Starting The Experiment

- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
- We have 240 CPU cores and 80 GPU cards in total.
There is no need to state how many resources we currently have. We could add a table and fill in the final experiment environment there.
We don't have to write the real resource numbers; they are only used here for calculating the experiment data.
doc/autoscale_experiment.md
Outdated
- Start a training job (jobA) with 2~100 trainer instances (2 pservers, 1 master); the trainers will be scaled immediately to use the maximum free resources in the cluster.
- Start another training job (jobB) with 50~100 trainer instances (2 pservers, 1 master); there are not enough resources, so the job will wait until enough resources become available.

### With auto-scaling Job
The test cases can be split into several groups, each covering a different scenario:
- Local environment test
- Offline cluster
- Offline cluster with mixed CPU/GPU scheduling (maximize GPU utilization)
- Online and offline workloads co-located on one cluster (online services have high priority)
For the high-priority online service case, we can start some Nginx services with HPA, query them from outside with a load test, and then verify that the number of Nginx pods increases while the number of ML pods decreases. @helinwang
My understanding is that we don't have a concept of priority yet. Increasing the number of Nginx pods would have to be done manually; right now we can't automatically increase the number of Nginx pods when the load goes up, can we?
Can we use HPA to automatically control the number of Nginx instances?
HPA can indeed control the number of Nginx instances automatically, and it is fairly easy to set up. Also, whenever the load on Nginx increases, its CPU and memory usage will go up and we can detect that, so HPA is not strictly required.
We can use HPA with a CPU threshold to control the number of Nginx instances.
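To make the CPU-threshold idea concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a small tolerance. The function name, the default bounds, and the example numbers are illustrative assumptions, not values from this experiment plan.

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the documented HPA rule (hypothetical helper, not a Kubernetes API).

    current_cpu_percent is the average CPU utilization across the pods,
    target_cpu_percent is the threshold configured on the HPA.
    """
    ratio = current_cpu_percent / target_cpu_percent
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum replica counts.
    return max(min_replicas, min(max_replicas, desired))
```

Under these assumptions, 2 Nginx pods averaging 90% CPU against a 50% target would be scaled to 4 pods, which is the kind of growth the test case would then expect to push the ML trainer count down.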
@typhoonzero Added scenarios 2 and 4. For scenarios 1 and 3, there don't seem to be concrete numbers we could compare; should they just be introductory descriptions?
- At least 4 Kubernetes nodes, each node should have at least 2 GPU cards.
- Dataset prepared as multiple files in the RecordIO format.

## Experiment Metric
Not sure whether this should be called "Metric". How should we express the dimensions being compared?
Or would a table here be clearer?
Added a table under each test case; for the actual experiment results we can add more sample points and present them as charts.
1. Job-A will be scaled down and job-A and job-B will run in the cluster at the same time, and they will use the maximum free resources.

- Experiment metrics
  1. Compare the **CPU utils** with auto-scaling training job and general training job.
Maybe we can add cluster-wide overall CPU / GPU utils.
$ kubectl describe nodes | grep -A 2 -e "^\\s*CPU Requests"
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
3865m (96%) 3600m (90%) 2760Mi (47%) 2770Mi (47%)
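As a rough sketch of how the per-node output above could be folded into one cluster-wide number, the snippet below parses text in the layout shown in this comment and derives an overall CPU request ratio. It assumes the older `kubectl describe nodes` table layout quoted here and CPU requests expressed in millicores; the function names are hypothetical, and note that this measures requested CPU, not actual usage.

```python
import re

# Matches a CPU quantity such as "3865m (96%)": millicores and percent of allocatable.
_CPU_RE = re.compile(r"(\d+)m\s+\((\d+)%\)")


def parse_cpu_requests(describe_output):
    """Return [(requested_millicores, percent_of_allocatable), ...], one entry per node."""
    lines = describe_output.splitlines()
    per_node = []
    for i, line in enumerate(lines):
        if line.strip().startswith("CPU Requests") and i + 2 < len(lines):
            # The values sit two lines below the header, after the dashed separator.
            match = _CPU_RE.search(lines[i + 2])
            if match:
                per_node.append((int(match.group(1)), int(match.group(2))))
    return per_node


def cluster_cpu_request_ratio(per_node):
    """Approximate cluster-wide requested / allocatable CPU from per-node values."""
    usable = [(req, pct) for req, pct in per_node if pct > 0]
    requested = sum(req for req, _ in usable)
    # Allocatable is reconstructed from the rounded percentage, so this is an estimate.
    allocatable = sum(req * 100.0 / pct for req, pct in usable)
    return requested / allocatable if allocatable else 0.0
```

Sampling this ratio at a fixed interval for both the auto-scaling run and the general run would give the data points for the comparison.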
Agree with CPU utils, but maybe there is no difference between the CPU and GPU resource for the auto-scaling feature? How about we only use CPU as the computing resource?
Or we could follow the third scenario that @typhoonzero mentioned in #419 (comment) and test mixed CPU and GPU scheduling, but that feels like it may be outside the scope of the auto-scaling feature.
How about we measure both CPU and GPU utils during the test (it should be just a one-line command) and decide at the end whether to use the GPU utils?
Collecting GPU utils may be a bit more complicated (we need to scan the Pods or read the data from influxDB; the Kubernetes API cannot provide it directly), but we can measure both.
Discussed with @putcn, perhaps we need:
LGTM except one comment
doc/autoscale_experiment.md
Outdated
## Before Starting The Experiment

- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
I think all the demos in the book should be tested!
Done.
@helinwang
LGTM!
Fixed #395
Fixed #413