auto-scale experiment plan #419

Yancey1989 · 2017-10-19T12:46:11Z

Fixed #395
Fixed #413

typhoonzero · 2017-10-19T13:09:56Z

doc/autoscale_experiment.md

+## Before Starting The Experiment
+
+- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
+- We have 240 CPU cores and 80 GPU cards totally.


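There is no need to state how many resources we currently have. We could add a table and fill in the final experiment environment there.

We can leave out the real resource numbers; they are only here for computing the experiment data.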

typhoonzero · 2017-10-19T13:10:23Z

doc/autoscale_experiment.md

+- Start a training job(jobA) with 2~100 trainer instances(2 pservers, 1 master), the trainers will be scaled immediately to use the maximum free resources in the cluster.
+- Start another training job(jobB) with 50~100 trainer instances(2 pservers, 1 master), there is no enough resource, the job will wait for the adequacy of the resource.
+
+### With auto-scaling Job


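The test cases can be split into several groups, each covering a different scenario:

1. Local environment test
2. Offline cluster
3. Offline cluster with mixed CPU/GPU scheduling (maximize GPU utilization)
4. Online and offline workloads co-located in one cluster (online services have high priority)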

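For the high-priority online service case, we can start some Nginx services managed by HPA, query them from outside with a load test, and then verify that the number of Nginx pods increases while the number of ML pods decreases. @helinwang

My understanding is that we do not have a notion of priority yet. Increasing the number of Nginx pods would have to be done manually; we cannot yet scale the Nginx pods automatically as the load grows, can we?

Can we use HPA to control the number of Nginx pods automatically?

HPA can indeed control the number of Nginx pods automatically, and it is fairly easy to configure. Also, whenever the load on Nginx grows, CPU and memory usage both go up and we can observe that, so HPA is not strictly required.

We can set a CPU threshold in HPA to control the number of Nginx pods.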
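To make the CPU-threshold idea concrete, here is a minimal sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. The function name, parameter names, and default bounds are hypothetical, and the sketch leaves out the tolerance band and pod-readiness handling that the real controller applies.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the HPA ratio rule, clamped to [min, max].

    Illustrative only: the names and default bounds are assumptions,
    not values taken from the experiment plan.
    """
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Scale the replica count by how far the observed CPU is from the target.
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))
```

For example, two Nginx pods averaging 90% CPU against a 50% target would be scaled to four pods, which is the kind of growth the load test is expected to trigger.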

@typhoonzero Added scenarios 2 and 4. Scenarios 1 and 3 do not seem to have concrete numbers to compare; or should they just be introductory descriptions?

typhoonzero · 2017-10-19T13:13:15Z

doc/autoscale_experiment.md

+- At least 4 kubernetes nodes, each node should have 2 GPU cards at least.
+- Dataset prepared to multiple files with the RecordIO format.
+
+## Experiment Metric


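Not sure whether this should be called Metric? How do we express the dimensions being compared?

Or would a table be clearer here?

Added a table under each test case. The actual experiment results can use more sampling points and be shown as charts.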

helinwang · 2017-10-20T19:49:38Z

doc/autoscale_experiment.md

+    1. Job-A will be scaled down and job-A and job-B will run in the cluster at the same time, and they will use the maximum free resources.
+
+- Experiment metrics
+    1. Compare the **CPU utils** with auto-scaling training job and general training job.


Maybe we can add cluster-wide overall CPU / GPU utils.

$ kubectl describe nodes | grep -A 2 -e "^\\s*CPU Requests"
  CPU Requests  CPU Limits   Memory Requests  Memory Limits
  ------------  ----------   ---------------  -------------
  3865m (96%)   3600m (90%)  2760Mi (47%)     2770Mi (47%)
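The values row in the output above can be turned into numbers for the experiment tables. The following is a rough sketch that parses such a row; the function name and the dictionary keys are hypothetical. Note that these percentages describe resource requests and limits relative to node capacity, not live usage.

```python
import re


def parse_allocation(values_row):
    """Parse a row such as '3865m (96%) 3600m (90%) 2760Mi (47%) 2770Mi (47%)'.

    Returns a dict keyed by column, each entry holding the raw amount
    string and the integer percentage. Key names are illustrative.
    """
    names = ["cpu_requests", "cpu_limits", "memory_requests", "memory_limits"]
    # Each column looks like '<amount> (<percent>%)'.
    pairs = re.findall(r"(\S+)\s+\((\d+)%\)", values_row)
    if len(pairs) != len(names):
        raise ValueError("expected %d columns, got %d" % (len(names), len(pairs)))
    return {
        name: {"amount": amount, "percent": int(percent)}
        for name, (amount, percent) in zip(names, pairs)
    }
```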

Agree with CPU utils, but maybe there is no difference between the CPU and GPU resource for the auto-scaling feature? How about we only use CPU as the computing resource?

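Or we could follow the third scenario that @typhoonzero mentioned in #419 (comment) and test mixed CPU and GPU scheduling, but that may fall outside the scope of the auto-scaling feature.

How about measuring both CPU and GPU utils during the test (it should just be a one-line command), and deciding at the end whether to use the GPU utils?

Collecting GPU utils may be a bit more involved (we need to scan the Pods or read the data from influxDB, since the Kubernetes API does not expose it directly), but we can measure both.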

putcn · 2017-10-20T23:25:55Z

Just found that kube-dashboard has CPU and memory usage charts, which might be useful in our report.

helinwang · 2017-10-23T19:06:14Z

Discussed with @putcn, perhaps we need to:

- Check in all code for the experiment (easy for people outside the PaddlePaddle team to reproduce, and easy for us to reproduce in the future).
- Collect time series data for plotting graphs (e.g., cluster utilization vs. time).
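As one possible shape for the time series collection, the sketch below samples a utilization reading at a fixed interval and serializes the samples as CSV for later plotting. Everything here is an assumption for illustration: `read_utilization` stands in for whatever the experiment actually uses to read cluster utilization, and the column names are hypothetical.

```python
import csv
import io
import time


def collect_samples(read_utilization, num_samples, interval_seconds=0.0,
                    clock=time.time, sleep=time.sleep):
    """Call read_utilization() num_samples times, pausing between calls.

    clock and sleep are injectable so the sampler can be exercised
    without waiting in real time.
    """
    rows = []
    for i in range(num_samples):
        rows.append((clock(), read_utilization()))
        if i < num_samples - 1:
            sleep(interval_seconds)
    return rows


def to_csv(rows):
    """Serialize (timestamp, utilization) rows as CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "cpu_util_percent"])
    writer.writerows(rows)
    return buf.getvalue()
```

Checking a script like this in next to the experiment code would also serve the reproducibility point, since anyone could regenerate the same CSV and plots.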

typhoonzero

LGTM except one comment

typhoonzero · 2017-10-24T02:46:01Z

doc/autoscale_experiment.md

+
+## Before Starting The Experiment
+
+- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.


I think all the demos in the book should be tested!

Yancey1989 · 2017-10-24T05:43:50Z

@helinwang
Completely agree! Making the experiment easy to reproduce is important.

typhoonzero

LGTM!

auto-scale experiment

ebc8545

Yancey1989 requested review from helinwang, putcn and typhoonzero October 19, 2017 12:46

update

cebfc51

typhoonzero reviewed Oct 19, 2017

update

8e3c74c

Yancey1989 requested a review from wangkuiyi October 20, 2017 08:16

helinwang reviewed Oct 20, 2017

typhoonzero previously approved these changes Oct 24, 2017

update

ae95f2a

Yancey1989 dismissed typhoonzero’s stale review via ae95f2a October 24, 2017 05:37

typhoonzero approved these changes Oct 24, 2017

Yancey1989 merged commit 7bbcd45 into PaddlePaddle:develop Oct 24, 2017

Yancey1989 deleted the experiment_design branch October 24, 2017 06:00
