auto-scale experiment plan #419

Merged: 4 commits, Oct 24, 2017

Collaborator

Yancey1989 commented Oct 19, 2017

Fixed #395
Fixed #413

## Before Starting The Experiment

- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
- We have 240 CPU cores and 80 GPU cards in total.
Collaborator

There is no need to state how many resources we currently have. You could add a table instead and fill in the final experiment environment there.

Collaborator Author

We can leave out the real resource numbers; they are only here for computing the experiment data anyway.

- Start a training job (jobA) with 2~100 trainer instances (2 pservers, 1 master). The trainers will be scaled up immediately to use the maximum free resources in the cluster.
- Start another training job (jobB) with 50~100 trainer instances (2 pservers, 1 master). Because there are not enough resources, the job will wait until sufficient resources become available.

### With auto-scaling Job
Collaborator

The test cases could be split into several groups, each covering a different scenario:

  1. Local environment testing
  2. Offline cluster
  3. Offline cluster with mixed CPU/GPU scheduling (maximizing GPU utilization)
  4. Online services and offline jobs deployed together in one cluster (online services have higher priority)

For the high-priority online service case, we could start some Nginx services managed by HPA, query them from outside with a load test, and then verify that the number of Nginx pods increases while the number of ML pods decreases. @helinwang

Collaborator

gongweibao commented Oct 20, 2017

As I understand it, we do not have a notion of priority yet, so the number of Nginx pods would have to be increased manually. For now we cannot make the Nginx pod count grow automatically as the load increases, can we?

Could we use HPA to control the number of Nginx instances automatically?

Collaborator

helinwang commented Oct 20, 2017

HPA can indeed control the number of Nginx instances automatically, and it is fairly easy to configure. Also, whenever the load on Nginx increases, its CPU and memory usage go up as well, and we can monitor that, so HPA is not strictly required.

Collaborator Author

We can use HPA with a CPU threshold to control the number of Nginx instances.
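
To make the CPU-threshold idea concrete, here is a minimal sketch of the scaling rule described in the Kubernetes HPA documentation: the desired replica count is the current count scaled by the ratio of observed to target CPU utilization, rounded up. The function name, the parameter names, and the replica bounds are illustrative assumptions, and the 0.1 tolerance is taken to be the controller default; none of this is code from the PR.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA rule (illustrative sketch)."""
    ratio = current_cpu_pct / target_cpu_pct
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, wanted))


# Two Nginx pods at 90% CPU against a 50% target scale up to four pods.
print(desired_replicas(2, 90, 50))
```

In the experiment, the Nginx pod count predicted this way could be compared with the count actually observed while the load test is running.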

Collaborator Author

@typhoonzero I added scenarios 2 and 4. For scenarios 1 and 3 there do not seem to be concrete numbers to compare; or should they just be descriptive?

- At least 4 Kubernetes nodes, each with at least 2 GPU cards.
- The dataset is split into multiple files in the RecordIO format.

## Experiment Metric
Collaborator

I'm not sure whether this should be called "Metric". How should the dimensions of the comparison be expressed?

Collaborator Author

Or would a table here be clearer?

Collaborator Author

I added a table under each test case. For the actual experiment results, we can add more sampling points and present them as charts.

1. Job-A will be scaled down so that job-A and job-B run in the cluster at the same time, together using as much of the free resources as possible.
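
The expected outcome above can be illustrated with a toy allocation model. This is only a sketch under stated assumptions, not the actual autoscaler logic: it assumes every trainer needs one identical resource slot, admits each job only if its minimum trainer count fits, and then hands out the remaining slots round-robin. The function name and the cluster capacities are hypothetical; the trainer ranges (2~100 and 50~100) come from the plan.

```python
def allocate_trainers(capacity, jobs):
    """Toy model: jobs is a list of (name, min_trainers, max_trainers)."""
    alloc = {}
    free = capacity
    # First pass: admit jobs in submission order if their minimum fits.
    for name, lo, hi in jobs:
        if lo <= free:
            alloc[name] = lo
            free -= lo
    # Second pass: give out the remaining slots one at a time, round-robin,
    # until the cluster is full or every admitted job reached its maximum.
    limits = {name: hi for name, lo, hi in jobs}
    progress = True
    while free > 0 and progress:
        progress = False
        for name in alloc:
            if free > 0 and alloc[name] < limits[name]:
                alloc[name] += 1
                free -= 1
                progress = True
    return alloc


jobs = [("jobA", 2, 100), ("jobB", 50, 100)]
print(allocate_trainers(120, jobs))  # both jobs run side by side
print(allocate_trainers(40, jobs))   # jobB's minimum does not fit, so it waits
```

With a hypothetical capacity of 120 slots both jobs share the cluster, while with 40 slots only jobA runs, which mirrors the case where jobB waits for resources.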

- Experiment metrics
1. Compare the **CPU utils** of the auto-scaling training job and the general training job.
Collaborator

helinwang commented Oct 20, 2017

Maybe we could add cluster-wide overall CPU / GPU utils.

$ kubectl describe nodes | grep -A 2 -e "^\\s*CPU Requests"
  CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ------------	----------	---------------	-------------
  3865m (96%)	3600m (90%)	2760Mi (47%)	2770Mi (47%)
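
A cluster-wide figure could be derived from that output by parsing the percentage columns for every node and averaging them. The sketch below assumes the four-column layout shown in the comment above (other `kubectl` versions may print the allocated resources differently); the function names are hypothetical.

```python
import re

# Matches values such as "3865m (96%)" or "2760Mi (47%)".
_VALUE = re.compile(r"(\S+)\s+\((\d+)%\)")

COLUMNS = ("cpu_requests", "cpu_limits", "memory_requests", "memory_limits")


def parse_allocated(describe_output):
    """Return one dict of percentages per node found in the output."""
    nodes = []
    for line in describe_output.splitlines():
        pcts = [int(pct) for _, pct in _VALUE.findall(line)]
        # Only value rows carry exactly four "(NN%)" entries.
        if len(pcts) == len(COLUMNS):
            nodes.append(dict(zip(COLUMNS, pcts)))
    return nodes


def mean_cpu_request_pct(nodes):
    """Average CPU request percentage over all parsed nodes."""
    return sum(node["cpu_requests"] for node in nodes) / len(nodes)


sample = (
    "  CPU Requests\tCPU Limits\tMemory Requests\tMemory Limits\n"
    "  ------------\t----------\t---------------\t-------------\n"
    "  3865m (96%)\t3600m (90%)\t2760Mi (47%)\t2770Mi (47%)\n"
)
print(parse_allocated(sample))
```

Note that these are request and limit percentages as reported by the scheduler, so they describe how much of each node is reserved rather than measured usage.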

Collaborator Author

I agree on CPU utils, but perhaps CPU and GPU resources make no difference to the auto-scaling feature? How about we use only CPU as the computing resource?

Collaborator Author

Alternatively, we could follow the third scenario that @typhoonzero mentioned in #419 (comment) and test mixed CPU and GPU scheduling, but I feel that may fall outside the scope of the auto-scaling feature.

Collaborator

How about measuring both CPU and GPU utils during the experiment (it should be just a one-line command) and deciding at the end whether to use the GPU utils?

Collaborator Author

Collecting GPU utils may be a bit more complicated (we need to scan the Pods or read the data from InfluxDB, since the Kubernetes API does not expose it directly), but we can measure both.
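
The Pod-scanning option might look roughly like the sketch below. It assumes the input is the parsed JSON of `kubectl get pods --all-namespaces -o json` and sums the GPU limits of running Pods, which gives the number of GPUs allocated rather than a measured utilization. The function name is hypothetical, and the resource name used in the example, `alpha.kubernetes.io/nvidia-gpu`, is an assumption about the cluster's Kubernetes version and should be adjusted to whatever name the cluster exposes.

```python
def gpu_requested(pod_list, resource_name):
    """Sum the GPU limits of all running Pods in a parsed Pod list."""
    total = 0
    for pod in pod_list.get("items", []):
        # Pending or finished Pods do not hold GPUs.
        if pod.get("status", {}).get("phase") != "Running":
            continue
        for container in pod.get("spec", {}).get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            total += int(limits.get(resource_name, 0))
    return total


# Hypothetical Pod list with one running trainer that holds two GPUs.
pods = {
    "items": [
        {
            "status": {"phase": "Running"},
            "spec": {"containers": [
                {"resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": "2"}}},
            ]},
        },
    ]
}
print(gpu_requested(pods, "alpha.kubernetes.io/nvidia-gpu"))
```

Dividing the result by the total number of GPU cards in the cluster would give an allocation percentage comparable to the CPU request percentage.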

putcn commented Oct 20, 2017

I just found that kube-dashboard has CPU and memory usage charts, which might be useful in our report.
[Screenshot: 2017-10-20 at 4:25:16 pm]

Collaborator

helinwang commented Oct 23, 2017

Discussed with @putcn; perhaps we need to:

  1. Check in all code for the experiment, so that it is easy for people outside of the PaddlePaddle team to reproduce, and easy for us to reproduce in the future.
  2. Collect time series data for plotting graphs (e.g., cluster utilization vs. time).
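
The time series collection in point 2 could be sketched as a small sampling loop that records a timestamp and a utilization value at a fixed interval and writes the rows as CSV for plotting. Everything here is an illustrative assumption: the function names, the column names, and the `sample_fn` callback, which stands in for whatever actually measures cluster utilization.

```python
import csv
import io
import time


def collect_samples(sample_fn, count, interval_s=0.0,
                    clock=time.time, sleep=time.sleep):
    """Call sample_fn `count` times, pausing interval_s between calls."""
    rows = []
    for i in range(count):
        rows.append((clock(), sample_fn()))
        # No need to wait after the last sample.
        if i + 1 < count:
            sleep(interval_s)
    return rows


def to_csv(rows):
    """Serialize (timestamp, utilization) rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "utilization_pct"])
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder sampler; a real run would query the cluster instead.
rows = collect_samples(lambda: 96, count=3, interval_s=0.0)
print(to_csv(rows))
```

Passing `clock` and `sleep` as parameters keeps the loop easy to test and makes a checked-in experiment script reproducible without waiting in real time.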

typhoonzero previously approved these changes Oct 24, 2017

Collaborator

typhoonzero left a comment

LGTM except one comment


## Before Starting The Experiment

- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
Collaborator

I think all the demos in the book should be tested!

Collaborator Author

Done.

Yancey1989
Collaborator Author

@helinwang
Completely agree! Making the experiment easy to reproduce is important.

Collaborator

typhoonzero left a comment

LGTM!

Yancey1989 merged commit 7bbcd45 into PaddlePaddle:develop Oct 24, 2017
Yancey1989 deleted the experiment_design branch October 24, 2017 06:00
5 participants