[AutoParallel] Support multi machine case for the visualize tool #59179

AndSonder · 2023-11-20T14:50:35Z

PR types

Others

PR changes

Others

Description

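PR #58313 implemented visualization of the pipeline-parallel timeline in static graph mode. This PR adds multi-machine support to that tool.

In multi-machine mode, the user must manually copy the logs from every machine into a single directory and organize them in the layout shown below. The example uses a test environment with 2 machines and 2 GPUs per machine.

Log directory structure: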

multi_machine_logs
├── machine0
│   ├── workerlog.0
│   └── workerlog.1
├── machine1
│   ├── workerlog.0
│   └── workerlog.1
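
The layout above has to be assembled by hand. As a minimal sketch of that step, the helper below copies `workerlog.*` files into `machine0`, `machine1`, ... subdirectories. It is not part of this PR: the function name `gather_machine_logs` is hypothetical, and it assumes each machine's log directory has already been transferred to the local filesystem.

```python
import shutil
from pathlib import Path


def gather_machine_logs(source_dirs, target_root):
    """Copy workerlog.* files from each source directory into
    target_root/machine<i>/, where i is the position in source_dirs.

    Hypothetical helper: assumes every entry in source_dirs is a local
    directory holding one machine's workerlog.* files.
    """
    target_root = Path(target_root)
    copied = []
    for index, source in enumerate(source_dirs):
        machine_dir = target_root / f"machine{index}"
        machine_dir.mkdir(parents=True, exist_ok=True)
        # Sort so the copy order (and the returned list) is deterministic.
        for log_file in sorted(Path(source).glob("workerlog.*")):
            destination = machine_dir / log_file.name
            shutil.copy2(log_file, destination)
            copied.append(destination)
    return copied
```

Calling `gather_machine_logs(["logs_from_node_a", "logs_from_node_b"], "multi_machine_logs")` (both source paths are placeholders) would produce the two-machine layout shown above.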

Add the --multi_machine flag when running the tool:

python python/paddle/distributed/auto_parallel/static/profiler_helper_static.py --devices 0,1 --log_dir /home/workspace/PaddleNLP/model_zoo/gpt-3/log_auto_6.7B_mp2pp4_st/multi_machine_logs/ --multi_machine
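
Before running the command it can help to confirm that the directory matches the expected layout. The check below is a sketch under stated assumptions, not the loading logic of `profiler_helper_static.py`: the function name `find_worker_logs` is hypothetical, and it assumes that each id passed to `--devices` corresponds to a `workerlog.<id>` file inside every machine directory.

```python
from pathlib import Path


def find_worker_logs(log_dir, devices):
    """Return {(machine_dir_name, device_id): path} for the expected layout.

    Hypothetical check: assumes each device id in the comma-separated
    `devices` string maps to a file named workerlog.<id> in every
    subdirectory of log_dir. Raises FileNotFoundError when a file is missing.
    """
    log_dir = Path(log_dir)
    device_ids = [int(d) for d in devices.split(",")]
    # Each subdirectory of log_dir is treated as one machine.
    machine_dirs = sorted(p for p in log_dir.iterdir() if p.is_dir())
    if not machine_dirs:
        raise FileNotFoundError(f"no machine directories under {log_dir}")
    found = {}
    for machine_dir in machine_dirs:
        for device_id in device_ids:
            path = machine_dir / f"workerlog.{device_id}"
            if not path.is_file():
                raise FileNotFoundError(f"missing {path}")
            found[(machine_dir.name, device_id)] = path
    return found
```

For the 2-machine, 2-GPU example, `find_worker_logs("multi_machine_logs", "0,1")` would return four entries, one per worker log.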

Visualization result:

Depends on PR:

[AutoParallel] Visualize flow parallel timing diagram in static graph mode #58313


paddle-bot · 2023-11-20T14:50:40Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

From00

LGTM

…dlePaddle#59179)
* merge from openvino master
* add InterpreterRunTime() to record interpreter's run time
* add profiler helper static to produce json file
* add color map and support perfetto format
* recover codes
* control include env for gpu_timer.h
* fix logic for profiler_helper_static.py
* fix build error
* fix build error
* recover thirdparty
* add flag control: not support new ir now
* set auto_parallel_profiler flag to false
* fix
* add auto_parallel_profiler as command parameter
* fix value name
* support gettimeofday for win env
* fix win build error
* fix win build error
* use job_type_to_id
* Fixed repeatedly timing the same stream
* add step line for timeline
* add step timeline and fix logic when job overlap
* update time record logic
* fix bug when start profile start from none zero step
* fix note
* remove FLAGS_auto_parallel_profiler
* use run config instead FLAGS_auto_parallelxx
* fix color map logic
* fix color map logic
* fix bug when log step does not start from 0
* fix
* fix
* don't use set_enable_auto_parallel_profiler
* fix bug
* disable auto_parallel_profiler when not open flag by command line
* fix bug
* remove resettime
* fix build bug
* fix
* remove set enable
* fix build error
* fix build error
* fix build error
* fix ci error
* fix
* fix run error
* fix
* fix
* fix calculate_stream_timer logic
* remove fluid head
* fix build error
* set default value for enable_job_schedule_profiler
* support multi machine
* fix load dir logic

AndSonder and others added 30 commits October 18, 2023 04:43

merge from openvino master

c514fbd

add InterpreterRunTime() to record interpreter's run time

0147f70

add profiler helper static to produce json file

6d1dc3d

add color map and support perfetto format

6f4f67c

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_profiler

14fd116

recover codes

4d51610

control include env for gpu_timer.h

c70d9f9

fix logic for profiler_helper_static.py

ad0f17a

fix build error

e0442c6

fix build error

a8a37bb

recover thirdparty

a20e6ce

add flag control: not support new ir now

3e10a6d

set auto_parallel_profiler flag to false

59b425e

fix

ddc5038

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_profiler

14f6228

add auto_parallel_profiler as command parameter

1dfc816

fix value name

9f271ef

support gettimeofday for win env

dabf964

fix win build error

6ad6f36

fix win build error

d58cc94

use job_type_to_id

e9886ae

Fixed repeatedly timing the same stream

282285b

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_profiler

3b0db0c

add step line for timeline

fdc3f6d

add step timeline and fix logic when job overlap

1ceadc5

update time record logic

679cc39

Merge branch 'develop' into add_profiler

8953ae9

fix bug when start profile start from none zero step

1a04fea

fix note

e1c619d

Merge branch 'add_profiler' of https://github.com/AndSonder/Paddle into add_profiler

58c9f65

AndSonder added 21 commits November 9, 2023 13:55

fix bug

13b14d1

remove resettime

5bb55e1

fix build bug

f422b33

fix

ed5f7fc

remove set enable

718cf17

fix build error

f36b57b

fix build error

444b7a7

fix build error

f494916

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_profiler

28f089f

fix ci error

a2b5988

fix

fb748d9

fix run error

aa5570d

fix

6b18e10

fix

f096253

fix calculate_stream_timer logic

560fb61

remove fluid head

bbb3071

fix build error

e15c19e

set default value for enable_job_schedule_profiler

989348c

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_profiler

10b84d8

support multi machine

5cfd132

fix load dir logic

0bff0af

paddle-bot bot added the contributor External developers label Nov 20, 2023

AndSonder force-pushed the support_multi_multimachine branch from 34feef0 to 0bff0af Compare November 21, 2023 06:04

Merge branch 'develop' into support_multimachine

d32e4c5

AndSonder mentioned this pull request Nov 22, 2023

[WeeklyReports] 2023.11.08~2023.11.21 周报汇总 PFCCLab/Camp#77

Closed

21 tasks

From00 approved these changes Nov 25, 2023


From00 merged commit 6ab2fce into PaddlePaddle:develop Nov 25, 2023

AndSonder deleted the support_multi_multimachine branch April 23, 2024 13:56
