[AutoParallel] Support multi machine case for the visualize tool #59179

Merged · 63 commits · Nov 25, 2023

@AndSonder AndSonder commented Nov 20, 2023

PR types

Others

PR changes

Others

Description

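PR #58313 implemented visualization of the pipeline-parallel timeline in static graph mode. This PR adds multi-machine support to that tool.

In multi-machine mode, the user must manually copy the log data from every machine into a single directory and organize it in the format shown below. The example uses a test environment with 2 machines and 2 GPUs per machine.

Log folder structure: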

multi_machine_logs
├── machine0
│   ├── workerlog.0
│   └── workerlog.1
├── machine1
│   ├── workerlog.0
│   └── workerlog.1
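
Since the logs are gathered by hand, it can help to check the layout before running the tool. The following is a minimal sketch, not the tool's own loading logic: `collect_worker_logs` is a hypothetical helper, and it assumes every immediate subdirectory of the log directory is one machine's folder containing files named `workerlog.<rank>`.

```python
from pathlib import Path


def collect_worker_logs(log_dir):
    """Map each machine directory name to its workerlog paths, ordered by rank.

    Hypothetical helper for sanity-checking the layout above; it is not
    part of profiler_helper_static.py.
    """
    root = Path(log_dir)
    logs = {}
    # Assumption: each immediate subdirectory holds one machine's logs.
    for machine_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            # Keep only files whose suffix is a numeric rank, e.g. workerlog.0
            (p for p in machine_dir.glob("workerlog.*") if p.suffix[1:].isdigit()),
            key=lambda p: int(p.suffix[1:]),  # numeric sort, so 10 comes after 2
        )
        if not files:
            raise FileNotFoundError(f"no workerlog.<rank> files in {machine_dir}")
        logs[machine_dir.name] = files
    return logs
```

Calling `collect_worker_logs("multi_machine_logs")` on the tree above would return one entry each for `machine0` and `machine1`, and would raise if a machine folder was copied without its worker logs.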

Add the --multi_machine flag when running the tool:

python python/paddle/distributed/auto_parallel/static/profiler_helper_static.py --devices 0,1 --log_dir /home/workspace/PaddleNLP/model_zoo/gpt-3/log_auto_6.7B_mp2pp4_st/multi_machine_logs/ --multi_machine
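
If the tool is driven from a script, the same invocation can be assembled as an argument list. This is only a sketch: `build_visualize_command` is a hypothetical wrapper, and it uses nothing beyond the script path and the three flags shown in the command above.

```python
import sys

# Script path as given in this PR, relative to the Paddle repository root.
SCRIPT = "python/paddle/distributed/auto_parallel/static/profiler_helper_static.py"


def build_visualize_command(devices, log_dir, multi_machine=True):
    """Build the argv list for the visualize tool (hypothetical wrapper)."""
    cmd = [
        sys.executable,
        SCRIPT,
        "--devices", ",".join(str(d) for d in devices),  # e.g. "0,1"
        "--log_dir", str(log_dir),
    ]
    if multi_machine:
        # Flag added by this PR for logs gathered from several machines.
        cmd.append("--multi_machine")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.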

The resulting visualization looks like this:

image

Depends on PR:

paddle-bot bot commented Nov 20, 2023

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Nov 20, 2023

@From00 From00 left a comment

LGTM

@From00 From00 merged commit 6ab2fce into PaddlePaddle:develop Nov 25, 2023
28 checks passed
SecretXV pushed a commit to SecretXV/Paddle that referenced this pull request Nov 28, 2023
…dlePaddle#59179)

* merge from openvino master

* add InterpreterRunTime() to record interpreter's run time

* add profiler helper static to produce json file

* add color map and support perfetto format

* recover codes

* control include env for gpu_timer.h

* fix logic for profiler_helper_static.py

* fix build error

* fix build error

* recover thirdparty

* add flag control: not support new ir now

* set auto_parallel_profiler flag to false

* fix

* add auto_parallel_profiler as command parameter

* fix value name

* support gettimeofday for win env

* fix win build error

* fix win build error

* use job_type_to_id

* Fixed repeatedly timing the same stream

* add step line for timeline

* add step timeline and fix logic when job overlap

* update time record logic

* fix bug when start profile start from none zero step

* fix note

* remove FLAGS_auto_parallel_profiler

* use run config instead FLAGS_auto_parallelxx

* fix color map logic

* fix color map logic

* fix bug when log step does not start from 0

* fix

* fix

* don't use set_enable_auto_parallel_profiler

* fix bug

* disable auto_parallel_profiler when not open flag by command line

* fix bug

* remove resettime

* fix build bug

* fix

* remove set enable

* fix build error

* fix build error

* fix build error

* fix ci error

* fix

* fix run error

* fix

* fix

* fix calculate_stream_timer logic

* remove fluid head

* fix build error

* set default value for enable_job_schedule_profiler

* support multi machine

* fix load dir logic
@AndSonder AndSonder deleted the support_multi_multimachine branch April 23, 2024 13:56