[Auto Parallel] Compatible new comm library upgrade #56604
Conversation
fluid operator support new comm library
…uce_scatter and c_scatter.
1. Remove useless loggings. 2. Fix conditional compilation for HIP. 3. Fix problems in test_pass_generation_pipeline.py: it calls paddle.distributed.init_parallel_env() first, then auto.Engine calls _init_comm(), which in turn calls process_group.instantiate(). However, init_parallel_env() calls paddle.distributed.barrier(), which calls CreateNCCLEnvCache and creates the corresponding NCCLCommContext. Since dev_id is not set at that point, the NCCLCommContext's dev_ctx is never initialized. (A sketch of this call ordering follows the commit list below.)
… static_compat_upgrade
will be submitted in another PR.
… pure_static_compat_upgrade
… pure_static_compat_upgrade
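To make the ordering issue from the commit message above concrete, here is a minimal sketch of the call sequence. It is an illustration, not code from this PR: it assumes a multi-process launch (e.g. via paddle.distributed.launch), and the comments restate the steps from the commit message.

```python
import paddle.distributed as dist

# 1. init_parallel_env() internally calls dist.barrier().
# 2. barrier() reaches CreateNCCLEnvCache, which creates the
#    NCCLCommContext for the default process group.
# 3. device_id has not been set at this point, so that NCCLCommContext
#    is built without an initialized dev_ctx.
dist.init_parallel_env()

# 4. Later, auto.Engine calls _init_comm(), which calls
#    process_group.instantiate() and reuses the cached NCCLCommContext
#    whose dev_ctx was never initialized -- the failure seen in
#    test_pass_generation_pipeline.py.
```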
LGTM
```cpp
// @@ -67,7 +77,29 @@ void CommContextManager::CreateNCCLCommContext(
auto nccl_comm_context =
    std::make_unique<NCCLCommContext>(rank, size, nccl_id);
auto& comm_context_manager = CommContextManager::GetInstance();

if (CommContextManager::device_id != -1) {
```
why not use dev_ctx from global dev_ctx pool?
NCCLCommContext holds a std::unique_ptr<phi::GPUContext> dev_ctx, so we create a new object outside of DeviceContextPool's control.
… pure_static_compat_upgrade
LGTM
* for verify fluid operator support new comm library
* u
* u
* u
* compatible new comm library upgrade for c_allgather, c_reduce, c_reduce_scatter and c_scatter
* Remove useless comments in process_group.py
* Polish code style.
* Fix some problems.
* Remove use of fluid API in phi comm_context_manager.
* Add PADDLE_WITH_CUDA and PADDLE_WITH_NCCL macro checks.
* Fix bug of HIP architecture.
* Fix some problems: 1. Remove useless loggings. 2. Fix conditional compilation for HIP. 3. Fix problems in test_pass_generation_pipeline.py: it calls paddle.distributed.init_parallel_env() first, then auto.Engine calls _init_comm(), which in turn calls process_group.instantiate(). However, init_parallel_env() calls paddle.distributed.barrier(), which calls CreateNCCLEnvCache and creates the corresponding NCCLCommContext. Since dev_id is not set at that point, the NCCLCommContext's dev_ctx is never initialized.
* Fix some problems.
* Polish code.
* Polish code.
* Revert compatible upgrade for communication operators; their upgrades will be submitted in another PR.
* Remove StaticTCPStore.
* Remove useless modification.
* Remove useless set_cuda_device_id.
* Polish code.
* Remove fluid header files in phi files.
* Remove useless comments.
* Fix problems of HIP arch.
* Fix some problems.
* Polish code.
* Polish code style.

---------

Co-authored-by: hitywt <yuwentao126@126.com>
PR types
Others
PR changes
Others
Description
Pcard-70448
Upgrade fluid operators to be compatible with the new communication library. The old communication library is used by default; switch to the new one by exporting `FLAGS_dynamic_static_unified_comm=1`. We can make the new communication library the default once all communication operators have been compatibly upgraded.
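As a usage illustration only, here is a minimal sketch of opting in. It assumes that, like other Paddle FLAGS, this flag can also be set through the environment before Paddle initializes, which is equivalent to the `export` above.

```python
import os

# Assumption: like other Paddle FLAGS, this one is read from the
# environment at startup; equivalent to running
# `export FLAGS_dynamic_static_unified_comm=1` in the shell.
os.environ["FLAGS_dynamic_static_unified_comm"] = "1"

import paddle.distributed as dist

# Communication from here on goes through the new library instead of
# the old default.
dist.init_parallel_env()
```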