[Custom Device]add run_check support for custom device #56318

Merged: 6 commits merged into PaddlePaddle:develop from run_check_support_c_d on Aug 17, 2023

Contributor

@USTCKAY USTCKAY commented Aug 15, 2023

PR types

New features

PR changes

Others

Description

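Adds run_check support for Custom Device. Currently only a single type of Custom Device is supported (for example, when only the Ascend NPU is installed). The test results are as follows: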

I0816 00:21:36.809013 46151 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:21:36.809063 46151 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:21:45.779037 46151 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:21:45.785408 46151 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:21:45.785624 46151 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:21:45.785661 46151 init.cc:245] CustomDevice: npu, visible devices count: 8
Running verify PaddlePaddle program ...
I0816 00:21:46.421669 46151 program_interpreter.cc:173] New Executor is Running.
I0816 00:21:56.070345 46151 interpreter_util.cc:602] Standalone Executor is Used.
PaddlePaddle works well on 1 npu.
I0816 00:22:03.654654 47214 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.654700 47214 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.654707 47218 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.654744 47218 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.679235 47224 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.679277 47224 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.679328 47222 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.679370 47222 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.733134 47226 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.733175 47226 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.745720 47216 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.745798 47216 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.774627 47228 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.774673 47228 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:22:03.844810 47220 init.cc:239] ENV [CUSTOM_DEVICE_ROOT]=/opt/py37env/lib/python3.7/site-packages/paddle_custom_device
I0816 00:22:03.844862 47220 init.cc:145] Try loading custom device libs from: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.845295 47224 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.845419 47214 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.845582 47216 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.845983 47218 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.845981 47226 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.846091 47222 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.846371 47220 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.847659 47228 custom_device.cc:1112] Successed in loading custom runtime in lib: /opt/py37env/lib/python3.7/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0816 00:23:01.853188 47222 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.853382 47222 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.853421 47222 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.855057 47216 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.855264 47216 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.855311 47216 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.856283 47214 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.856554 47214 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.856628 47214 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.856930 47224 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.857210 47224 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.857270 47224 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.861477 47228 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.861799 47228 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.861892 47228 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.869469 47218 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.871120 47218 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.871223 47218 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.871992 47226 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.872675 47226 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.872776 47226 init.cc:245] CustomDevice: npu, visible devices count: 8
I0816 00:23:01.876528 47220 custom_kernel.cc:76] Successed in loading 325 custom kernel(s) from loaded lib(s), will be used like native ones.
I0816 00:23:01.877274 47220 init.cc:157] Finished in LoadCustomDevice with libs_path: [/opt/py37env/lib/python3.7/site-packages/paddle_custom_device]
I0816 00:23:01.877414 47220 init.cc:245] CustomDevice: npu, visible devices count: 8
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
=======================================================================
I0816 00:23:02.191779 47222 tcp_utils.cc:107] Retry to connect to 127.0.0.1:36736 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
=======================================================================
I0816 00:23:02.230875 47214 tcp_utils.cc:181] The server starts to listen on IP_ANY:36736
I0816 00:23:02.231042 47214 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
=======================================================================
I0816 00:23:02.243198 47224 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
=======================================================================
I0816 00:23:02.250854 47228 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
=======================================================================
I0816 00:23:02.257926 47218 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
=======================================================================
I0816 00:23:02.268605 47216 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
=======================================================================
I0816 00:23:02.296648 47226 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_npu_storage_format', current_value=True, default_value=False)
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
=======================================================================
I0816 00:23:02.329967 47220 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
I0816 00:23:05.192014 47222 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36736
I0816 00:23:36.681679 47659 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 8 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
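
The log above can also be checked programmatically. The following is a minimal sketch, not part of the PR: `summarize_run_check_log` is a hypothetical helper that pulls out the `visible devices count` lines and the `works well on N ...` summary lines with regular expressions. It assumes the line formats shown in the log above and that the device type name does not itself end in "s".

```python
import re

# Patterns modelled on the log lines shown above (assumed formats).
VISIBLE = re.compile(r"CustomDevice: (\w+), visible devices count: (\d+)")
SUMMARY = re.compile(r"PaddlePaddle works well on (\d+) (\w+?)s?\.")


def summarize_run_check_log(text):
    """Return (visible, verified).

    visible:  dict mapping device type -> visible device count
    verified: list of (device type, count) from the summary lines, in order
    """
    visible = {}
    verified = []
    for line in text.splitlines():
        match = VISIBLE.search(line)
        if match:
            visible[match.group(1)] = int(match.group(2))
            continue
        match = SUMMARY.search(line)
        if match:
            verified.append((match.group(2), int(match.group(1))))
    return visible, verified


sample = """\
I0816 00:21:45.785661 46151 init.cc:245] CustomDevice: npu, visible devices count: 8
PaddlePaddle works well on 1 npu.
PaddlePaddle works well on 8 npus.
"""
visible, verified = summarize_run_check_log(sample)
print(visible, verified)
```

For the run shown here, the single-device check and the 8-device check both appear in `verified`, and the largest verified count matches the visible device count.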

paddle-bot bot commented Aug 15, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

    if use_custom is True:
        import os

        os.environ['PADDLE_DISTRI_BACKEND'] = "xccl"
Contributor Author

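Distributed training selects its backend by reading the environment variable PADDLE_DISTRI_BACKEND, whose default value is auto. Here, once a custom device is detected, the backend is set to xccl explicitly so that a wrong backend is not chosen.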
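
The behaviour described in this comment can be illustrated with a small standalone sketch. `select_backend_env` is a hypothetical helper, not a Paddle function; the variable name `PADDLE_DISTRI_BACKEND`, the default `auto`, and the value `xccl` are taken from the comment and the snippet above. The `environ` parameter exists only so the sketch can be exercised without touching the real process environment.

```python
import os


def select_backend_env(use_custom, environ=None):
    """Force the xccl backend when a custom device is in use.

    Returns the backend name that a reader of the variable would see,
    falling back to "auto" when the variable is unset.
    """
    environ = os.environ if environ is None else environ
    if use_custom:
        # Set explicitly so the backend is not left to auto-detection.
        environ["PADDLE_DISTRI_BACKEND"] = "xccl"
    return environ.get("PADDLE_DISTRI_BACKEND", "auto")


fake_env = {}
print(select_backend_env(False, fake_env))  # variable unset
print(select_backend_env(True, fake_env))   # variable forced
```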

    elif 'npu' in device:
        return core.get_custom_device_count('npu')
    elif 'mlu' in device:
        return core.get_custom_device_count('mlu')
Contributor

Change this to an approach that works for all custom devices. Don't use string matching, which can only support the two hardware types npu and mlu.

Contributor Author

done
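
One way the reviewer's request in this thread could be met is to look the device type up among the registered custom device types instead of testing for fixed strings. The sketch below is illustrative only and is not the code that was merged. `custom_device_count` is a hypothetical helper; `registered_types` stands in for the result of `core.get_all_custom_device_type()` and `count_fn` for `core.get_custom_device_count`, both of which appear in the snippets in this PR. Returning 0 for an unregistered type is an assumption made for the sketch.

```python
def custom_device_count(device, registered_types, count_fn):
    """Return the device count for any registered custom device type.

    device:           a device string such as "npu" or "npu:0"
    registered_types: iterable of registered custom device type names
    count_fn:         callable mapping a type name to its device count
    """
    device_type = device.split(":", 1)[0]
    if device_type in registered_types:
        return count_fn(device_type)
    return 0  # assumption: unregistered types report no devices


# Stand-ins for the Paddle core calls, so the sketch runs on its own.
fake_counts = {"npu": 8, "my_accel": 2}
print(custom_device_count("npu:0", fake_counts.keys(), fake_counts.__getitem__))
print(custom_device_count("my_accel", fake_counts.keys(), fake_counts.__getitem__))
```

Because nothing in the helper names a specific vendor, a newly registered device type works without further changes.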

@@ -126,6 +130,8 @@ def _get_default_backend():
         return 'bkcl'
     elif 'cpu' in device:
         return 'gloo'
     elif 'npu' or 'mlu' in device:
         return 'xccl'
Contributor

Same as above: change this to support every hardware type registered through custom device, rather than using string matching.

Contributor Author

done
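
Apart from the reviewer's point, the condition `'npu' or 'mlu' in device` in the hunk above is parsed by Python as `'npu' or ('mlu' in device)`, so it is truthy for every device string. The sketch below demonstrates that and shows one possible registry-based alternative. It is not the merged code: `default_backend` and `registered_custom_types` are hypothetical names, and only the `cpu` and custom-device branches from the hunk are reproduced.

```python
def string_check_as_written(device):
    # Mirrors the condition in the hunk: 'npu' or ('mlu' in device).
    # The non-empty string 'npu' is truthy, so the result never depends
    # on the device argument.
    return bool('npu' or 'mlu' in device)


def default_backend(device, registered_custom_types):
    """Pick a backend from the device type rather than from substrings."""
    device_type = device.split(":", 1)[0]
    if device_type == "cpu":
        return "gloo"
    if device_type in registered_custom_types:
        return "xccl"
    raise ValueError(f"no default backend known for device {device!r}")


print(string_check_as_written("cpu"))           # True, even for a CPU string
print(default_backend("cpu", ["npu"]))          # gloo
print(default_backend("my_accel:1", ["my_accel"]))  # xccl
```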


    if paddle.is_compiled_with_cuda():
        use_cuda = _is_cuda_available()
    elif paddle.is_compiled_with_xpu():
        use_xpu = _is_xpu_available()
    elif len(paddle.framework.core.get_all_custom_device_type()) == 1:
        use_custom = _is_custom_device_available()
Contributor

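The check on line 259 should be > 0, because more than one custom device may be registered. Also, the logic implemented in _is_custom_device_available is the same as the elif on line 259, so the condition is duplicated and one of the two can be removed.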

Contributor Author

done
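
The selection order discussed in this thread, with the custom-device check relaxed to `> 0`, might look roughly like the following. This is a sketch under assumptions, not the merged implementation: `choose_check_target` is a hypothetical function, its boolean arguments stand in for `paddle.is_compiled_with_cuda()` and `paddle.is_compiled_with_xpu()`, `custom_types` stands in for `get_all_custom_device_type()`, and the final `"cpu"` fallback is assumed rather than shown in the snippet above.

```python
def choose_check_target(compiled_with_cuda, compiled_with_xpu, custom_types):
    """Decide which kind of device the installation check should target."""
    if compiled_with_cuda:
        return "cuda"
    if compiled_with_xpu:
        return "xpu"
    # "> 0" rather than "== 1": several custom device types may be registered.
    if len(custom_types) > 0:
        return "custom"
    return "cpu"  # assumed fallback when no accelerator applies


print(choose_check_target(False, False, ["npu"]))
print(choose_check_target(False, False, ["npu", "my_accel"]))
print(choose_check_target(False, False, []))
```

With `== 1`, the second call would have skipped the custom-device path entirely, which is the case the reviewer points out.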

custom_device_name
)
)
)
Contributor

By default only device[0] is run here. Check whether more than one device is registered, and if so add a warning message noting that only device[0] is checked.

Contributor Author

done
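
A warning of the kind requested in this thread could be emitted with the standard `warnings` module. The sketch below is an assumption about the shape of such a check, not the code that was merged: `device_type_to_check` is a hypothetical helper, the message wording is invented for illustration, and `custom_types` stands in for the list of registered custom device types.

```python
import warnings


def device_type_to_check(custom_types):
    """Return the first registered type, warning if others are skipped."""
    if not custom_types:
        raise ValueError("no custom device type is registered")
    if len(custom_types) > 1:
        # Tell the user that the remaining types are not verified.
        warnings.warn(
            f"{len(custom_types)} custom device types are registered; "
            f"only {custom_types[0]!r} will be checked.",
            stacklevel=2,
        )
    return custom_types[0]


print(device_type_to_check(["npu"]))              # no warning
print(device_type_to_check(["npu", "my_accel"]))  # warns, still returns "npu"
```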

Contributor

@qili93 qili93 left a comment

LGTM

@USTCKAY USTCKAY changed the title from "[Custom Dice]add run_check support for custom device" to "[Custom Device]add run_check support for custom device" Aug 16, 2023
@ronny1996 ronny1996 merged commit 0ba4a23 into PaddlePaddle:develop Aug 17, 2023
@USTCKAY USTCKAY deleted the run_check_support_c_d branch August 17, 2023 02:52
3 participants