[Semi-auto]Add Shard/Replicate/Partial in DistTensor #58930
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
paddle/fluid/pybind/eager_utils.cc
Outdated
#else
  PADDLE_THROW(platform::errors::Unavailable(
      "Placements to PyObject is not supported in the current "
      "PaddlePaddle, please recompile and installPaddlePaddle with the option "
install PaddlePaddle
done, thx
paddle/phi/common/reduce_type.h
Outdated
@@ -13,6 +13,7 @@
 // limitations under the License.

 #pragma once
+#include <ostream>
This header file is not actually used.
done thx
  ReduceType reduce_type_;
};

using Placements = std::vector<std::shared_ptr<Placement>>;
Could this be moved into the DistTensorMeta class?
done, thx
  }

  bool operator==(const Placement& other) const override {
    const Shard* other_shard = dynamic_cast<const Shard*>(&other);
Would it be better to add a shard_axis method to Placement and access it directly here?
This is specific to Shard, just like reduce_type_ in Partial.
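For context, a minimal self-contained sketch of the layout described in this reply: the sharded tensor dimension is state only Shard owns, just as the reduce type is state only Partial owns, so equality against an arbitrary Placement goes through dynamic_cast rather than a shard_axis accessor on the base class. The class and member names mirror the snippets in this review; the ReduceType stub and everything else are illustrative assumptions, not the PR's actual code.

```cpp
#include <cstdint>

// Stub for the real enum declared in paddle/phi/common/reduce_type.h.
enum class ReduceType { kRedSum };

class Placement {
 public:
  virtual ~Placement() = default;
  virtual bool operator==(const Placement& other) const { return false; }
};

class Shard : public Placement {
 public:
  explicit Shard(int64_t dim) : dim_(dim) {}

  bool operator==(const Placement& other) const override {
    // Only another Shard over the same tensor dimension compares equal.
    const Shard* other_shard = dynamic_cast<const Shard*>(&other);
    return other_shard != nullptr && other_shard->dim_ == dim_;
  }

 private:
  int64_t dim_;  // Shard-specific state.
};

class Partial : public Placement {
 public:
  explicit Partial(ReduceType reduce_type) : reduce_type_(reduce_type) {}

  bool operator==(const Placement& other) const override {
    // Only another Partial with the same reduction type compares equal.
    const Partial* other_partial = dynamic_cast<const Partial*>(&other);
    return other_partial != nullptr && other_partial->reduce_type_ == reduce_type_;
  }

 private:
  ReduceType reduce_type_;  // Partial-specific state.
};
```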
.def(py::init([](int64_t dim) {
  return std::make_shared<phi::distributed::Shard>(dim);
}))
.def("get_dim", &phi::distributed::Shard::get_dim)
Should the names used for tensor dimensions and mesh dimensions be distinguished? If both are called dim, it seems easy to confuse them, e.g. axis vs. dim.
The dim here refers only to the tensor's dim, so it is kept consistent with that.
  dist_attr.set_dims_mapping(dist_tensor_meta_.dim_mapping());
  dist_attr.mark_annotated("process_mesh");
  dist_attr.mark_annotated("dims_mapping");
  dist_attr_ = dist_attr;
It feels like only one of dist_tensor_meta_ and dist_attr_ should remain as a data member; otherwise there will be data-synchronization problems, and static semi-auto ran into quite a few pitfalls with this.
For example, dist_attr could always be a temporary variable, only returned through a function such as get_dist_attr(dist_tensor_meta_).
Yes, agreed. The plan is that DistTensor will eventually hold only a DistTensorMeta and a DenseTensor pointer, and everything else will be removed. But the change is large, so it is being split into separate PRs for now.
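A hedged sketch of the single-source-of-truth direction suggested here, assuming dist_tensor_meta_ stays the only stored state and TensorDistAttr is only ever materialized on demand: the set_dims_mapping and mark_annotated calls mirror the diff above, while set_process_mesh and the process_mesh()/dim_mapping() accessors are assumptions about the surrounding API rather than confirmed signatures.

```cpp
// Derive a temporary TensorDistAttr from the meta instead of caching it,
// so dist_attr can never drift out of sync with dist_tensor_meta_.
TensorDistAttr get_dist_attr(const DistTensorMeta& meta) {
  TensorDistAttr dist_attr;
  dist_attr.set_process_mesh(*meta.process_mesh());  // assumed accessor
  dist_attr.set_dims_mapping(meta.dim_mapping());
  dist_attr.mark_annotated("process_mesh");
  dist_attr.mark_annotated("dims_mapping");
  return dist_attr;  // always a temporary, never stored as a member
}
```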
@@ -121,12 +146,14 @@ class DistTensor final
  private:
   friend class ReshardFunction;

-  // The global dimensions(shape)
+  // The global dimensions(shape), will move to DistTensorMeta
It feels like tensor shape, mesh, and dims mapping will all use dims... This naming may need some discussion later.
Yes, the naming is not yet unified.
 private:
  std::shared_ptr<const ProcessMesh> process_mesh_;
  Placements placements_;
  std::shared_ptr<const DenseTensorMeta> tensor_meta_;
Does the entire DenseTensorMeta need to be stored here?
We need to keep dtype, strides, layout, and similar information to serve later as a cache for inference; it is basically the same as DenseTensorMeta.
self._mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

def run_test_placements(self):
    self.placements = [core.Replicate(), core.Replicate()]
Should these classes be imported directly under paddle.distributed, instead of using the core prefix?
Yes, this unit test is currently a temporary one. It does not touch the API; the API will be reworked as a whole later, and this test will be updated then.
 public:
  virtual ~Placement() = default;

  virtual bool is_shard(std::optional<int> dim = std::nullopt) const {
Will the Placement base class ever be instantiated? If not, should these just be pure virtual functions?
No actual instances of the base class will exist.
LGTM
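For reference, a minimal sketch of the trade-off discussed in this thread, under the assumption (stated in the reply) that the base class is never instantiated on its own: is_shard keeps a non-pure default returning false so Replicate and Partial need no override, and only Shard overrides it, optionally checking the queried tensor dimension. Names follow the snippet above; the rest is illustrative.

```cpp
#include <optional>

class Placement {
 public:
  virtual ~Placement() = default;

  // Non-pure default: every non-Shard placement answers false without
  // having to provide its own override.
  virtual bool is_shard(std::optional<int> dim = std::nullopt) const {
    return false;
  }
};

class Shard : public Placement {
 public:
  explicit Shard(int dim) : dim_(dim) {}

  bool is_shard(std::optional<int> dim = std::nullopt) const override {
    // True for a generic query, or only when the queried dim matches.
    return !dim.has_value() || dim.value() == dim_;
  }

 private:
  int dim_;
};
```

The pure-virtual alternative raised in the question would force Replicate and Partial to add trivially false overrides of is_shard, which is the cost being weighed here.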
 private:
  std::shared_ptr<const ProcessMesh> process_mesh_;
  Placements placements_;
  std::shared_ptr<const DenseTensorMeta> tensor_meta_;
What is the reason for using shared_ptr here?
It saves one copy, and the tensor meta may be used elsewhere; its lifetime matches that of the dist tensor.
LGTM
LGTM for set_tests_properties(test_dist_tensor_api PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 100)
LGTM
PR types
New features
PR changes
Others
Description
DistTensor can now be constructed from Shard/Replicate/Partial, and conversion from SRP to dim_mapping is added. Constructing DistTensor via TensorDistAttr will be deprecated later. In dynamic semi-auto mode, DistAttr appears during inference and conversion, while SRP appears in Reshard.
Pcard-73145
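To make the Description concrete, here is a hedged sketch (not the PR's actual implementation) of the SRP-to-dim_mapping conversion it mentions: dims_mapping holds one entry per tensor dimension, -1 meaning replicated, and the entry for a sharded dimension recording the index of the mesh dimension that shards it. The helper name placements_to_dim_mapping is invented for illustration; Placement, Shard, is_shard, and get_dim follow the snippets in this review.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Assumes the phi::distributed::Placement/Shard classes shown earlier.
std::vector<int64_t> placements_to_dim_mapping(
    const std::vector<std::shared_ptr<phi::distributed::Placement>>& placements,
    int64_t tensor_ndim) {
  // Start fully replicated: no tensor dimension maps to any mesh dimension.
  std::vector<int64_t> dims_mapping(tensor_ndim, -1);
  for (size_t mesh_dim = 0; mesh_dim < placements.size(); ++mesh_dim) {
    const auto& placement = placements[mesh_dim];
    if (placement->is_shard()) {
      // Record which mesh dimension shards this tensor dimension.
      auto shard = std::dynamic_pointer_cast<phi::distributed::Shard>(placement);
      dims_mapping[shard->get_dim()] = static_cast<int64_t>(mesh_dim);
    }
    // Replicate and Partial leave the mapping untouched (-1).
  }
  return dims_mapping;
}
```

Partial placements do not contribute to dims_mapping at all; if TensorDistAttr tracks partial reductions separately (e.g. as a per-mesh-dimension partial status), that bookkeeping would live outside this helper.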