Feature/multigpu #4838
Conversation
paddle/framework/operator.h
Outdated
@@ -290,6 +290,15 @@ class ExecutionContext {
    return device_context_;
  }

  //! Get a input which has multiple variables.
The comment is not accurate; all our inputs/outputs are stored in a vector.
Done.
paddle/framework/operator.h
Outdated
  const std::vector<std::string>& Inputs(const std::string& name) const {
    return op_.Inputs(name);
  }
  //! Get an output which has multiple variables.
same as above
Done.
@@ -0,0 +1,17 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
What is this file used for?
It builds an operator-independent module, which may be used by MultiExecutor or other modules.
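For context, a rough sketch of what such an operator-independent communicator module might hold (names and contents here are illustrative, not necessarily what the file actually contains):

#include <vector>
#include <nccl.h>

namespace platform {

struct Communicator {
  std::vector<ncclComm_t> comms_;  // one NCCL communicator per participating GPU

  // Create a clique of communicators, one for every GPU id in `gpus`.
  void InitAll(const std::vector<int> &gpus) {
    comms_.resize(gpus.size());
    ncclCommInitAll(comms_.data(), static_cast<int>(gpus.size()), gpus.data());
  }
};

}  // namespace platform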
paddle/operators/nccl_op.cc
Outdated
  auto x_dims = ctx->GetInputsDim("X");

  // std::string reduction = ctx->Attrs().Get<std::string>("reduction");
are these checks needed?
Added the reduction check. Done.
paddle/operators/nccl_op.cc
Outdated
AddOutput("Out", "The output of Reduce op"); | ||
AddAttr<int>("root", | ||
"root gpu of the parameter. if not set(-1). hashed by name.") | ||
.SetDefault(-1); |
use a const value to represent -1, such as kInvalidGPUId
Done.
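For illustration, the suggested change might look like this (kInvalidGPUId is the name suggested above; the AddAttr fragment mirrors the diff):

// Named constant in place of the magic number -1.
constexpr int kInvalidGPUId = -1;

// In the op maker, the default then becomes self-describing:
AddAttr<int>("root",
             "root gpu of the parameter. if not set(-1). hashed by name.")
    .SetDefault(kInvalidGPUId);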
paddle/operators/nccl_op.cu
Outdated
    auto* comm = ctx.Input<Communicator>("Communicator");

    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
Is the cast needed here?
The returned value is a DeviceContext; the abstract base class doesn't expose a stream interface.
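For reference, the resulting call looks roughly like this (accessor names follow the diff fragment above):

    // The base DeviceContext has no stream() accessor, so the kernel downcasts
    // to the CUDA-specific context to obtain the cudaStream_t.
    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
                      ctx.device_context())
                      .stream();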
paddle/operators/nccl_op.cu
Outdated
    auto ins_names = ctx.Inputs("X");
    std::hash<std::string> hasher;
    for (size_t i = 0; i < ins.size(); ++i) {
      if (root == -1) {
replace -1 with a const value
Done.
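A possible shape of the kernel-side fallback after the rename, matching the hashing loop quoted above (the communicator member name is an assumption):

      if (root == kInvalidGPUId) {
        // No root given: pick a deterministic root GPU by hashing the input name.
        root = hasher(ins_names[i]) % comm->comms_.size();
      }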
paddle/framework/operator.h
Outdated
@@ -290,6 +290,16 @@ class ExecutionContext {
    return device_context_;
  }

  //! Get variables vector with same input name.
I think "Get actual name vector for this input." is better.
Done.
paddle/operators/nccl_op.cc
Outdated
    std::vector<int> gpus = Attr<std::vector<int>>("gpus");
    PADDLE_ENFORCE(!gpus.empty(), "Attr(gpus) should not be empty.");
    platform::Communicator *comm =
        scope.FindVar(name)->GetMutable<platform::Communicator>();
Maybe add a check such as
if (scope.FindVar(name) == nullptr) {...}
because this op doesn't have an InferShape to ensure the output is there.
Done.
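The suggested guard could look roughly like this (the exact enforce message is illustrative):

    auto* var = scope.FindVar(name);
    PADDLE_ENFORCE(var != nullptr, "Output variable is not found in the scope.");
    platform::Communicator *comm = var->GetMutable<platform::Communicator>();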
paddle/operators/nccl_op.cu
Outdated
See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Eigen is not used here.
Done.
    auto outs = ctx.MultiOutput<LoDTensor>("Out");

    std::string reduction = ctx.Attr<std::string>("reduction");
    ncclRedOp_t reduction_op_ = ncclSum;
Where do ncclSum, ncclMax, and ncclProd come from?
NCCL provides all four types of reduction operations: http://docs.nvidia.com/deeplearning/sdk/nccl-api/ncclapidoc.html#ncclredop_t
This operator's "reduction" attribute indicates which operation to use.
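A sketch of how the attribute could map onto ncclRedOp_t, consistent with the diff fragments above and using PADDLE_THROW as requested later in the review:

    std::string reduction = ctx.Attr<std::string>("reduction");
    ncclRedOp_t reduction_op_ = ncclSum;  // default
    if (reduction == "ncclSum") {
      reduction_op_ = ncclSum;
    } else if (reduction == "ncclProd") {
      reduction_op_ = ncclProd;
    } else if (reduction == "ncclMax") {
      reduction_op_ = ncclMax;
    } else if (reduction == "ncclMin") {
      reduction_op_ = ncclMin;
    } else {
      PADDLE_THROW("Invalid reduction. default ncclSum.");
    }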
paddle/operators/nccl_op.cu
Outdated
    } else if (reduction == "ncclProd") {
      reduction_op_ = ncclProd;
    } else {
      PADDLE_ENFORCE(false, "Invalid reduction. default ncclSum.");
Change PADDLE_ENFORCE(false, ...) to PADDLE_THROW.
Done.
paddle/operators/nccl_op_test.cu
Outdated
See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
rm this line
Done.
paddle/operators/nccl_op_test.cu
Outdated
  op2->SetInput("X", {"st"});
  op2->SetInput("Communicator", {"comm"});
  op2->SetOutput("Out", {"rt"});
  op2->SetAttr("root", {kRoot});
I think this line should be
op2->SetAttr("root", kRoot);
Done.
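For clarity, the intent of the fix (this reasoning is mine, not from the thread): with an int attribute, the scalar should be passed directly, since a braced list would be deduced as a vector.

  // Matches AddAttr<int>("root", ...): pass the scalar, not an initializer
  // list, which would otherwise be treated as std::vector<int>.
  op2->SetAttr("root", kRoot);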
template <class T>
void PerThreadProgram(int gpu_id, const f::OpDescBind &op_desc,
                      f::Scope *scope) {
  std::unique_lock<std::mutex> lk(mu);
Why is a lock needed here?
Because we call the GetMutable interface, which is not thread-safe.
In my understanding, each PerThreadProgram runs on an independent scope and place. In this situation, do we still have a thread-safety problem? Just want to make sure~
Yes, we do.
If I remove this lock guard, I actually get a segmentation fault. In my view,
T* GetMutable() {
  // Lazily allocates the held object with a plain, unsynchronized new T().
  if (!IsType<T>()) {
    holder_.reset(new PlaceholderImpl<T>(new T()));
  }
  return static_cast<T*>(holder_->Ptr());
}
uses the global new to allocate memory for type T, and at that point there is no guard on it.
The scope and place only separate the allocated pointers from each other, so the scope hierarchy only takes effect for the user's program built on top of the scope.
If we want thread safety, we need a lock around the new T, I think.
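A narrower alternative to locking the whole per-thread body would be to guard just the variable creation, which is where the unsynchronized allocation happens. A minimal sketch, assuming the test's Scope API and the "st" variable from the snippet above:

#include <mutex>

static std::mutex mu;  // shared by all worker threads in the test

void CreateVariableGuarded(f::Scope *scope) {
  // Serialize only the creation path: GetMutable lazily allocates with a
  // plain new T(), which is the part that is not thread-safe.
  std::lock_guard<std::mutex> guard(mu);
  scope->Var("st")->GetMutable<f::LoDTensor>();
}
// The rest of the per-thread work can proceed without holding the lock.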
paddle/operators/nccl_op_test.cu
Outdated
  }
}

// ncclAReduceOp with desc
ncclAReduceOp => ncclReduceOp
Done.
Great job!
NCCL is a library of MPI-like primitives that can be used to synchronize values across multiple GPU cards.
This is the simplest implementation of #3769; currently, we only support multiple GPU cards running the same topology and synchronizing parameters before the parameter optimization step.
To minimize the review workload, this PR only implements the AllReduce operator, which will be used frequently to synchronize parameters/gradients between GPU cards.
The other operators, Gather/Bcast, are left for future work.
To support the NCCL library in the current refactoring stage, here is the brief plan:
Every GPU runs the same graph/blocks and can only be synchronized at specific parameters/gradients.
This will be supported if performance becomes a bottleneck.
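For reference, a minimal sketch of the per-GPU NCCL call that an AllReduce operator kernel issues (the wrapper function and float dtype are assumptions; ncclAllReduce itself is the standard NCCL API):

#include <cuda_runtime.h>
#include <nccl.h>

// Every card contributes its local buffer and receives the reduced result
// asynchronously on its own stream.
void AllReduceOnDevice(const float *send_buf, float *recv_buf, size_t count,
                       ncclComm_t comm, cudaStream_t stream) {
  // ncclSum here; in the operator the "reduction" attribute selects the op.
  ncclAllReduce(send_buf, recv_buf, count, ncclFloat, ncclSum, comm, stream);
}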