Feature/multigpu #4838

Merged
merged 48 commits into from
Oct 29, 2017

Conversation

Contributor

@dzhwinter dzhwinter commented Oct 16, 2017

NCCL is a library of MPI-like collective primitives that can be used to synchronize values across multiple GPU cards.

This is the simplest implementation of #3769. Currently we only support the case where all GPU cards run the same topology, and parameters are synchronized before the optimization step.

To keep the review manageable, this PR only implements the AllReduce operator, which is the one used most frequently to synchronize parameters/gradients between GPU cards.
The other operators (Gather/Bcast) are left for future work.

To support the NCCL library at the current refactoring stage, here is the brief plan:

  • feature/multigpu NCCL support design doc. Multigpu Feature #3769
    Every GPU should run the same graph/blocks, and synchronization only happens at specific parameters/gradients.
  • third_party NCCL library integration. "add nccl cmake enforce" #4818
  • Construct NCCL communicators. To demonstrate that the communicators are implemented correctly, support the NCCL AllReduce primitive. Feature/multigpu #4838
  • AllGather/Bcast and the other NCCL primitives.

The following will only be supported if performance turns out to be a bottleneck:

  • hand-written allreduce/bcast primitive routines.
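
For readers unfamiliar with NCCL, the raw collective that the AllReduce operator in this PR wraps looks roughly like the sketch below. It is a minimal single-process, multi-GPU example assuming NCCL 2's group API; buffer sizes and the lack of error handling are illustrative only and do not reflect how the operator integrates with the framework.

  // Minimal sketch: single-process multi-GPU ncclAllReduce, the primitive
  // that the AllReduce operator in this PR wraps. Error checks are omitted.
  #include <cuda_runtime.h>
  #include <nccl.h>
  #include <vector>

  int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    // One communicator per device, all living in this process.
    std::vector<ncclComm_t> comms(n_gpus);
    ncclCommInitAll(comms.data(), n_gpus, nullptr);

    const size_t count = 1 << 20;  // number of floats per GPU
    std::vector<float*> send(n_gpus), recv(n_gpus);
    std::vector<cudaStream_t> streams(n_gpus);
    for (int i = 0; i < n_gpus; ++i) {
      cudaSetDevice(i);
      cudaMalloc(&send[i], count * sizeof(float));
      cudaMalloc(&recv[i], count * sizeof(float));
      cudaMemset(send[i], 0, count * sizeof(float));
      cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < n_gpus; ++i) {
      ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                    comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < n_gpus; ++i) {
      cudaSetDevice(i);
      cudaStreamSynchronize(streams[i]);
      cudaFree(send[i]);
      cudaFree(recv[i]);
      cudaStreamDestroy(streams[i]);
      ncclCommDestroy(comms[i]);
    }
    return 0;
  }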

@@ -290,6 +290,15 @@ class ExecutionContext {
return device_context_;
}

//! Get a input which has multiple variables.
Member

The comment is not accurate; all of our inputs/outputs are stored in vectors.

Contributor Author

Done.

const std::vector<std::string>& Inputs(const std::string& name) const {
return op_.Inputs(name);
}
//! Get an output which has multiple variables.
Member

same as above

Contributor Author

Done.

@@ -0,0 +1,17 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Member

What is this file used for?

Contributor Author

It builds an operator-independent module, which may be used by the MultiExecutor or other modules later.


auto x_dims = ctx->GetInputsDim("X");

// std::string reduction = ctx->Attrs().Get<std::string>("reduction");
Member

@jacquesqiao jacquesqiao Oct 26, 2017

are these checks needed?

Contributor Author

Added the reduction check. Done.

AddOutput("Out", "The output of Reduce op");
AddAttr<int>("root",
"root gpu of the parameter. if not set(-1). hashed by name.")
.SetDefault(-1);
Member

@jacquesqiao jacquesqiao Oct 26, 2017

use a const value to represent -1, such as kInvalidGPUId

Contributor Author

Done.


auto* comm = ctx.Input<Communicator>("Communicator");

auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
Member

Do we need the cast here?

Contributor Author

The returned value is a DeviceContext; the abstract base class does not expose a stream interface.
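
For context, a sketch of the complete expression the truncated diff line above belongs to; the device_context() accessor and the stream() method are assumptions based on the surrounding diff and the CUDADeviceContext API, and may differ slightly in the merged version.

  // DeviceContext is the abstract base class; only CUDADeviceContext knows
  // about CUDA streams, so the kernel downcasts before handing the stream to NCCL.
  auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
                    ctx.device_context())
                    .stream();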

auto ins_names = ctx.Inputs("X");
std::hash<std::string> hasher;
for (size_t i = 0; i < ins.size(); ++i) {
if (root == -1) {
Member

replace -1 with a const value

Contributor Author

Done.
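
A small self-contained sketch of the "hashed by name" fallback shown in the snippet above, with the reviewer's suggested constant substituted for the literal -1. The helper name, parameters, and the exact modulus are illustrative, not necessarily what the merged kernel does.

  #include <functional>
  #include <string>

  constexpr int kInvalidGPUId = -1;  // named constant replacing the magic -1

  // Pick the root GPU for a parameter: use the explicit attribute when given,
  // otherwise hash the parameter name so every worker agrees deterministically.
  int RootForParam(int root_attr, const std::string& param_name, int gpu_count) {
    if (root_attr != kInvalidGPUId) {
      return root_attr;
    }
    std::hash<std::string> hasher;
    return static_cast<int>(hasher(param_name) % gpu_count);
  }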

@@ -290,6 +290,16 @@ class ExecutionContext {
return device_context_;
}

//! Get variables vector with same input name.
Member

I think

Get actual name vector for this input.

is better

Contributor Author

Done.
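
To illustrate the distinction behind the renamed comment, here is a sketch contrasting the two accessors; MultiInput mirrors the MultiOutput call visible elsewhere in this diff, and the consistency check is only illustrative.

  // Inputs("X") returns the variable *names* bound to the "X" slot, while
  // MultiInput<LoDTensor>("X") resolves those names to the actual tensors.
  const std::vector<std::string>& names = ctx.Inputs("X");
  auto tensors = ctx.MultiInput<LoDTensor>("X");
  PADDLE_ENFORCE(names.size() == tensors.size(),
                 "Every input name should resolve to one tensor.");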

std::vector<int> gpus = Attr<std::vector<int>>("gpus");
PADDLE_ENFORCE(!gpus.empty(), "Attr(gpus) should not be empty.");
platform::Communicator *comm =
scope.FindVar(name)->GetMutable<platform::Communicator>();
Member

@jacquesqiao jacquesqiao Oct 28, 2017

Maybe add a check:

if (scope.FindVar(name) == nullptr) {...}

because this op doesn't have InferShape to ensure the output is there.

Contributor Author

Done.
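
A sketch of the check being asked for, using the same PADDLE_ENFORCE macro that appears elsewhere in the diff; the message text is illustrative.

  // The Communicator variable is created outside the usual InferShape path,
  // so make sure it exists in the scope before calling GetMutable on it.
  auto* var = scope.FindVar(name);
  PADDLE_ENFORCE(var != nullptr,
                 "Can not find the Communicator variable in the scope.");
  platform::Communicator* comm = var->GetMutable<platform::Communicator>();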

See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Member

We do not use Eigen here.

Contributor Author

Done.

auto outs = ctx.MultiOutput<LoDTensor>("Out");

std::string reduction = ctx.Attr<std::string>("reduction");
ncclRedOp_t reduction_op_ = ncclSum;
Member

@jacquesqiao jacquesqiao Oct 28, 2017

Where do ncclSum, ncclMax, and ncclProd come from?

Contributor Author

NCCL provides all four reduction operations: http://docs.nvidia.com/deeplearning/sdk/nccl-api/ncclapidoc.html#ncclredop_t

The operator's "reduction" attribute indicates which operation to use.

} else if (reduction == "ncclProd") {
reduction_op_ = ncclProd;
} else {
PADDLE_ENFORCE(false, "Invalid reduction. default ncclSum.");
Member

Change PADDLE_ENFORCE(false, ...) to PADDLE_THROW.

Contributor Author

Done.
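
Putting the two review points together, here is a sketch of how the reduction attribute could map onto ncclRedOp_t with PADDLE_THROW on the invalid branch; the accepted attribute strings are assumptions based on the ones visible in this diff.

  std::string reduction = ctx.Attr<std::string>("reduction");
  ncclRedOp_t reduction_op_ = ncclSum;
  if (reduction == "ncclMin") {
    reduction_op_ = ncclMin;
  } else if (reduction == "ncclMax") {
    reduction_op_ = ncclMax;
  } else if (reduction == "ncclSum") {
    reduction_op_ = ncclSum;
  } else if (reduction == "ncclProd") {
    reduction_op_ = ncclProd;
  } else {
    // Fail loudly instead of silently falling back to ncclSum.
    PADDLE_THROW("Invalid reduction. default ncclSum.");
  }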

See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Member

rm this line

Contributor Author

Done.

op2->SetInput("X", {"st"});
op2->SetInput("Communicator", {"comm"});
op2->SetOutput("Out", {"rt"});
op2->SetAttr("root", {kRoot});
Member

I think this line should be

op2->SetAttr("root", kRoot);

Contributor Author

Done.

template <class T>
void PerThreadProgram(int gpu_id, const f::OpDescBind &op_desc,
f::Scope *scope) {
std::unique_lock<std::mutex> lk(mu);
Member

Why is a lock needed here?

Contributor Author

Because we call the GetMutable interface, which is not a thread-safe function.

Member

In my understanding, each PerThreadProgram runs on an independent scope and place. In this situation, do we still have a thread-safety problem? Just want to make sure~

Contributor Author

@dzhwinter dzhwinter Oct 29, 2017

Yes, we do.
If I remove this lock guard, I actually get a segmentation fault. In my view,

  T* GetMutable() {
    if (!IsType<T>()) {
      holder_.reset(new PlaceholderImpl<T>(new T()));
    }
    return static_cast<T*>(holder_->Ptr());
  }

GetMutable uses the global new to allocate memory for type T, and at this stage we do not have any guard on it.
The scope and place only separate the allocated pointers from each other, so the scope hierarchy only takes effect for the user's program built on top of the scope.

If we want thread safety, I think we need a lock around the new T.
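
As a standalone illustration of the race (not PaddlePaddle's actual Variable class), the toy GetMutable below lazily resets its holder just like the framework code quoted above. Without the mutex, two threads making the first call concurrently can both reset the holder, and one of them ends up using a freed object; the lock serializes the lazy initialization, as the unique_lock in the test does.

  #include <memory>
  #include <mutex>
  #include <thread>

  struct ToyVariable {
    std::unique_ptr<int> holder_;
    std::mutex mu_;

    int* GetMutable() {
      // Serialize the lazy initialization so only one thread allocates.
      std::lock_guard<std::mutex> lk(mu_);
      if (!holder_) {
        holder_.reset(new int(0));
      }
      return holder_.get();
    }
  };

  int main() {
    ToyVariable v;
    std::thread t1([&] { ++*v.GetMutable(); });
    std::thread t2([&] { ++*v.GetMutable(); });
    t1.join();
    t2.join();
    return 0;
  }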

}
}

// ncclAReduceOp with desc
Member

ncclAReduceOp => ncclReduceOp

Contributor Author

Done.

Member

@jacquesqiao jacquesqiao left a comment

Great job!

@dzhwinter dzhwinter merged commit 833d0ad into PaddlePaddle:develop Oct 29, 2017