[WIP] C++ implementation of parallel executor #9035
Conversation
… for record io reader; use double buffer reader
for (auto& op : ctx->ops_) {
  // sgd should wait for allreduce finished
  for (auto& param2argu : op->Inputs()) {
Instead of looping over every input every time, perhaps the op can cache the param inputs and wait only on those.
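A minimal sketch of that caching idea, with simplified stand-in types (OpInputs, ParamArgCache, and the member names are hypothetical, not this PR's API):

    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Mirrors op->Inputs(): slot name -> argument names.
    struct OpInputs {
      std::unordered_map<std::string, std::vector<std::string>> inputs;
    };

    // Scan Inputs() once, remember which arguments are param grads, and on
    // later iterations wait only on the cached list instead of re-scanning.
    class ParamArgCache {
     public:
      const std::vector<std::string>& Get(
          const OpInputs& op,
          const std::unordered_set<std::string>& param_grads) {
        if (!filled_) {
          for (const auto& param2argu : op.inputs) {
            for (const auto& argu : param2argu.second) {
              if (param_grads.count(argu)) cached_.push_back(argu);
            }
          }
          filled_ = true;
        }
        return cached_;  // the op then waits only on events for these args
      }

     private:
      bool filled_ = false;
      std::vector<std::string> cached_;
    };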
PADDLE_ENFORCE(
    cudaEventRecord(computation_event[argu], computation_stream));
PADDLE_ENFORCE(cudaStreamWaitEvent(all_reduce_stream,
This seems to block the next computation op. We should profile and see how much it hurts.
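For context, a self-contained sketch of the record/wait pattern in question (stream and event names are illustrative). Both calls return immediately on the host, so any stall would come from device-side ordering; profiling would confirm how much it actually delays the next computation op:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHECK_CUDA(call)                                        \
      do {                                                          \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
          std::printf("CUDA error: %s\n", cudaGetErrorString(err)); \
          return 1;                                                 \
        }                                                           \
      } while (0)

    int main() {
      cudaStream_t computation_stream, all_reduce_stream;
      cudaEvent_t computation_event;
      CHECK_CUDA(cudaStreamCreate(&computation_stream));
      CHECK_CUDA(cudaStreamCreate(&all_reduce_stream));
      // Timing disabled: cheaper events when used only for ordering.
      CHECK_CUDA(cudaEventCreateWithFlags(&computation_event,
                                          cudaEventDisableTiming));

      // ... enqueue a compute kernel on computation_stream here ...

      // Record completion of the compute work, then make the all-reduce
      // stream wait for it. The ordering constraint lives on the device,
      // so the host can keep enqueueing the next op right away.
      CHECK_CUDA(cudaEventRecord(computation_event, computation_stream));
      CHECK_CUDA(
          cudaStreamWaitEvent(all_reduce_stream, computation_event, 0));

      // ... enqueue the all-reduce on all_reduce_stream here ...

      CHECK_CUDA(cudaStreamSynchronize(all_reduce_stream));
      CHECK_CUDA(cudaEventDestroy(computation_event));
      CHECK_CUDA(cudaStreamDestroy(computation_stream));
      CHECK_CUDA(cudaStreamDestroy(all_reduce_stream));
      return 0;
    }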
@@ -86,6 +88,7 @@ class Scope {
  mutable std::unordered_map<std::string, Variable*> vars_;
  mutable std::list<Scope*> kids_;
  Scope const* parent_{nullptr};
  std::vector<std::shared_ptr<Scope>> replicas_;
This seems a little strange. Why would the Scope keep all the replicas? It's not a singleton?
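One alternative the question suggests, sketched with hypothetical names: keep Scope a plain var/kid hierarchy and let the executor own the per-device replicas instead.

    #include <cstddef>
    #include <memory>
    #include <vector>

    class Scope;  // the framework's Scope, declaration only

    // Illustrative alternative: the executor, not Scope, owns one replica
    // scope per device, so Scope never needs a replicas_ member.
    class ExecutorOwnsReplicas {
     public:
      explicit ExecutorOwnsReplicas(std::size_t num_devices)
          : replica_scopes_(num_devices) {}

     private:
      std::vector<std::shared_ptr<Scope>> replica_scopes_;
    };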
@@ -457,12 +457,39 @@ class BatchNormGradKernel<platform::CPUDeviceContext, T>
  }
};

class BatchNormGradMaker : public framework::SingleGradOpDescMaker
Is this a fix? Should it be checked in separately?
if (!(ins[i].place() == dev_place)) {
  LOG(INFO) << "Copy " << out_arg_names[i] << " from " << ins[i].place()
            << " to " << dev_place;
  framework::TensorCopy(ins[i], dev_place, out);
Does this ensure different GPUs get different data?
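A toy sketch of the sharding arithmetic this question is probing (plain C++, hypothetical helper): each device should receive a distinct row range of the global batch rather than a copy of the whole tensor.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Compute which slice of the global batch each device gets. If every
    // device instead received rows [0, batch), they would all see
    // identical data, which is the failure mode the question is about.
    std::vector<std::pair<std::size_t, std::size_t>> ShardBatch(
        std::size_t batch, std::size_t num_devices) {
      std::vector<std::pair<std::size_t, std::size_t>> shards;
      std::size_t per_dev = batch / num_devices;
      for (std::size_t i = 0; i < num_devices; ++i) {
        std::size_t begin = i * per_dev;
        std::size_t end = (i + 1 == num_devices) ? batch : begin + per_dev;
        shards.emplace_back(begin, end);  // device i gets rows [begin, end)
      }
      return shards;
    }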
if not isinstance(program, Program):
    raise TypeError()
if not isinstance(fetch_list, dict):
dict -> list?
std::unordered_set<std::string>* param_grads_;
};

class MultiGPUExecutor {
Can we extend it to be a MultiCPUThreadExecutor?
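A rough sketch of that generalization, with illustrative types only: parameterize the executor by a "place" so the same scheduling loop could drive either GPUs or CPU worker threads.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Hypothetical generalization: a Place names either a GPU ordinal or a
    // CPU worker index, so the executor need not hard-code CUDA devices.
    struct Place {
      enum class Kind { kCUDA, kCPU } kind;
      int id;  // GPU ordinal or CPU worker index
    };

    class MultiPlaceExecutor {
     public:
      explicit MultiPlaceExecutor(std::vector<Place> places)
          : places_(std::move(places)) {}

      std::size_t NumReplicas() const { return places_.size(); }

     private:
      std::vector<Place> places_;  // one worker thread per place
    };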
I am closing this PR since its alternative #9080 will be merged.
DO NOT MERGE THIS PR!
This PR will serve as a baseline for #9080. The main differences are:
In this PR, one thread is bound to one GPU. Each thread launches all Ops sequentially on the computation stream and launches AllReduce on the I/O stream; CUDA events coordinate the two streams (see the sketch after this list).
In #9080, dependency analysis is used to schedule ready Ops onto a thread pool.
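A condensed, hypothetical sketch of this PR's per-thread scheme (op execution and the all-reduce calls are elided; names are illustrative):

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // One host thread per GPU; each thread runs all ops in program order
    // on a computation stream, hands gradients to an all-reduce (I/O)
    // stream, and uses CUDA events to order the two streams.
    void RunReplica(int dev) {
      cudaSetDevice(dev);
      cudaStream_t compute, all_reduce;
      cudaStreamCreate(&compute);
      cudaStreamCreate(&all_reduce);

      // for each op, in program order:
      //   launch the op's kernels on `compute`
      //   if the op produced a gradient:
      //     cudaEventRecord(grad_ready, compute);
      //     cudaStreamWaitEvent(all_reduce, grad_ready, 0);
      //     enqueue the all-reduce on `all_reduce`
      //   before an op (e.g. SGD) consumes an all-reduced gradient:
      //     cudaEventRecord(reduced, all_reduce);
      //     cudaStreamWaitEvent(compute, reduced, 0);

      cudaStreamSynchronize(compute);
      cudaStreamSynchronize(all_reduce);
      cudaStreamDestroy(compute);
      cudaStreamDestroy(all_reduce);
    }

    int main() {
      int n = 0;
      cudaGetDeviceCount(&n);
      std::vector<std::thread> workers;
      for (int dev = 0; dev < n; ++dev) workers.emplace_back(RunReplica, dev);
      for (auto& t : workers) t.join();
      return 0;
    }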
machine: 250
test_script: test_parallel_executor.py in this PR
test_command:
CUDA_VISIBLE_DEVICES=3 python -m unittest test_parallel_executor.TestResnet
CUDA_VISIBLE_DEVICES=3,4,5,6 python -m unittest test_parallel_executor.TestResnet
model: SE_ResNeXt152
batch_size: 16 per GPU
model size: 1382651904 bytes
peak memory: 7351879168 bytes
1 GPU: 20.7775 instances per second
4 GPUs: 60.8804 instances per second