[WIP] C++ implementation of parallel executor #9035
Conversation
… for record io reader; use double buffer reader
for (auto& op : ctx->ops_) {
  // sgd should wait for allreduce finished
  for (auto& param2argu : op->Inputs()) {
Instead of looping over every input every time, perhaps the op can cache the param inputs and wait only on those.
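A minimal sketch of that caching idea, with simplified stand-in types (OpInputs, ParamArgCache, and the member names are hypothetical, not this PR's API):

    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Mirrors op->Inputs(): slot name -> argument names.
    struct OpInputs {
      std::unordered_map<std::string, std::vector<std::string>> inputs;
    };

    // Scan Inputs() once, remember which arguments are param grads, and on
    // later iterations wait only on the cached list instead of re-scanning.
    class ParamArgCache {
     public:
      const std::vector<std::string>& Get(
          const OpInputs& op,
          const std::unordered_set<std::string>& param_grads) {
        if (!filled_) {
          for (const auto& param2argu : op.inputs) {
            for (const auto& argu : param2argu.second) {
              if (param_grads.count(argu)) cached_.push_back(argu);
            }
          }
          filled_ = true;
        }
        return cached_;  // the op then waits only on events for these args
      }

     private:
      bool filled_ = false;
      std::vector<std::string> cached_;
    };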
PADDLE_ENFORCE(
    cudaEventRecord(computation_event[argu], computation_stream));
PADDLE_ENFORCE(cudaStreamWaitEvent(all_reduce_stream,
This seems to block the next computation op. We should profile and see how much it hurts.
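For context, a self-contained sketch of the record/wait pattern in question (stream and event names are illustrative). Both calls return immediately on the host, so any stall would come from device-side ordering; profiling would confirm how much it actually delays the next computation op:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHECK_CUDA(call)                                        \
      do {                                                          \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
          std::printf("CUDA error: %s\n", cudaGetErrorString(err)); \
          return 1;                                                 \
        }                                                           \
      } while (0)

    int main() {
      cudaStream_t computation_stream, all_reduce_stream;
      cudaEvent_t computation_event;
      CHECK_CUDA(cudaStreamCreate(&computation_stream));
      CHECK_CUDA(cudaStreamCreate(&all_reduce_stream));
      // Timing disabled: cheaper events when used only for ordering.
      CHECK_CUDA(cudaEventCreateWithFlags(&computation_event,
                                          cudaEventDisableTiming));

      // ... enqueue a compute kernel on computation_stream here ...

      // Record completion of the compute work, then make the all-reduce
      // stream wait for it. The ordering constraint lives on the device,
      // so the host can keep enqueueing the next op right away.
      CHECK_CUDA(cudaEventRecord(computation_event, computation_stream));
      CHECK_CUDA(
          cudaStreamWaitEvent(all_reduce_stream, computation_event, 0));

      // ... enqueue the all-reduce on all_reduce_stream here ...

      CHECK_CUDA(cudaStreamSynchronize(all_reduce_stream));
      CHECK_CUDA(cudaEventDestroy(computation_event));
      CHECK_CUDA(cudaStreamDestroy(computation_stream));
      CHECK_CUDA(cudaStreamDestroy(all_reduce_stream));
      return 0;
    }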
@@ -86,6 +88,7 @@ class Scope {
  mutable std::unordered_map<std::string, Variable*> vars_;
  mutable std::list<Scope*> kids_;
  Scope const* parent_{nullptr};
  std::vector<std::shared_ptr<Scope>> replicas_;
This seems a little strange. Why would the Scope keep all the replicas? It's not a singleton?
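One alternative the question suggests, sketched with hypothetical names: keep Scope a plain var/kid hierarchy and let the executor own the per-device replicas instead.

    #include <cstddef>
    #include <memory>
    #include <vector>

    class Scope;  // the framework's Scope, declaration only

    // Illustrative alternative: the executor, not Scope, owns one replica
    // scope per device, so Scope never needs a replicas_ member.
    class ExecutorOwnsReplicas {
     public:
      explicit ExecutorOwnsReplicas(std::size_t num_devices)
          : replica_scopes_(num_devices) {}

     private:
      std::vector<std::shared_ptr<Scope>> replica_scopes_;
    };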
@@ -457,12 +457,39 @@ class BatchNormGradKernel<platform::CPUDeviceContext, T>
  }
};

class BatchNormGradMaker : public framework::SingleGradOpDescMaker
Is this a fix? Should it be checked in separately?
if (!(ins[i].place() == dev_place)) {
  LOG(INFO) << "Copy " << out_arg_names[i] << " from " << ins[i].place()
            << " to " << dev_place;
  framework::TensorCopy(ins[i], dev_place, out);
Does this ensure different GPUs get different data?
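A toy sketch of the sharding arithmetic this question is probing (plain C++, hypothetical helper): each device should receive a distinct row range of the global batch rather than a copy of the whole tensor.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Compute which slice of the global batch each device gets. If every
    // device instead received rows [0, batch), they would all see
    // identical data, which is the failure mode the question is about.
    std::vector<std::pair<std::size_t, std::size_t>> ShardBatch(
        std::size_t batch, std::size_t num_devices) {
      std::vector<std::pair<std::size_t, std::size_t>> shards;
      std::size_t per_dev = batch / num_devices;
      for (std::size_t i = 0; i < num_devices; ++i) {
        std::size_t begin = i * per_dev;
        std::size_t end = (i + 1 == num_devices) ? batch : begin + per_dev;
        shards.emplace_back(begin, end);  // device i gets rows [begin, end)
      }
      return shards;
    }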
if not isinstance(program, Program):
    raise TypeError()
if not isinstance(fetch_list, dict):
dict -> list?
std::unordered_set<std::string>* param_grads_;
};

class MultiGPUExecutor {
Can we extend it to be a MultiCPUThreadExecutor?
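A rough sketch of that generalization, with illustrative types only: parameterize the executor by a "place" so the same scheduling loop could drive either GPUs or CPU worker threads.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Hypothetical generalization: a Place names either a GPU ordinal or a
    // CPU worker index, so the executor need not hard-code CUDA devices.
    struct Place {
      enum class Kind { kCUDA, kCPU } kind;
      int id;  // GPU ordinal or CPU worker index
    };

    class MultiPlaceExecutor {
     public:
      explicit MultiPlaceExecutor(std::vector<Place> places)
          : places_(std::move(places)) {}

      std::size_t NumReplicas() const { return places_.size(); }

     private:
      std::vector<Place> places_;  // one worker thread per place
    };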
I am closing this PR since its alternative #9080 will be merged.
DO NOT MERGE THIS PR!
This PR will serve as a baseline for #9080. The main differences are:
In this PR, one thread is bound to one GPU. Each thread launches all Ops sequentially on the computation stream and launches AllReduce on the I/O stream; CUDA events coordinate the two streams (see the sketch after this list).
In #9080, dependency analysis is used to schedule ready Ops onto a thread pool.
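A condensed, hypothetical sketch of this PR's per-thread scheme (op execution and the all-reduce calls are elided; names are illustrative):

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // One host thread per GPU; each thread runs all ops in program order
    // on a computation stream, hands gradients to an all-reduce (I/O)
    // stream, and uses CUDA events to order the two streams.
    void RunReplica(int dev) {
      cudaSetDevice(dev);
      cudaStream_t compute, all_reduce;
      cudaStreamCreate(&compute);
      cudaStreamCreate(&all_reduce);

      // for each op, in program order:
      //   launch the op's kernels on `compute`
      //   if the op produced a gradient:
      //     cudaEventRecord(grad_ready, compute);
      //     cudaStreamWaitEvent(all_reduce, grad_ready, 0);
      //     enqueue the all-reduce on `all_reduce`
      //   before an op (e.g. SGD) consumes an all-reduced gradient:
      //     cudaEventRecord(reduced, all_reduce);
      //     cudaStreamWaitEvent(compute, reduced, 0);

      cudaStreamSynchronize(compute);
      cudaStreamSynchronize(all_reduce);
      cudaStreamDestroy(compute);
      cudaStreamDestroy(all_reduce);
    }

    int main() {
      int n = 0;
      cudaGetDeviceCount(&n);
      std::vector<std::thread> workers;
      for (int dev = 0; dev < n; ++dev) workers.emplace_back(RunReplica, dev);
      for (auto& t : workers) t.join();
      return 0;
    }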
machine: 250
test_script: test_parallel_executor.py in this PR
test_command:
CUDA_VISIBLE_DEVICES=3 python -m unittest test_parallel_executor.TestResnet
CUDA_VISIBLE_DEVICES=3,4,5,6 python -m unittest test_parallel_executor.TestResnet
model: SE_ResNeXt152
batch_size: 16 per GPU
model size: 1382651904 bytes
peak memory: 7351879168 bytes
1 GPU: 20.7775 instances per second
4 GPUs: 60.8804 instances per second