
[Speed] feature/ParallelExecutor #8891

Closed

Conversation


@tonyyang-svail (Author) commented Mar 8, 2018:

Fixes #8592.

Profiling results:
Script: the example in this PR.
Commands:

CUDA_VISIBLE_DEVICES=0       nvprof -f -o one.nvvp  python parallel_executor_example.py --batch_size=32
CUDA_VISIBLE_DEVICES=0,1,2,3 nvprof -f -o four.nvvp python parallel_executor_example.py --batch_size=32
| Setting | copy weights | forward and backward | merge gradient | apply gradient |
|---------|--------------|----------------------|----------------|----------------|
| 1 GPU   | with nccl    | 250                  | on bp          | 5              |
| 4 GPUs  | with nccl    | 750 (AllReduce takes about 63%) | on bp | 5    |

Save Model (to be implemented)

In the current implementation, ParallelExecutor's constructor creates a base scope and n sub scopes (where n equals the number of GPUs), and the model is replicated in each sub scope. The save-model function cannot access the sub scopes created by the ParallelExecutor.

Proposed Solution

ParallelExecutor's constructor creates only n-1 sub scopes. ParallelExecutor.run takes a scope parameter, which is attached as the remaining sub scope. This way, the user can create a scope, pass it to ParallelExecutor.run, and also use it to save the model (see the sketch below).
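A minimal sketch of that flow, assuming the proposed `scope=` argument on `ParallelExecutor.run`; `user_scope` and the save block at the end are illustrative assumptions, not part of this PR's diff:

```python
import paddle.fluid as fluid

user_scope = fluid.core.Scope()                # scope owned by the user

# The constructor creates n-1 internal sub scopes.
exe = fluid.ParallelExecutor(gpu_list=[0, 1])

# The user's scope is attached as the remaining sub scope, so one full
# model replica ends up living in user_scope.
exe.run(fluid.default_startup_program(), scope=user_scope)
exe.run(fluid.default_main_program(), scope=user_scope)

# Saving can then go through the regular single-device path against the
# same scope (shown here only as an assumption of how it would be wired).
with fluid.scope_guard(user_scope):
    fluid.io.save_persistables(fluid.Executor(fluid.CUDAPlace(0)), "./model_dir")
```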

conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

# drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)

Contributor:
Could you remove the commented code?

print_lock = Lock()


def save_print(*args, **kwargs):

Contributor:
Are save_print and pretty_id_indent necessary? They are not used in this PR.

variables. For example

1. NCCL communicator
1. Data reader(?)

Contributor:
Do we need the "?"? If so, it probably needs to be explained.

@tonyyang-svail changed the title from "[WIP] Parallel executor" to "[Speed] feature/ParallelExecutor" on Mar 12, 2018
```python
cost = your_neural_network()

opt = fluid.optimizer.SGDOptimizer(..., append_all_reduce=True)
```

@panyx0718 (Contributor) commented Mar 13, 2018:
What if append_all_reduce=True is not set? Will it use another way to send gradients between devices, or just raise an exception?

Author:
Currently, it's not handled.
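For illustration of the "raise an exception" option only, not anything in this PR (the constructor signature and the check below are assumptions), a guard could fail fast like this:

```python
class ParallelExecutor(object):
    def __init__(self, gpu_list, optimizer=None):
        # Hypothetical check: refuse to run multi-GPU training with an
        # optimizer that never merges gradients across devices.
        if optimizer is not None and not getattr(optimizer, "append_all_reduce", False):
            raise ValueError(
                "ParallelExecutor needs an optimizer created with "
                "append_all_reduce=True so gradients are all-reduced across GPUs.")
        self.gpu_list = gpu_list
```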

-    def __init__(self, learning_rate, regularization=None):
+    def __init__(self,
+                 learning_rate,
+                 global_step=None,

Contributor:
You might want to put global_step after the existing arguments to avoid breaking existing users?
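For illustration (this is not the ordering in the PR's diff), appending the new arguments after the pre-existing ones keeps positional callers such as `Optimizer(lr, my_regularizer)` working:

```python
def __init__(self,
             learning_rate,
             regularization=None,   # pre-existing arguments keep their positions
             global_step=None,      # new arguments are appended afterwards
             append_all_reduce=False):
    ...
```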

@@ -35,7 +36,11 @@ class Optimizer(object):
     but need to use one of it's implementation.
     """
 
-    def __init__(self, learning_rate, regularization=None):
+    def __init__(self,

Contributor:
Could you add comments for the arguments?
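One possible shape for those comments, purely as a sketch (the argument descriptions are assumptions based on this PR's description, not text from the diff):

```python
def __init__(self,
             learning_rate,
             global_step=None,
             regularization=None,
             append_all_reduce=False):
    """Base class for optimizers.

    Args:
        learning_rate: a float value or a Variable used as the learning rate.
        global_step: optional Variable incremented once per optimization step.
        regularization: optional regularizer applied to the parameters.
        append_all_reduce: if True, append NCCL AllReduce ops so that
            gradients are merged across GPUs before being applied.
    """
```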

@@ -53,6 +58,7 @@ def __init__(self, learning_rate, regularization=None):
         # {accum_name : { paramter_name : accumulator_for_parameter, ...}, ...}
         self._accumulators = defaultdict(lambda: dict())
         self.helper = None
+        self.append_all_reduce = append_all_reduce

Contributor:
why is this a public attribute?

Author:
It should be private.
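The follow-up would presumably just rename the attribute, e.g.:

```python
# Leading underscore marks the flag as an internal detail of the optimizer.
self._append_all_reduce = append_all_reduce
```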

        core.init_nccl_com(self.scope, gpu_list)

    def run(self,
            program=None,

Contributor:
Is it possible to run part of the program on multiple GPUs and keep other parts on a single GPU or the CPU?

Author:
I don't think so.


    def run(self,
            program=None,
            feed=None,

@panyx0718 (Contributor) commented Mar 13, 2018:
In the design doc, you said feed won't be exposed?

                 learning_rate,
                 global_step=None,
                 regularization=None,
                 append_all_reduce=False):

Contributor:
This feels like an implementation detail. Can we avoid exposing it to the user?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
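One way to keep the flag internal, sketched here as an assumption rather than a change in this PR: let the multi-GPU entry point build the optimization pass itself, so user code never passes `append_all_reduce` (the `parallel_minimize` helper below is hypothetical).

```python
import paddle.fluid as fluid

def parallel_minimize(cost, learning_rate):
    # Hypothetical helper owned by the ParallelExecutor machinery; the
    # flag is set here instead of by user code.
    opt = fluid.optimizer.SGDOptimizer(learning_rate=learning_rate,
                                       append_all_reduce=True)
    return opt.minimize(cost)
```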

```python
exe = fluid.ParallelExecutor(gpu_list=[0, 1])
```

## Design

Contributor:
Could you briefly describe the plan for model save/restore?

        q = Queue(maxsize=len(self.executors))
        for idx, exe in enumerate(self.executors):
            cur_scope = self.scopes[idx]
            t = Thread(

Contributor:
I'm a little worried about doing multi-threading in Python for such important computation, but maybe I'm wrong.

Contributor:
I think we need an Executor written in C++ and wrapped in Python, so that the threads can use multiple CPU cores.

Author:
@typhoonzero Right. A design for a C++ ParallelExecutor will be submitted in a separate PR.
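For context, the per-device dispatch under discussion looks roughly like the simplified sketch below (not the PR's exact code; `executors` and `scopes` are assumed to be per-device `fluid.Executor` and `Scope` lists, and the scaling concern is whether the C++ run call holds the GIL):

```python
from threading import Thread

def run_replicas(executors, scopes, program):
    # One Python thread per device; the heavy lifting happens inside the
    # C++ executor that each thread calls into.
    results = [None] * len(executors)

    def worker(i):
        results[i] = executors[i].run(program, scope=scopes[i])

    threads = [Thread(target=worker, args=(i,)) for i in range(len(executors))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```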

    gpu_list=range(fluid.core.get_cuda_device_count()))

# Parameter initialization
exe.run(fluid.default_startup_program())

@panyx0718 (Contributor) commented Mar 13, 2018:
So, parameter initialization and other startup computations are also done in parallel N times?

Author:
Yes.

@tonyyang-svail (Author):

A C++ implementation of ParallelExecutor will be submitted at #9035, because it would be relatively easier to adopt multi-stream execution at the C++ level.

This PR will be closed.

@chengduoZH added the parallel_exe (parallel executor) label on Apr 6, 2018
Successfully merging this pull request may close #8592 ("The problem of improving the performance of Parallel_Do").