[Speed] feature/ParallelExecutor #8891
Conversation
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])

# drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
Could you remove the commented code?
print_lock = Lock()


def save_print(*args, **kwargs):
Are `save_print` and `pretty_id_indent` necessary? They are not used in this PR.
doc/design/parallel_executor.md
variables. For example

1. NCCL communicator
1. Data reader(?)
Do we need the "?"? If so, we probably need to explain it.
```python
cost = your_neural_network()

opt = fluid.optimizer.SGDOptimizer(..., append_all_reduce=True)
```
What if `append_all_reduce=True` is not set? Will it use another way to send gradients between devices, or just raise an exception?
Currently, it's not handled.
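A minimal sketch of the kind of fail-fast handling being discussed; the `append_all_reduce` attribute name comes from this PR's diff, while the guard itself is only a suggestion, not existing code:

```python
import paddle.fluid as fluid

# `append_all_reduce` is the flag added in this PR's diff.
opt = fluid.optimizer.SGDOptimizer(learning_rate=0.01, append_all_reduce=True)

# Suggested guard (not part of this PR): fail loudly if the flag was left at
# its default, instead of silently training without gradient exchange.
if not getattr(opt, 'append_all_reduce', False):
    raise ValueError("append_all_reduce=True is required when training "
                     "with ParallelExecutor")
```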
def __init__(self, learning_rate, regularization=None):
def __init__(self,
             learning_rate,
             global_step=None,
You might want to put `global_step` after the existing parameters to avoid breaking existing users?
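A sketch of the backward-compatible ordering being suggested, assuming callers may pass `learning_rate` and `regularization` positionally; the parameter names come from this PR's diff:

```python
def __init__(self,
             learning_rate,
             regularization=None,        # existing positional order preserved
             global_step=None,           # new parameters appended at the end,
             append_all_reduce=False):   # so old positional call sites keep working
    ...
```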
@@ -35,7 +36,11 @@ class Optimizer(object):
    but need to use one of it's implementation.
    """

    def __init__(self, learning_rate, regularization=None):
    def __init__(self,
Could you add comments for the arguments?
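A sketch of the kind of argument documentation being requested; the descriptions are inferred from the parameter names in the diff rather than taken from the code:

```python
def __init__(self,
             learning_rate,
             global_step=None,
             regularization=None,
             append_all_reduce=False):
    """Base class of optimizers.

    Args:
        learning_rate: a float or a Variable holding the learning rate.
        global_step: optional Variable incremented once per optimization step,
            typically used by learning-rate schedulers.
        regularization: optional regularizer applied to the parameters.
        append_all_reduce: whether to append all-reduce ops after the backward
            pass so that gradients are synchronized across devices.
    """
```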
@@ -53,6 +58,7 @@ def __init__(self, learning_rate, regularization=None):
    # {accum_name : { paramter_name : accumulator_for_parameter, ...}, ...}
    self._accumulators = defaultdict(lambda: dict())
    self.helper = None
    self.append_all_reduce = append_all_reduce
Why is this a public attribute?
It should be private.
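A one-line sketch of the agreed change, using a leading underscore to mark the flag as internal:

```python
self._append_all_reduce = append_all_reduce  # leading underscore: implementation detail
```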
core.init_nccl_com(self.scope, gpu_list)

def run(self,
        program=None,
Is it possible to run part of the program on multiple GPUs and keep the rest on a single GPU or the CPU?
I don't think so.
def run(self,
        program=None,
        feed=None,
In the design doc, you said feed won't be exposed?
learning_rate,
global_step=None,
regularization=None,
append_all_reduce=False):
This feels like an implementation detail. Can we avoid exposing it to the user?
Sure.
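A hypothetical sketch of the direction agreed on here: the user builds an ordinary optimizer, and the parallel path (this PR's `ParallelExecutor`, or the C++ rewrite in #9035) inserts the all-reduce operators itself; none of this is the final API.

```python
import paddle.fluid as fluid

cost = your_neural_network()                            # placeholder from the design doc
opt = fluid.optimizer.SGDOptimizer(learning_rate=0.01)  # no append_all_reduce flag exposed
opt.minimize(cost)

# All-reduce would be appended internally when the parallel executor is built.
exe = fluid.ParallelExecutor(gpu_list=[0, 1])
```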
```python
exe = fluid.ParallelExecutor(gpu_list=[0, 1])
```

## Design
Briefly describe the plan for model save/restore?
q = Queue(maxsize=len(self.executors))
for idx, exe in enumerate(self.executors):
    cur_scope = self.scopes[idx]
    t = Thread(
I'm a little worried about doing multi-threading in Python for such important computation, but maybe I'm wrong.
I think we need an `Executor` written in C++ and wrapped in Python, so that the threads can use multiple CPU cores.
@typhoonzero Right. A design of C++ `ParallelExecutor` will be submitted in a separate PR.
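The snippet above dispatches one thread per device from Python. A small self-contained illustration of the concern (not PaddlePaddle code): Python threads only run in parallel while the heavy work is inside a C/C++ call that releases the GIL, which is why the executor itself needs to live in C++.

```python
try:
    from Queue import Queue   # Python 2, as used by the code base at the time
except ImportError:
    from queue import Queue   # Python 3
from threading import Thread

def run_on_device(device_id, results):
    # In the real ParallelExecutor this slot would call the C++ executor's run,
    # which can release the GIL; pure-Python work here serializes on the GIL.
    results.put((device_id, "done"))

results = Queue()
threads = [Thread(target=run_on_device, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.get() for _ in range(4)))
```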
    gpu_list=range(fluid.core.get_cuda_device_count()))

# Parameter initialization
exe.run(fluid.default_startup_program())
So, parameter initialization and other startup computations are also done in parallel N times?
Yes.
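A sketch of what "N times" means here, based on the constructor description later in this thread (a base scope plus one sub scope per GPU); the scope attribute names and the inner executor are illustrative, not the actual fields.

```python
# Illustrative only: the same startup program is executed once per device scope,
# so every replica ends up with its own initialized copy of the parameters.
startup = fluid.default_startup_program()
for scope in [exe.scope] + exe.sub_scopes:   # hypothetical attribute names
    exe.executor.run(startup, scope=scope)   # hypothetical inner executor
```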
A C++ implementation of `ParallelExecutor` will be submitted at #9035, and this PR will be closed, because it is relatively easier to adapt multi-stream execution at the C++ level.
fix #8592
Profiling result:
Script: the example in this PR
Command:
Save Model (to be implemented)
In the current implementation, the ParallelExecutor's constructor creates a base scope and n sub scopes (n equals the number of GPUs), and the model is replicated in each sub scope. The save-model function cannot access the sub scopes created by the ParallelExecutor.
Proposed Solution
The ParallelExecutor's constructor creates n-1 sub scopes. `ParallelExecutor.run` will take a scope parameter, which will be attached as another sub scope. In this way, the user can create a scope and use it for `ParallelExecutor.run` as well as for saving the model.
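A sketch of how the proposed solution could look from the user's side; the `scope` argument to `ParallelExecutor.run` is the one proposed above, while `save_model` is a hypothetical stand-in for whatever save routine is eventually used.

```python
import paddle.fluid as fluid

user_scope = fluid.core.Scope()                   # created and owned by the user

pexe = fluid.ParallelExecutor(gpu_list=[0, 1])    # internally creates n-1 sub scopes
pexe.run(fluid.default_main_program(),
         scope=user_scope)                        # attached as the n-th sub scope

# The parameters now live in a scope the user can see, so that same scope can be
# handed to the save path instead of the executor's hidden sub scopes.
save_model(fluid.default_main_program(), user_scope)   # hypothetical helper
```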