Implement fluid API using python with guard. #6508

Closed · 3 of 4 tasks
typhoonzero opened this issue Dec 12, 2017 · 6 comments

typhoonzero commented Dec 12, 2017

  • Change the current implementation to listen_and_serv, send, and recv op implementations.
  • Implement a Python API with a guard (with statement) for listen_and_serv.
  • Build a sample program using the Python with-statement APIs.
  • Update the documentation accordingly.

According to https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/concurrent_programming.md#the-worker-program, we need to implement an API that looks like the following:

Server side:

loss = define_model()
server = fluid.listen_and_serv()
with server.do():
    opt = fluid.optimizer.Adam()
    opt.minimize(loss)
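
To clarify what the guard means in Python terms, here is a minimal, self-contained sketch of the idea; ListenAndServ, _current_block and append_op here are hypothetical stand-ins, not the actual Fluid API. Entering the with block redirects subsequently declared ops into a server-side sub-program, and leaving it restores the enclosing one.

import contextlib

# Minimal sketch with hypothetical names (not the real Fluid API): ops appended
# inside `with server.do():` land in the server-side block, not the main block.
_current_block = []                 # stand-in for the "current program/block"

class ListenAndServ(object):
    def __init__(self):
        self.server_block = []      # stand-in for the server-side sub-program

    @contextlib.contextmanager
    def do(self):
        global _current_block
        outer, _current_block = _current_block, self.server_block
        try:
            yield
        finally:
            _current_block = outer  # restore the enclosing block on exit

def append_op(op):                  # stand-in for a layer call, e.g. opt.minimize(loss)
    _current_block.append(op)

server = ListenAndServ()
with server.do():
    append_op("sgd_update")         # recorded into server.server_block

A real implementation would switch Fluid's current Program/Block rather than a Python list, but the control flow of the guard is the same.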

Worker side:

loss = define_model()
params, grads = fluid.append_backward(loss)
splited = layers.split(params, grads)
with fluid.parallel_for(len(splited)) as iter:
    layers.send(splited["grad"][iter.idx])
with fluid.parallel_for(len(splited)) as iter:
    layers.recv(splited["param"][iter.idx])
layers.concat(splited["param"])
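
As a rough illustration of the intended parallel_for semantics (and of the ordering question raised below), here is a plain-Python analogue using threads; send_grad and recv_param are stand-ins for layers.send and layers.recv, which would really run as ops inside the Fluid program:

import threading

splited = {"grad": ["g0", "g1"], "param": ["p0", "p1"]}   # toy stand-in for the split vars

def run_parallel(fn, n):
    # Launch one worker per slice and wait for all of them: a rough analogue
    # of `with fluid.parallel_for(n) as iter:` over iter.idx in [0, n).
    threads = [threading.Thread(target=fn, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def send_grad(idx):
    print("send", splited["grad"][idx])        # stand-in for layers.send(...)

def recv_param(idx):
    print("recv", splited["param"][idx])       # stand-in for layers.recv(...)

run_parallel(send_grad, len(splited["grad"]))    # first: send every gradient slice
run_parallel(recv_param, len(splited["param"]))  # then: receive every updated parameter

Joining all the send workers before starting any recv worker gives the "recv after all sends" ordering discussed later in this thread.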

If we use the CSP model, the server side may look like:

loss = define_model()
params, grads = fluid.append_backward(loss)
param_ch = fluid.make_chan()
param_recved_ch = fluid.make_chan()
grad_ch = fluid.make_chan()
layers.split_to_chan(params, param_ch)
layers.split_to_chan(grads, grad_ch)

with fluid.go():
    layers.send(grad_ch)
with fluid.go():
    updated_param = layers.recv(param_ch)
    param_recved_ch.push(updated_param)
layers.concat(param_recved_ch)
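
For illustration only, the CSP-style primitives map naturally onto queues and background threads. The sketch below uses queue.Queue as a stand-in for fluid.make_chan and a thread as a stand-in for fluid.go; everything on the Fluid side remains the proposed design above, not an existing API.

import queue
import threading

grad_ch = queue.Queue()          # ~ fluid.make_chan()
param_ch = queue.Queue()
param_recved_ch = queue.Queue()

def go(fn):
    # ~ fluid.go(): run fn concurrently, goroutine-style
    t = threading.Thread(target=fn)
    t.start()
    return t

def send_grads():
    while True:
        grad = grad_ch.get()
        if grad is None:                     # sentinel: no more gradients to send
            break
        param_ch.put("updated_" + grad)      # stand-in for the server round-trip

def recv_params():
    for _ in range(2):
        param_recved_ch.put(param_ch.get())

workers = [go(send_grads), go(recv_params)]
for g in ["g0", "g1", None]:
    grad_ch.put(g)
for t in workers:
    t.join()
print([param_recved_ch.get() for _ in range(2)])   # the concat step would merge these
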
gongweibao self-assigned this on Jan 11, 2018
typhoonzero changed the title from "Use while op inside recv_op as the event loop." to "Implement fluid API using python with guard." on Jan 23, 2018

Yancey1989 commented Jan 23, 2018

For the trainer side, I think the order of send/recv should be:

with fluid.parallel_for(len(splited)) as iter:
    layers.send(splited["grad"][iter.idx])
with fluid.parallel_for(len(splited)) as iter:
    layers.recv(splited["param"][iter.idx])

We need to execute Recv after all variables are sent.

On the other hand, I saw that #7706 also lists the send/recv Op; shall we execute the Send/Recv Op in a goroutine?

typhoonzero self-assigned this on Jan 23, 2018
This was referenced on Jan 23, 2018

helinwang commented:
"If we are using CSP model, the server side may look like:", do you mean the worker side?

I think layers.send(grad_ch) should be something like layers.send(grad_ch.recv()); this way the send is still sending a variable, not a channel.

layers.recv(param_ch) could be param_ch.send(layers.recv())
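
Applying that suggestion to the earlier snippet, the channel-based part might read roughly like this (still design pseudocode, not a committed API):

with fluid.go():
    layers.send(grad_ch.recv())       # send a variable taken from the channel
with fluid.go():
    param_ch.send(layers.recv())      # put the received variable into the channel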

helinwang commented:
Btw, is the Python code for illustration only? I don't think we should expose send/recv OP to the user.

typhoonzero commented Jan 24, 2018

> I think layers.send(grad_ch) should be something like: layers.send(grad_ch.recv()), in this way the send is still sending a variable, not a channel.

Good point, thank you.

> Btw, is the Python code for illustration only? I don't think we should expose send/recv OP to the user.

I think we should expose send/recv/listen_and_serv as layers to users, so that fluid can be a "real" programming language.

For example, @dzhwinter mentioned yesterday that we may need a single server to merge all the trainers' evaluation results; this could be done by using these ops as layers.
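
For illustration, such a metric-merging setup might be sketched with the same proposed layers (all names and signatures here are speculative, following the pseudocode above, not an existing API):

# Trainer side: ship the locally computed metric to the evaluation server.
layers.send(local_metric)

# Evaluation-server side: collect the metrics from all trainers and merge them.
server = fluid.listen_and_serv()
with server.do():
    merged_metric = layers.mean(recved_metrics)   # recved_metrics: speculative name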

helinwang commented Jan 24, 2018

I see, thanks. Does the user really care about merging all the trainers' evaluation (maybe trainer-id==0's local evaluation suffices) and about writing the Python code manually? I am a little worried that the Python binding part becomes something that no one actually uses, but that's just my 2 cents.

typhoonzero commented:
People may need to define their own distributed network, like in #7671. And, yes, we'd like users to use transpilers for simplicity in most cases. If we have those layers, we can also use them in the transpiler to simplify the transpiler's implementation.
