fix MultiGradientMachine train and infer #2595
Conversation
Just to confirm, will SWIG call the C++ destructor when a Python object is released?
@@ -326,12 +332,6 @@ void MultiGradientMachine::onPassEnd() {
  }
}

void MultiGradientMachine::finish() {
Why remove finish()?
After moving thread->stop() to the destructor, finish() seems to have no use.
I have added finish() back.
@@ -171,6 +171,12 @@ MultiGradientMachine::MultiGradientMachine(const ModelConfig& config,
  }
}

MultiGradientMachine::~MultiGradientMachine() {
Maybe in the destructor we should check whether MultiGradientMachine is finished or not? If MultiGradientMachine is not finished, then invoke this->finish() (a sketch of this idea follows below).
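A minimal sketch of that suggestion, assuming a hypothetical finished_ flag and threads_ member (neither name is taken from the actual source):

#include <memory>
#include <vector>

// Hypothetical sketch only: guard finish() with a flag and call it from the
// destructor if the caller never did. Member and type names (finished_,
// threads_, WorkerThread) are illustrative, not Paddle's real API.
class MultiGradientMachineSketch {
public:
  ~MultiGradientMachineSketch() {
    if (!finished_) {
      finish();  // make sure worker threads are stopped exactly once
    }
  }

  void finish() {
    if (finished_) return;
    finished_ = true;
    for (auto& thread : threads_) {
      thread->stop();  // stop/join each worker thread
    }
  }

private:
  // Stand-in for Paddle's worker thread type; only stop() matters here.
  struct WorkerThread {
    void stop() { /* join the underlying thread */ }
  };

  bool finished_ = false;
  std::vector<std::unique_ptr<WorkerThread>> threads_;
};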
@typhoonzero Not sure, I will check it.
fix: #2534 #2565
Problem:
When training or inferring with the Python v2 API, if trainer_count > 1 and trainer.train or inferer.infer is called multiple times, the process hangs.
Reason:
When trainer_count > 1, Paddle uses MultiGradientMachine and starts multiple worker threads to do the forward/backward work (the number of threads is trainer_count).
In the v2 Python API, the trainer or inferer calls gradient_machine.finish() after train/infer, which stops the worker threads. When trainer.train or inferer.infer is called a second time, there are no worker threads left to handle the task, so the process hangs.
Fix:
Do not close the worker threads when gradientMachine.finish() is called; close them in the destructor of gradientMachine.
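A minimal sketch of this fix, with hypothetical member and type names (threads_, WorkerThread) standing in for the real ones:

#include <memory>
#include <vector>

// Hypothetical sketch of the fix: finish() no longer stops the worker
// threads, so a second trainer.train()/inferer.infer() call still has live
// threads to dispatch work to; the threads are stopped only when the
// gradient machine itself is destroyed. Names are illustrative.
class GradientMachineSketch {
public:
  ~GradientMachineSketch() {
    for (auto& thread : threads_) {
      thread->stop();  // worker threads live as long as the machine itself
    }
  }

  void finish() {
    // Intentionally no thread->stop() here: repeated train/infer calls from
    // the Python v2 API can keep reusing the same worker threads.
  }

private:
  // Stand-in for Paddle's worker thread type; only stop() matters here.
  struct WorkerThread {
    void stop() { /* join the underlying thread */ }
  };

  std::vector<std::unique_ptr<WorkerThread>> threads_;
};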