fix MultiGradientMachine train and infer #2595
Conversation
Just to confirm, will SWIG call the C++ destructor when a Python object is released?
@@ -326,12 +332,6 @@ void MultiGradientMachine::onPassEnd() {
  }
}

void MultiGradientMachine::finish() {
Why remove finish()?
After moving thread->stop() to the destructor, finish() seems to have no use.
I have added finish() back.
@@ -171,6 +171,12 @@ MultiGradientMachine::MultiGradientMachine(const ModelConfig& config,
  }
}

MultiGradientMachine::~MultiGradientMachine() {
Maybe in the destructor we should check whether MultiGradientMachine is finished or not? If MultiGradientMachine is not finished, then invoke this->finish() (a sketch of this idea follows below).
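A minimal sketch of that suggestion, assuming a hypothetical finished_ flag and threads_ member (neither name is taken from the actual source):

#include <memory>
#include <vector>

// Hypothetical sketch only: guard finish() with a flag and call it from the
// destructor if the caller never did. Member and type names (finished_,
// threads_, WorkerThread) are illustrative, not Paddle's real API.
class MultiGradientMachineSketch {
public:
  ~MultiGradientMachineSketch() {
    if (!finished_) {
      finish();  // make sure worker threads are stopped exactly once
    }
  }

  void finish() {
    if (finished_) return;
    finished_ = true;
    for (auto& thread : threads_) {
      thread->stop();  // stop/join each worker thread
    }
  }

private:
  // Stand-in for Paddle's worker thread type; only stop() matters here.
  struct WorkerThread {
    void stop() { /* join the underlying thread */ }
  };

  bool finished_ = false;
  std::vector<std::unique_ptr<WorkerThread>> threads_;
};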
@typhoonzero Not sure, I will check it.
fix: #2534 #2565
Problem:
When training or inferring with the Python v2 API, if trainer_count > 1 and trainer.train or inferer.infer is called multiple times, the process hangs.
Reason:
When trainer_count > 1, Paddle uses MultiGradientMachine and starts multiple worker threads to do the forward/backward work (the number of threads is trainer_count).
In the v2 Python API, the trainer or inferer calls gradient_machine.finish() after train/infer, which stops the worker threads. When trainer.train or inferer.infer is called a second time, there are no worker threads left to handle the task, so the process hangs.
Fix:
Do not close the worker threads when gradientMachine.finish() is called; close them in the destructor of gradientMachine.
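A minimal sketch of this fix, with hypothetical member and type names (threads_, WorkerThread) standing in for the real ones:

#include <memory>
#include <vector>

// Hypothetical sketch of the fix: finish() no longer stops the worker
// threads, so a second trainer.train()/inferer.infer() call still has live
// threads to dispatch work to; the threads are stopped only when the
// gradient machine itself is destroyed. Names are illustrative.
class GradientMachineSketch {
public:
  ~GradientMachineSketch() {
    for (auto& thread : threads_) {
      thread->stop();  // worker threads live as long as the machine itself
    }
  }

  void finish() {
    // Intentionally no thread->stop() here: repeated train/infer calls from
    // the Python v2 API can keep reusing the same worker threads.
  }

private:
  // Stand-in for Paddle's worker thread type; only stop() matters here.
  struct WorkerThread {
    void stop() { /* join the underlying thread */ }
  };

  std::vector<std::unique_ptr<WorkerThread>> threads_;
};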