add trainer.stop and fix a bug for train_by_parallel_executor #10762
Conversation
'''
if float(test_metrics[0]) < 20.0:
if isinstance(event, fluid.EndStepEvent):
    if event.step == 10:
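The quoted test fragment stops training early once a step threshold is hit. A minimal self-contained sketch of the same pattern (the `Trainer` and `EndStepEvent` classes below are simplified stand-ins for fluid's, and the step threshold is just illustrative):

```python
# Sketch of an early-stopping event handler, using hypothetical
# stand-ins for fluid's Trainer and EndStepEvent.
class EndStepEvent:
    def __init__(self, epoch, step, metrics):
        self.epoch, self.step, self.metrics = epoch, step, metrics

class Trainer:
    def __init__(self):
        self._stopped = False

    def stop(self):
        # The new API this PR adds: request that training halt.
        self._stopped = True

    def train(self, num_epochs, steps_per_epoch, event_handler):
        for epoch in range(num_epochs):
            for step in range(steps_per_epoch):
                event_handler(EndStepEvent(epoch, step, [42.0]))
                if self._stopped:
                    return epoch, step  # where training stopped

trainer = Trainer()

def event_handler(event):
    # Stop training once step 10 of any epoch is reached.
    if isinstance(event, EndStepEvent) and event.step == 10:
        trainer.stop()

print(trainer.train(num_epochs=2, steps_per_epoch=100,
                    event_handler=event_handler))  # (0, 10)
```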
Is this step counted per epoch, or is it more like an iteration count (across different mini-batches over all epochs)? (I ask because we have both epoch_id and step_id in the trainer code.)
Paddle/python/paddle/fluid/trainer.py
Lines 289 to 302 in 54ae8e4
for epoch_id in range(num_epochs):
    event_handler(BeginEpochEvent(epoch_id))
    for step_id, data in enumerate(reader()):
        begin_event = BeginStepEvent(epoch_id, step_id)
        event_handler(begin_event)
        if begin_event.fetch_metrics:
            metrics = exe.run(feed=data,
                              fetch_list=[
                                  var.name
                                  for var in self.train_func_outputs
                              ])
        else:
            metrics = exe.run(feed=data, fetch_list=[])
        event_handler(EndStepEvent(epoch_id, step_id, metrics))
From the code we can see that step_id is independent for each epoch, but we can also get the epoch id from EndStepEvent.epoch_id.
Yeah, actually I was just confused about the naming. I was thinking that event.step should represent iterations, and an iteration count should never reset back to 0. But in this case event.step is based on step_id.
However, step_id runs from 0 to the number of mini-batches, and after each epoch it is reset to 0.
So basically the point was that we could have a separate naming convention :)
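To illustrate the naming concern, here is a minimal sketch (the EndStepEvent class and steps_per_epoch parameter are simplified assumptions, not fluid's actual API) that derives a never-resetting global iteration count from the per-epoch (epoch_id, step_id) pair:

```python
# Hypothetical sketch: building a monotonically increasing iteration
# counter from (epoch_id, step_id), where step_id resets every epoch.
class EndStepEvent:
    def __init__(self, epoch_id, step_id):
        self.epoch = epoch_id
        self.step = step_id  # per-epoch index, resets to 0 each epoch

def global_iteration(event, steps_per_epoch):
    # A global "iteration" never counts back to 0 across epochs.
    return event.epoch * steps_per_epoch + event.step

e = EndStepEvent(epoch_id=2, step_id=3)
print(global_iteration(e, steps_per_epoch=10))  # 23
```

This keeps both notions available: event.step for the position within the current epoch, and a derived iteration count when a monotone counter is wanted.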
-        for epoch_id in range(num_epochs):
-            self._train_by_any_executor(event_handler, pe, num_epochs,
-                                        reader)
+        self._train_by_any_executor(event_handler, pe, num_epochs, reader)
So what is the reason for getting rid of the for loop? Just curious 😁
This is a bug: we already iterate over num_epochs inside _train_by_any_executor.
Paddle/python/paddle/fluid/trainer.py
Lines 288 to 291 in 54ae8e4
def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
    for epoch_id in range(num_epochs):
        event_handler(BeginEpochEvent(epoch_id))
        for step_id, data in enumerate(reader()):
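The effect of the bug can be shown with a minimal sketch (hypothetical helper names): wrapping a call to a function that already loops over num_epochs in an outer epoch loop runs num_epochs * num_epochs epochs instead of num_epochs:

```python
# Minimal sketch of the double-loop bug (hypothetical helper names).
def train_by_any_executor(num_epochs, epochs_run):
    # This helper already iterates over every epoch itself,
    # mirroring _train_by_any_executor in trainer.py.
    for epoch_id in range(num_epochs):
        epochs_run.append(epoch_id)

def buggy_caller(num_epochs):
    epochs_run = []
    # Bug: the outer loop multiplies the work by num_epochs.
    for _ in range(num_epochs):
        train_by_any_executor(num_epochs, epochs_run)
    return len(epochs_run)  # num_epochs ** 2 epochs executed

def fixed_caller(num_epochs):
    epochs_run = []
    # Fix: call the helper once; it owns the epoch loop.
    train_by_any_executor(num_epochs, epochs_run)
    return len(epochs_run)  # exactly num_epochs

print(buggy_caller(3))  # 9
print(fixed_caller(3))  # 3
```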
👍
fix: #10754