
add trainer.stop and fix a bug for train_by_parallel_executor #10762

Merged — 1 commit merged into PaddlePaddle:develop on May 18, 2018

Conversation

@jacquesqiao (Member) commented on May 18, 2018

fix: #10754

The snippet in the PR description is truncated in this page; the fragments reconstruct to an event handler along these lines (bodies elided in the original):

```python
if isinstance(event, fluid.EndStepEvent):
    if event.step == 10:
        ...
    if float(test_metrics[0]) < 20.0:
        ...
```
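Since the original snippet is truncated, here is a self-contained sketch of the stop mechanism this PR adds. The `Trainer` and `EndStepEvent` classes below are simplified stand-ins for the fluid API, and the handler signature is illustrative (in fluid the handler typically receives only the event, with the trainer captured by closure):

```python
# Simplified stand-ins for fluid's Trainer and EndStepEvent, sketching
# the trainer.stop() mechanism this PR adds; names are illustrative.
class EndStepEvent:
    def __init__(self, epoch_id, step_id, metrics):
        self.epoch = epoch_id
        self.step = step_id
        self.metrics = metrics

class Trainer:
    def __init__(self):
        self._stop = False

    def stop(self):
        # called from the event handler to request early termination
        self._stop = True

    def train(self, num_epochs, reader, event_handler):
        for epoch_id in range(num_epochs):
            for step_id, data in enumerate(reader()):
                metrics = [float(data[0])]  # stand-in for exe.run(...)
                event_handler(self, EndStepEvent(epoch_id, step_id, metrics))
                if self._stop:
                    return  # leave the loop as soon as stop() was called
```

A handler can then call `trainer.stop()` once the metric crosses a threshold, which is exactly the behavior issue #10754 asks for.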
@sidgoyal78 (Contributor) commented on May 18, 2018

Is this step per epoch, or is it more like an iteration count (over mini-batches across epochs)? (Asking because we have both epoch_id and step_id in the trainer code.)

@jacquesqiao (Member, Author) commented on May 18, 2018

```python
for epoch_id in range(num_epochs):
    event_handler(BeginEpochEvent(epoch_id))
    for step_id, data in enumerate(reader()):
        begin_event = BeginStepEvent(epoch_id, step_id)
        event_handler(begin_event)
        if begin_event.fetch_metrics:
            metrics = exe.run(feed=data,
                              fetch_list=[
                                  var.name
                                  for var in self.train_func_outputs
                              ])
        else:
            metrics = exe.run(feed=data, fetch_list=[])
        event_handler(EndStepEvent(epoch_id, step_id, metrics))
```

From the code we can see that step_id restarts for each epoch, but the epoch id is also available from EndStepEvent.epoch_id.
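As an aside, the loop above also shows the `begin_event.fetch_metrics` hook: a handler can clear it on a `BeginStepEvent` so the executor skips fetching the training outputs on most steps. A minimal sketch with a simplified stand-in for the event class (not fluid itself):

```python
class BeginStepEvent:
    # simplified stand-in for fluid's BeginStepEvent
    def __init__(self, epoch_id, step_id):
        self.epoch = epoch_id
        self.step = step_id
        self.fetch_metrics = True  # fetch training outputs by default

def event_handler(event):
    if isinstance(event, BeginStepEvent):
        # only fetch metrics on every 10th step to reduce overhead
        event.fetch_metrics = (event.step % 10 == 0)

fetched = []
for step_id in range(25):
    event = BeginStepEvent(0, step_id)
    event_handler(event)
    if event.fetch_metrics:
        fetched.append(step_id)
# fetched is now [0, 10, 20]
```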

@sidgoyal78 (Contributor) commented:

Yeah, actually I was just confused by the naming. I was thinking that event.step should represent iterations, and an iteration count should probably never reset to 0. But in this case event.step is based on step_id.

However, step_id runs from 0 to the number of mini-batches, and after each epoch it is reset to 0.

So basically the point was that we could have a separate naming convention :)
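One way to get the never-resetting iteration count discussed here, without renaming anything in the API, is for the handler to keep its own counter; `GlobalStepCounter` and `global_step` below are hypothetical names, not part of the fluid API:

```python
class GlobalStepCounter:
    """Keeps a monotonically increasing iteration count across epochs,
    since step_id itself resets to 0 at the start of every epoch."""

    def __init__(self):
        self.global_step = 0

    def on_end_step(self, epoch_id, step_id):
        # epoch_id/step_id come from the EndStepEvent; only the
        # monotonically increasing count is tracked here
        self.global_step += 1
        return self.global_step
```

Calling `on_end_step` from the `EndStepEvent` branch of the handler yields 1, 2, 3, ... regardless of epoch boundaries.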

```diff
-for epoch_id in range(num_epochs):
-    self._train_by_any_executor(event_handler, pe, num_epochs,
-                                reader)
+self._train_by_any_executor(event_handler, pe, num_epochs, reader)
```
Contributor commented:

So what is the reason for getting rid of the for loop? Just curious 😁

@jacquesqiao (Member, Author) commented on May 18, 2018

This was a bug: _train_by_any_executor already iterates over num_epochs internally, so the outer loop would run every epoch num_epochs times.

```python
def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
    for epoch_id in range(num_epochs):
        event_handler(BeginEpochEvent(epoch_id))
        for step_id, data in enumerate(reader()):
```
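To make the bug concrete: with the redundant outer loop left in place, training would have run num_epochs × num_epochs epochs in total. A toy reproduction with plain counters standing in for the real trainer (no fluid involved):

```python
def train_by_any_executor(num_epochs, epochs_run):
    # stands in for Trainer._train_by_any_executor, which loops
    # over all epochs internally
    for epoch_id in range(num_epochs):
        epochs_run.append(epoch_id)

def before_fix(num_epochs):
    epochs_run = []
    for _ in range(num_epochs):  # the redundant outer loop this PR removes
        train_by_any_executor(num_epochs, epochs_run)
    return len(epochs_run)

def after_fix(num_epochs):
    epochs_run = []
    train_by_any_executor(num_epochs, epochs_run)  # called exactly once
    return len(epochs_run)

# before_fix(3) runs 9 epochs; after_fix(3) runs the intended 3
```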

Contributor: 👍

@daming-lu daming-lu merged commit eb7d875 into PaddlePaddle:develop May 18, 2018
Successfully merging this pull request may close these issues.

[Simplified API] When accuracy reaches a threshold, user should be able to terminate the training process
3 participants