Add a processing pool for the modifiers #34
Conversation
Because we don't know in which order and in which process each line is going to be processed?
There still seems to be one source of non-deterministic behaviour somewhere for me.
Note to self: somehow the chunks in the pool seem to get mixed. In the end-to-end test exactly 13 lines got shuffled around even when running with `--workers 1`.
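For what it's worth, a minimal sketch of how pooled chunks can get "mixed" (illustrative only, not the PR's actual code): with Python's multiprocessing, consuming results as they complete instead of in submission order reshuffles lines across chunk boundaries.

```python
from multiprocessing import Pool
from typing import List

def process_batch(batch: List[str]) -> List[str]:
    # Stand-in for applying the modifiers to one chunk of lines.
    return [line.upper() for line in batch]

if __name__ == '__main__':
    batches = [['a', 'b'], ['c', 'd'], ['e', 'f']]
    with Pool(4) as pool:
        # imap yields results in submission order: deterministic output.
        in_order = [line for batch in pool.imap(process_batch, batches)
                    for line in batch]
        # imap_unordered yields whichever chunk finishes first, so lines
        # from different chunks can swap places between runs.
        any_order = [line for batch in pool.imap_unordered(process_batch, batches)
                     for line in batch]
```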
Relevant for #21, but the current implementation was also causing bugs.
LGTM
```python
return '\t'.join((new_src, new_trg, format_alignments(remapped_pairs)))

def __call__(self, batch: List[str]) -> Iterable[str]:
    for line in batch:
```
Do we expect this to fail now? Or is the exception here added because exceptions no longer bubble up?
Previously the exception was caught at the Trainer level, which applied the modifiers line by line. The Trainer then logged it and skipped that line entirely.
Since we're now submitting batches to modifiers, I've changed these non-critical errors to do the same thing: just skip the line and log about it. Otherwise we'd skip the entire batch (or the chunk that was assigned to a specific worker).
I've made sure that logging and raising exceptions still work inside the ModifierWorker processes.
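A rough sketch of that skip-and-log behaviour (the class and method names here are hypothetical; the PR's actual modifier interface may differ):

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)

class Modifier:
    def apply(self, line: str) -> str:
        # Stand-in for the real per-line modification.
        return line

    def __call__(self, batch: List[str]) -> Iterable[str]:
        for line in batch:
            try:
                yield self.apply(line)
            except Exception:
                # Non-critical error: log it and skip only this line,
                # instead of losing the whole batch (or the chunk that
                # was assigned to this worker).
                logger.exception('Skipping line: %r', line)
```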
So long story short: this modification keeps the behaviour as it is now on the main branch.
LGTM
(was going to ask a stupid question about determinism and whether the workers complete out of order, but then saw you covered that)
Attempt to work around #33: `max(min(cpu_count, 8), 1)` workers, chunks of 16, which roughly makes sense with the default batch size of 100. `--batch-size` or `--chunk-size` changes the output. Changing `--workers` does not.
Todo:
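For illustration, a sketch of how those defaults might plug together, assuming Python's `multiprocessing` (the function names and wiring are hypothetical; the PR's actual code may differ):

```python
from itertools import islice
from multiprocessing import Pool, cpu_count
from typing import Callable, Iterable, Iterator, List

def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    # Cut the incoming lines into fixed-size chunks for the workers.
    it = iter(lines)
    while chunk := list(islice(it, size)):
        yield chunk

def run(lines: Iterable[str],
        process_batch: Callable[[List[str]], List[str]],
        chunk_size: int = 16) -> Iterator[str]:
    # Clamp the worker count to the range [1, 8] regardless of machine size.
    workers = max(min(cpu_count(), 8), 1)
    with Pool(workers) as pool:
        # imap preserves submission order, so the worker count alone
        # should not affect the output; chunk and batch sizes can,
        # since they move the boundaries at which lines are grouped.
        for batch in pool.imap(process_batch, chunked(lines, chunk_size)):
            yield from batch
```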