Add a processing pool for the modifiers #34

Merged: 9 commits from multiprocessing into main, Sep 13, 2023
Conversation

@jelmervdl (Contributor) commented Sep 11, 2023

Attempt to work around #33

  • Modified the modifier interface to take in batches so modifiers can benefit from batch processing. None of the modifiers implements this yet, though.
  • Run the modifier chain in multiple subprocesses in parallel (defaults to max(min(cpu_count, 8), 1) workers and chunks of 16 lines, which fits reasonably with the default batch size of 100; see the sketch after this list).
  • Changing --batch-size or --chunk-size changes the output. Changing --workers does not.
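
For illustration, here is a minimal, self-contained sketch of what the batched interface plus pool dispatch described above could look like. This is a hedged sketch, not the actual code from this PR: Modifier, UpperCaseModifier, apply_chain and chunked are hypothetical names.

    import os
    from itertools import chain
    from multiprocessing import Pool
    from typing import Iterable, List

    class Modifier:
        """Batched interface: modifiers receive whole batches of lines."""
        def __call__(self, batch: List[str]) -> Iterable[str]:
            raise NotImplementedError()

    class UpperCaseModifier(Modifier):
        """Toy modifier, purely for demonstration."""
        def __call__(self, batch: List[str]) -> Iterable[str]:
            return [line.upper() for line in batch]

    MODIFIERS: List[Modifier] = [UpperCaseModifier()]

    def apply_chain(chunk: List[str]) -> List[str]:
        """Run the full modifier chain over one chunk of lines."""
        for modifier in MODIFIERS:
            chunk = list(modifier(chunk))
        return chunk

    def chunked(lines: List[str], size: int) -> Iterable[List[str]]:
        """Split a batch into fixed-size chunks for the workers."""
        for i in range(0, len(lines), size):
            yield lines[i:i + size]

    if __name__ == '__main__':
        batch = [f'line {n}' for n in range(100)]      # default batch size: 100
        workers = max(min(os.cpu_count() or 1, 8), 1)  # at most 8, at least 1
        with Pool(workers) as pool:
            # imap (not imap_unordered) yields results in submission order,
            # which is why the number of workers cannot change the output.
            out = list(chain.from_iterable(pool.imap(apply_chain, chunked(batch, 16))))
        print(len(out))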

Todo:

  • Actually figure out how to implement a timing regression test in GitHub Actions

@jelmervdl linked an issue Sep 12, 2023 that may be closed by this pull request
@jelmervdl (Contributor, Author)

There still seems to be one source of non-deterministic behaviour somewhere; for me, test_full_enzh and test_prefix_augment fail fairly consistently.

@jelmervdl (Contributor, Author)

Note to self: somehow the chunks in the pool get mixed up. In the end-to-end test, exactly 13 lines were shuffled around even when running with --no-shuffle.
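
(For context: one classic way pool output gets mixed is collecting results in completion order rather than submission order, i.e. imap_unordered versus imap in multiprocessing. A minimal standalone reproduction of that effect, not the actual code from this PR:)

    import random
    import time
    from multiprocessing import Pool

    def work(chunk):
        # Simulate workers finishing at different speeds.
        time.sleep(random.random() / 100)
        return chunk

    if __name__ == '__main__':
        chunks = list(range(20))
        with Pool(4) as pool:
            ordered = list(pool.imap(work, chunks))
            unordered = list(pool.imap_unordered(work, chunks))
        assert ordered == chunks  # imap preserves submission order
        print(unordered)          # completion order; can differ between runs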

@jelmervdl marked this pull request as ready for review September 12, 2023 16:55
@XapaJIaMnu (Contributor) left a comment

LGTM

        return '\t'.join((new_src, new_trg, format_alignments(remapped_pairs)))

    def __call__(self, batch: List[str]) -> Iterable[str]:
        for line in batch:
Contributor

Do we expect this to fail now? Or is the exception here added because exceptions no longer bubble up?

@jelmervdl (Contributor, Author)

Previously the exception was caught at the Trainer level, which applied the modifiers line by line. The Trainer then logged it and skipped that line entirely.

Since we're now submitting batches to modifiers, I've changed these non-critical errors to do the same thing: just skip the line and log it. Otherwise we'd skip the entire batch (or the chunk that was assigned to a specific worker).

I've made sure that logging and raising exceptions still work inside the ModifierWorker processes.
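
Sketched out, per-line error handling inside a batched modifier could look like the following. This is a hedged illustration; SafeModifier, apply and the logger setup are made-up names for this sketch, not the PR's actual code:

    import logging
    from typing import Iterable, List

    logger = logging.getLogger(__name__)

    class SafeModifier:
        """Hypothetical modifier that handles errors per line, not per batch."""
        def apply(self, line: str) -> str:
            raise NotImplementedError()

        def __call__(self, batch: List[str]) -> Iterable[str]:
            for line in batch:
                try:
                    yield self.apply(line)
                except Exception as exc:
                    # Non-critical error: log and skip only this line, so one
                    # bad line doesn't discard the entire chunk assigned to a
                    # worker. Mirrors the old line-by-line Trainer behaviour.
                    logger.warning('Skipping line due to error: %s', exc)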

@jelmervdl (Contributor, Author)

So long story short: this modification keeps the behaviour as it is now on the main branch.

@graemenail (Contributor) left a comment

LGTM

(I was going to ask a stupid question about determinism and whether the workers complete out of order, but then saw you covered that.)

@jelmervdl merged commit 678deac into main Sep 13, 2023
@jelmervdl deleted the multiprocessing branch September 13, 2023 12:51
Successfully merging this pull request may close these issues: Sentence throughput regressions (#33).