Add a processing pool for the modifiers #34

Merged: 9 commits from multiprocessing into main, Sep 13, 2023
Conversation

@jelmervdl (Contributor) commented Sep 11, 2023

Attempt to work around #33

  • Modified the modifier interface to take in batches so modifiers can benefit from batch processing. None of the modifiers implements this yet, though.
  • Run the modifier chain in multiple subprocesses in parallel (defaults to max(min(cpu_count, 8), 1) workers and chunks of 16 lines, which fits reasonably with the default batch size of 100; see the sketch after this list).
  • Changing --batch-size or --chunk-size changes the output. Changing --workers does not.
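
For illustration, here is a minimal, self-contained sketch of what the batched interface plus pool dispatch described above could look like. This is a hedged sketch, not the actual code from this PR: Modifier, UpperCaseModifier, apply_chain and chunked are hypothetical names.

    import os
    from itertools import chain
    from multiprocessing import Pool
    from typing import Iterable, List

    class Modifier:
        """Batched interface: modifiers receive whole batches of lines."""
        def __call__(self, batch: List[str]) -> Iterable[str]:
            raise NotImplementedError()

    class UpperCaseModifier(Modifier):
        """Toy modifier, purely for demonstration."""
        def __call__(self, batch: List[str]) -> Iterable[str]:
            return [line.upper() for line in batch]

    MODIFIERS: List[Modifier] = [UpperCaseModifier()]

    def apply_chain(chunk: List[str]) -> List[str]:
        """Run the full modifier chain over one chunk of lines."""
        for modifier in MODIFIERS:
            chunk = list(modifier(chunk))
        return chunk

    def chunked(lines: List[str], size: int) -> Iterable[List[str]]:
        """Split a batch into fixed-size chunks for the workers."""
        for i in range(0, len(lines), size):
            yield lines[i:i + size]

    if __name__ == '__main__':
        batch = [f'line {n}' for n in range(100)]      # default batch size: 100
        workers = max(min(os.cpu_count() or 1, 8), 1)  # at most 8, at least 1
        with Pool(workers) as pool:
            # imap (not imap_unordered) yields results in submission order,
            # which is why the number of workers cannot change the output.
            out = list(chain.from_iterable(pool.imap(apply_chain, chunked(batch, 16))))
        print(len(out))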

Todo:

  • Actually figure out how to implement a timing regression test in GitHub Actions

@jelmervdl linked an issue Sep 12, 2023 that may be closed by this pull request
@jelmervdl (Contributor, Author)

There still seems to be one source of non-deterministic behaviour somewhere; for me, test_full_enzh and test_prefix_augment fail fairly consistently.

@jelmervdl (Contributor, Author)

Note to self: somehow the chunks in the pool get mixed up. In the end-to-end test, exactly 13 lines were shuffled around even when running with --no-shuffle.
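
(For context: one classic way pool output gets mixed is collecting results in completion order rather than submission order, i.e. imap_unordered versus imap in multiprocessing. A minimal standalone reproduction of that effect, not the actual code from this PR:)

    import random
    import time
    from multiprocessing import Pool

    def work(chunk):
        # Simulate workers finishing at different speeds.
        time.sleep(random.random() / 100)
        return chunk

    if __name__ == '__main__':
        chunks = list(range(20))
        with Pool(4) as pool:
            ordered = list(pool.imap(work, chunks))
            unordered = list(pool.imap_unordered(work, chunks))
        assert ordered == chunks  # imap preserves submission order
        print(unordered)          # completion order; can differ between runs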

@jelmervdl marked this pull request as ready for review September 12, 2023 16:55
@XapaJIaMnu (Contributor) left a comment

LGTM

        return '\t'.join((new_src, new_trg, format_alignments(remapped_pairs)))

    def __call__(self, batch: List[str]) -> Iterable[str]:
        for line in batch:
Contributor

Do we expect this to fail now? Or is the exception here added because exceptions no longer bubble up?

@jelmervdl (Contributor, Author)

Previously the exception was caught at the Trainer level, which applied the modifiers line by line. The Trainer then logged it and skipped that line entirely.

Since we're now submitting batches to modifiers, I've changed these non-critical errors to do the same thing: just skip the line and log it. Otherwise we'd skip the entire batch (or the chunk that was assigned to a specific worker).

I've made sure that logging and raising exceptions still work inside the ModifierWorker processes.
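
Sketched out, per-line error handling inside a batched modifier could look like the following. This is a hedged illustration; SafeModifier, apply and the logger setup are made-up names for this sketch, not the PR's actual code:

    import logging
    from typing import Iterable, List

    logger = logging.getLogger(__name__)

    class SafeModifier:
        """Hypothetical modifier that handles errors per line, not per batch."""
        def apply(self, line: str) -> str:
            raise NotImplementedError()

        def __call__(self, batch: List[str]) -> Iterable[str]:
            for line in batch:
                try:
                    yield self.apply(line)
                except Exception as exc:
                    # Non-critical error: log and skip only this line, so one
                    # bad line doesn't discard the entire chunk assigned to a
                    # worker. Mirrors the old line-by-line Trainer behaviour.
                    logger.warning('Skipping line due to error: %s', exc)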

@jelmervdl (Contributor, Author)

So long story short: this modification keeps the behaviour as it is now on the main branch.

@graemenail (Contributor) left a comment

LGTM

(I was going to ask a stupid question about determinism and whether the workers complete out of order, but then saw you covered that.)

@jelmervdl merged commit 678deac into main Sep 13, 2023
@jelmervdl deleted the multiprocessing branch September 13, 2023 12:51
Successfully merging this pull request may close these issues: Sentence throughput regressions (#33).