Add trainers for the pretraining and backtranslation tasks #4
I was able to get both the pretraining and backtranslation tasks working using graphemes, as described in Appendix G of the paper. This required a couple of model tweaks: allowing the encoder to be frozen, and specifying a target mask when the two modalities differ. I also added beam search as an optional decoding strategy in the generation loop, since it seems to outperform temperature sampling for the backtranslation task.
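For reference, here's a minimal sketch of what the freezing and target-mask changes amount to. The `model.encoder` attribute and the helper names are placeholders for illustration, not the actual module layout in this repo:

```python
import torch


def freeze_encoder(model: torch.nn.Module) -> None:
    """Disable gradient updates for the encoder so only the decoder is trained."""
    for param in model.encoder.parameters():
        param.requires_grad = False
    model.encoder.eval()  # also freezes dropout / normalization statistics


def make_target_mask(target_lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Build a padding mask for the target modality (True = valid position).

    Needed when source and target modalities differ, so the target mask can
    no longer be reused from the source side.
    """
    positions = torch.arange(max_len, device=target_lengths.device)
    return positions.unsqueeze(0) < target_lengths.unsqueeze(1)
```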
I also fixed an issue with checkpoint loading and added support for restoring checkpoints without the optimizer state. This makes it possible both to resume training runs and to fine-tune from the pretrained model.
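Roughly, the restore path now looks like the sketch below; the checkpoint keys and function name are illustrative rather than the repo's exact layout:

```python
from typing import Optional

import torch


def restore_checkpoint(path: str, model: torch.nn.Module,
                       optimizer: Optional[torch.optim.Optimizer] = None) -> int:
    """Load model weights and, optionally, the optimizer state from a checkpoint.

    Returns the training step stored in the checkpoint (0 if absent) so a
    resumed run can continue where it left off.
    """
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    # Skip the optimizer state when fine-tuning from a pretrained checkpoint.
    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("step", 0)
```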
This might be too big of a change to take, but I thought it might be useful for others! Let me know if you'd prefer me to break it up into smaller changes.
Here's a sample generation from my (severely undertrained 🙈) models: