fix mask-tying to sequence length #660
Conversation
This is mostly a quick smoke test (I haven't tested it locally yet; I'm running into issues with my local dev setup). I think this needs a unit test and probably an inspection of benchmark performance to ensure the new branching doesn't impact speed.
a888d65 to c29e380
1243c74 to 735b617
I think this is ready for review. It's a pretty small change, benchmarks look OK, and tests pass locally. cc @danthe3rd @blefaudeux
Hi @erip and thanks for your contribution :)
LGTM, thanks!
As discussed in the previous comments, we really need to do some cleanup and refactoring on those APIs. But this is already a good improvement, thanks!
What does this PR do?
Fixes #655
This PR allows a class-level causal attention mask for fixed sequence-length tasks; otherwise, it creates a new mask on the fly in the forward pass, disentangling the mask from the sequence length, which can vary between batches in tasks like machine translation (MT).
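The two code paths described above could be sketched as follows. This is a minimal illustration, not the actual xformers implementation; the class and method names are hypothetical. A causal mask is precomputed and cached when a fixed sequence length is known, and built on the fly otherwise:

```python
from typing import Optional

import torch


class CausalMaskProvider(torch.nn.Module):
    """Hypothetical sketch of the masking strategy in this PR.

    If a fixed sequence length is known up front, the causal mask is
    built once and registered as a buffer; otherwise a fresh mask is
    created on the fly in forward() to match the incoming batch's
    sequence length.
    """

    def __init__(self, seq_len: Optional[int] = None):
        super().__init__()
        if seq_len is not None:
            # Fixed-length task: cache the mask at the class level.
            self.register_buffer("mask", self._make_causal_mask(seq_len))
        else:
            # Variable-length task (e.g. MT): no cached mask.
            self.mask = None

    @staticmethod
    def _make_causal_mask(seq_len: int) -> torch.Tensor:
        # Lower-triangular boolean mask: position i may attend to j <= i.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes x has shape (batch, seq, dim).
        seq_len = x.shape[1]
        if self.mask is not None:
            return self.mask[:seq_len, :seq_len]
        return self._make_causal_mask(seq_len).to(x.device)
```

The key point is the branch in `forward`: only the variable-length path pays the cost of rebuilding the mask each call, which is why the PR discussion above checks that the new branching doesn't regress the benchmarks.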
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.