Token level timestamps for long-form generation in Whisper #29148
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
In general it looks good to me, but I'd like to defer the approval to our Whisper expert @sanchit-gandhi :D
Thanks @zucchini-nlp - aligning the word-level timestamps on a per-segment level (rather than on the full sequence level) sounds good to me. We'll probably need to propagate some changes forward to the ASR pipeline so that it's compatible with these generation-related changes, but we can do this in a follow-up PR.
self.assertListEqual(tokens_shape, token_timestamps_shape)

# fmt: off
EXPECTED_OUTPUT = [
Are these values taken from the original repo? Or inspected from Transformers Whisper and deemed to be correct?
Inspected from Transformers Whisper and deemed to be correct. I am not sure how the tests for short-form timestamps were written, so I just took an audio sample and copied the model outputs.
Also cc @ArthurZucker for Whisper-related timestamps
Let's add a small end-to-end test with explicit outputs, it will help us in the long run! Otherwise LGTM
for segment, exp_segment in zip(generate_outputs["segments"][0], EXPECTED_OUTPUT):
    self.assertTrue(torch.allclose(segment["token_timestamps"], exp_segment))
Could we add a small test like `test_return_timestamps_in_preprocess` in /Users/arthurzucker/Work/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.py, just to make sure we have something explicit like:
```python
'chunks': [
    {'text': ' Conquered', 'timestamp': (0.5, 1.2)},
    {'text': ' returned', 'timestamp': (1.2, 1.64)},
    {'text': ' to', 'timestamp': (1.64, 1.84)},
    {'text': ' its', 'timestamp': (1.84, 2.02)},
    {'text': ' place', 'timestamp': (2.02, 2.28)},
    {'text': ' amidst', 'timestamp': (2.28, 2.8)},
    {'text': ' the', 'timestamp': (2.8, 2.98)},
    {'text': ' tents.', 'timestamp': (2.98, 3.48)},
],
```
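For reference, here is a minimal sketch of what exercising this through the ASR pipeline could look like; the checkpoint name and audio path below are placeholders rather than values taken from this PR:

```python
from transformers import pipeline

# Sketch only: load a small Whisper checkpoint through the ASR pipeline and
# request word-level timestamps. "openai/whisper-tiny" and "sample.wav" are
# placeholders, not the checkpoint/audio used in the actual test.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    return_timestamps="word",
)

output = asr("sample.wav")  # placeholder audio file

# `output["chunks"]` holds per-word entries with (start, end) timestamps,
# matching the structure shown above.
for chunk in output["chunks"]:
    print(chunk["text"], chunk["timestamp"])
```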
sure, done!
Sorry, that was a mistag. Got a review from Arthur, so we merged it :)
What does this PR do?
Continuation of PR #28984. Adds token-level timestamps for long-form generation. The previous PR had a quite different way of adding timestamps: it called `extract_timestamps` for each segment and each batch separately. I believe this can be done in one batch and then divided into segments the same way the sequences are divided. The final timestamps are already aligned with the total audio length, so there is no need to add a start_time to each segment. However, I am not sure if that is what we want, so I can remove this "total duration alignment" if needed.
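As a rough usage illustration, here is a minimal sketch under assumptions: the checkpoint and the dummy audio below are placeholders, and the generate arguments follow the existing Whisper long-form generation API rather than anything introduced only in this PR.

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Placeholder checkpoint; any Whisper checkpoint would do for the sketch.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder long-form input: 60 s of silence at 16 kHz (> 30 s triggers long-form generation).
long_audio = np.zeros(16000 * 60, dtype=np.float32)

# No truncation so the full audio is kept for long-form processing.
inputs = processor(
    long_audio,
    sampling_rate=16000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)

# Long-form generation returning per-segment outputs with token-level timestamps.
generate_outputs = model.generate(
    **inputs,
    return_timestamps=True,
    return_segments=True,
    return_token_timestamps=True,
)

# Each segment carries a `token_timestamps` tensor aligned with the full audio length.
for segment in generate_outputs["segments"][0]:
    print(segment["token_timestamps"])
```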
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi
@patrickvonplaten
@gante?