-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON output from diarization now includes sentences. Optimized senten… #3897
Conversation
…ce generation code Signed-off-by: Jure Demsar <demsarjure@gmail.com>
start_point_str = datetime.fromtimestamp(start_point - datetime_offset).strftime(time_str)[:-4] | ||
end_point_str = datetime.fromtimestamp(end_point - datetime_offset).strftime(time_str)[:-4] | ||
|
||
# print time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable "replace_time" is necessary for streaming diarization case.
In streaming diarization scenario, the end time timestamp should be updated everytime the diarization system gets another output.
Please do the following:
- revive the argument "replace_time"
- revive the if statement with the argument "replace_time"
Except this part, I checked that all the other functions are working well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi tango4j!
Time replacement is now handled inside the make_json_output
function (line 529). If the speaker is the same as the previous one we do not have a new sentence but the end time gets tweaked. The change is then also reflected in the printed out time (string_out
variable that gets printed out to the .txt file) automatically. There is no need to have the replace_time
argument there.
I might be wrong since I do not know how exactly the streaming pipeline will work. I can revive the replace_time
argument, but in the current codebase it would not make any sense. I am writing this so we are on the same page. Please advise about how to proceed here.
Thanks, Jure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another remark is that the current implementation was executed as proposed by Nithin Rao in #3791.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@demsarjure
it's my bad that I didn't get to check make_json_output.
It seems now, in revised code, "sentences" item takes over start-end time generation for transcript and json item "sentences" at the same time, which reduces the redundancy.
Aside from this, could you do a minor cosmetic change by removing the one word comment?
For example, "sentences", "prepare", "color?", "print time?"
Other comments are helpful, but these one word comments seem somewhat unnecessary (since the variable names are self-explanatory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @demsarjure, ran and checked the output, looks good.
Just a minor change requested, please reflect, and should be good to go.
Also as taejin mentioned you may remove the comments added that look obvious.
Signed-off-by: Jure Demsar <demsarjure@gmail.com>
@tango4j @nithinraok I revised everything in accordance with your comments. Thanks! |
@demsarjure I think you missed above comment #3897 (comment) |
Signed-off-by: Jure Demsar <demsarjure@gmail.com>
Yes, I did not see that comment, sorry about that :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @demsarjure great work. LGTM
Thank you @demsarjure . This is a thoughtful and useful contribution. |
…ce generation code
Signed-off-by: Jure Demsar demsarjure@gmail.com
What does this PR do ?
Diarization with ASR JSON output now also includes sentences. This is a continuation of #3791, which I accidentally closed when cleaning up the repo.
Collection: [Note which collection this PR will affect]
ASR
Changelog
Diarization with ASR JSON output now also includes sentences.
Usage
You can use the
examples/speaker_tasks/diarization/offline_diarization.py
script and check the output JSON file (.../pred_rttms/uniq_name.json
).Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
@titu1994, @redoctopus, @jbalam-nv, or @okuchaiev
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information
When implementing I used an approach that changes the original codebase as little as possible. In other words, I tried to be non-invasive. After reviewing the existing code for sentences (
string_out
) preparation inmake_json_output()
of/nemo/collections/asr/parts/utils/diarization_utils.py
I feel like the whole process could be done much more efficiently and elegantly. Currently, the sentence preparation is appending words to a string where the beginning of the string are the time stamps. When the ending time stamp needs to be corrected the code then uses string splitting to extract and replace/correct the end time value. This is quite clunky and inefficient. It would be much more elegant if time stamps and the sentences were stored and built within a dictionary and then printed out at the end.