whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058
Does this support multi language or just English? |
Excited! Will this support multiple speaker labelling or will it just mark speaker turns? |
Hi @Harith163 and @JianbangZ:
Both of these are doable I think, but are a little more involved and honestly depend on how the project evolves. For multilingual - I think it's easiest done by OpenAI themselves, since ultimately that boils down to a reasonably multilingual finetuning dataset, and I'm pretty sure all released Whisper models had a final finetuning stage. I'd say clustering has fewer dependencies and is a bit more tractable. I will sketch a rough plan for that once a few immediate things are done. You can take a look at the immediate roadmap over at https://github.com/akashmjn/tinydiarize/tree/main#roadmap. |
In fact @ggerganov I notice that you've already implemented C-means by hand in cpp here #130 😅 . Once I free up a little, I'll try running some clustering experiments over on the python repo. In the meantime if you are interested, this is the best method out there NME-SC:
|
Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation). Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy. |
Sounds good! For the last two points on my checklist - for now, i'll wait for your review. I've left I think it should just be clear to the user that this is an experimental feature and requires using a specific |
I synced latests |
Excited to see this PR merged. I noticed that this PR doesn't yet support the word-level timestamp flag. I wanted to flag that for consideration, as word-level timestamps are quite helpful when building applications that show diarization output. |
This should be ready to merge now. Please take a look at my changes and let me know if you agree. The most important change is that I added Also, you now have to add the
$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820] Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100] Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020] So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740] Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940] And we had a a pretty large bank or so.
Here is without it:
$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.760] Okay Houston, we've had a problem here.
[00:00:03.760 --> 00:00:08.340] Uh Houston we've had a problem.
[00:00:08.340 --> 00:00:11.320] We've had a main beam up on a volt.
[00:00:11.320 --> 00:00:13.760] Roger main beam interval.
[00:00:13.760 --> 00:00:17.960] So okay stand, by thirteen we're looking at it.
[00:00:17.960 --> 00:00:25.740] Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940] And we had a a pretty large bank or so.
Here is word-level timestamps with speaker turn detection:
$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -ml 1 -sow -tdrz
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.060]
[00:00:00.060 --> 00:00:00.500] Okay
[00:00:00.500 --> 00:00:01.340] Houston,
[00:00:01.340 --> 00:00:01.850] we've
[00:00:01.850 --> 00:00:02.160] had
[00:00:02.160 --> 00:00:02.260] a
[00:00:02.260 --> 00:00:02.990] problem
[00:00:02.990 --> 00:00:03.800] here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:04.030] This
[00:00:04.030 --> 00:00:04.140] is
[00:00:04.140 --> 00:00:04.710] Houston.
[00:00:04.710 --> 00:00:04.880] Say
[00:00:04.880 --> 00:00:05.170] again
[00:00:05.170 --> 00:00:06.200] please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:06.340] Uh
[00:00:06.340 --> 00:00:06.850] Houston
[00:00:06.850 --> 00:00:07.210] we've
[00:00:07.210 --> 00:00:07.430] had
[00:00:07.430 --> 00:00:07.530] a
[00:00:07.530 --> 00:00:08.260] problem.
[00:00:08.260 --> 00:00:08.770] We've
[00:00:08.770 --> 00:00:09.080] had
[00:00:09.080 --> 00:00:09.180] a
[00:00:09.180 --> 00:00:09.610] main
[00:00:09.610 --> 00:00:10.000] beam
[00:00:10.000 --> 00:00:10.200] up
[00:00:10.200 --> 00:00:10.400] on
[00:00:10.400 --> 00:00:10.500] a
[00:00:10.500 --> 00:00:11.320] volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:11.840] Roger
[00:00:11.840 --> 00:00:12.250] main
[00:00:12.250 --> 00:00:12.740] beam
[00:00:12.740 --> 00:00:13.820] interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.080] Uh
[00:00:15.080 --> 00:00:15.100] uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:15.230] So
[00:00:15.230 --> 00:00:15.500] okay
[00:00:15.500 --> 00:00:15.970] stand,
[00:00:15.970 --> 00:00:16.100] by
[00:00:16.100 --> 00:00:16.660] thirteen
[00:00:16.660 --> 00:00:16.980] we're
[00:00:16.980 --> 00:00:17.460] looking
[00:00:17.460 --> 00:00:17.610] at
[00:00:17.610 --> 00:00:18.020] it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:18.570] Okay
[00:00:18.570 --> 00:00:18.840] uh
[00:00:18.840 --> 00:00:19.530] right
[00:00:19.530 --> 00:00:19.940] now
[00:00:19.940 --> 00:00:20.210] uh
[00:00:20.210 --> 00:00:21.170] Houston
[00:00:21.170 --> 00:00:21.580] the
[00:00:21.580 --> 00:00:21.850] uh
[00:00:21.850 --> 00:00:22.810] voltage
[00:00:22.810 --> 00:00:23.080] is
[00:00:23.080 --> 00:00:23.400] uh
[00:00:23.400 --> 00:00:23.730] is
[00:00:23.730 --> 00:00:24.810] looking
[00:00:24.810 --> 00:00:25.440] good
[00:00:25.440 --> 00:00:25.740] um.
[00:00:27.620 --> 00:00:27.670]
[00:00:27.670 --> 00:00:27.840] And
[00:00:27.840 --> 00:00:27.980] we
[00:00:27.980 --> 00:00:28.210] had
[00:00:28.210 --> 00:00:28.270] a
[00:00:28.270 --> 00:00:28.340] a
[00:00:28.340 --> 00:00:28.780] pretty
[00:00:28.780 --> 00:00:29.150] large
[00:00:29.150 --> 00:00:29.440] bank
[00:00:29.440 --> 00:00:29.580] or
[00:00:29.580 --> 00:00:29.940] so. |
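For anyone consuming this terminal output downstream, here is a minimal Python sketch (not part of the PR) of one way to merge the printed segments into speaker turns. It relies only on the line format shown in the transcripts above, a timestamp pair in brackets followed by text and an optional trailing [SPEAKER_TURN] marker. The names group_turns, LINE_RE, and TURN_MARKER are made up for illustration.

```python
import re

# Matches printed lines such as:
# "[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]"
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)
TURN_MARKER = "[SPEAKER_TURN]"


def group_turns(lines):
    """Merge printed segments into (start, end, text) tuples, one per speaker turn.

    A turn is closed by a segment whose text ends with the [SPEAKER_TURN] marker.
    """
    turns = []
    start, end, pieces = None, None, []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip the "main: processing ..." banner and other non-segment lines
        seg_start, seg_end, text = match.groups()
        text = text.strip()
        ends_turn = text.endswith(TURN_MARKER)
        if ends_turn:
            text = text[: -len(TURN_MARKER)].strip()
        if start is None:
            start = seg_start
        end = seg_end
        if text:
            pieces.append(text)
        if ends_turn:
            turns.append((start, end, " ".join(pieces)))
            start, end, pieces = None, None, []
    if pieces:
        # Trailing text with no closing marker still forms a final turn.
        turns.append((start, end, " ".join(pieces)))
    return turns


sample = [
    "[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]",
    "[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]",
    "[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.",
    "[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]",
]

for start, end, text in group_turns(sample):
    print(f"{start} - {end}: {text}")
```

The same function also works on the word-level output, since each word is printed as its own segment and the marker still appears on the last word of a turn. Note that this only recovers turn boundaries; it does not say which speaker is talking.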
Added some comments relating to some tricky token ID stuff
Enables tinydiarize models ggerganov/whisper.cpp#1058
@karolszafranski I think no need any special settings, set |
…dio via tinydiarize (ggerganov#1058)
* add HuggingFace mirror to download ggml model
* support tdrz via simple hack overriding solm tokens
* fix incorrect translate/transcribe token_ids that are not static const
* add apollo 13 sample for tdrz demo
* render [SPEAKER TURN] consistently in all terminal output using vocab.id_to_token
* extend whisper_segment with speaker_turn_next field and save in json output
* fix failing go build
* slipped in some python syntax whoops
* whisper : finalize tinydiarize support (add flag + fixes)
* whisper : tdrz support for word-level timestamps (respect max_len)
* java : try to fix tests after adding tdrz_enable flag
* main : remove TODO leftover
* java : fix params order list after adding "tdrz_enable"
* whisper : fix solm and add nosp token
* main : print tinydiarize help
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
i'm not sure if this is expected, but with i wasn't seeing this behavior with |
Is there a way to use this with coreml models?
|
It also happens with the small model, on its own or when pushed via ">>" prompt. Unfortunately, for the life of me I cannot combine it with my other prompt, which resulted in proper quote-unquote behavior, i.e.
And quotes only happen when using -oved GPU [unfortunately it hallucinates a lot], whereas -oved CPU is much more likely to trigger ">>" diarizations on its own. |
Hello! I was wondering - how does the integration work with the because I was running - it through the binary and from the server - and it seemed the diarization output was missing. Example:
./main -f ../audio/multi.wav -m ./models/ggml-small.en-tdrz.bin -tdrz --print-colors
# output
[00:00:00.080 --> 00:00:04.820] Let's go down. So your sister's going off How. old is she? [SPEAKER_TURN]
[00:00:04.820 --> 00:00:08.620] She's twenty five. [SPEAKER_TURN]
[00:00:08.620 --> 00:00:12.560] Alright. And is she going to go to do a job or is she's gonna travel? [SPEAKER_TURN]
[00:00:12.560 --> 00:00:19.940] Um she's going to work when she's there and do like bits of jobs and then move around at the same time. [SPEAKER_TURN]
[00:00:19.940 --> 00:00:22.520] So is she's goin
Same example using the server
./server -m models/ggml-small.en-tdrz.bin -tdrz -pc -debug
curl 127.0.0.1:8080/inference \
-H "Content-Type: multipart/form-data" \
-F file="@../audio/multi.wav" \
-F response_format="json" \
-F tinydiarize=true
Output {"text":" Mm. Okay.\n So your sister's going off How. old is she?\n She's twenty five.\n Alright. And is she going to go to do a job or is she's gonna travel?\n Um she's going to work when she's there and do like bits of jobs and then move around at the same time.\n So she's going straight to Australia?\n Um no first she's going to Thailand.\n And then she's going to Australia.\n And then move somewhere and then in America.\n Brilliant. So if she's bought one of these year t tickets you can go around the world f in a year or something Is. that what she's done with these airline tickets yeah, Yeah? So would you like to travel?\n Yeah.\n Mm-hmm.\n That's a good a reason though. Yeah. Actually I think it probably is because I mean I know it sounds straight forward but you can sort of add E_s and A_s and things on the end of things and it normally sounds right anyway. We've got a Spanish girl working with us at the moment so, So this is a a two year course now is, it G_C_S_E_s?\n Yeah. It's from year ten to year E_ el" }
Tried all of these formats and the same thing -
No issues if it is not supported - was just wondering if it was possible because the docs mentioned we can pass -tdrz to the server so was wondering if I was doing anything wrong! Cheers |
Hey guys, is this functionality coming to the larger models or could we compile it ourselves? Thank you so much |
As discussed in #64, this PR adds experimental support for local diarization (marking of speaker turns) via integration of checkpoints from this project https://github.com/akashmjn/tinydiarize/tree/main.
This is an early functional prototype done for the small.en models.
@ggerganov - this should be functionally done save for the last two points on the checklist, for which I'd appreciate some comments on the right way to expose this.
(also please excuse my C++, I haven't written a lot of it, so this is heavily copilot-assisted 😉 )
Example usage
After running the above, you should see this:
JSON output contains an extra speaker_turn_next field for each segment with this information.
Example JSON output
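A minimal Python sketch of how a consumer might use the speaker_turn_next field to merge segments into turns. This is not the PR's actual JSON output: only the speaker_turn_next field name comes from the description above, while the surrounding layout (a top-level "transcription" list whose entries carry a "text" string) and the sample values are assumptions made for illustration.

```python
import json

# Hypothetical JSON in an ASSUMED layout; only "speaker_turn_next" is named in the PR.
raw = """
{
  "transcription": [
    {"text": " Okay Houston, we've had a problem here.", "speaker_turn_next": true},
    {"text": " This is Houston. Say again please.", "speaker_turn_next": true},
    {"text": " Uh Houston we've had a problem.", "speaker_turn_next": false},
    {"text": " We've had a main beam up on a volt.", "speaker_turn_next": true}
  ]
}
"""


def turns_from_json(doc, segments_key="transcription"):
    """Join consecutive segment texts until a segment flags speaker_turn_next."""
    turns, current = [], []
    for segment in doc.get(segments_key, []):
        current.append(segment["text"].strip())
        if segment.get("speaker_turn_next", False):
            turns.append(" ".join(current))
            current = []
    if current:
        # Segments after the last flagged turn form a final, unterminated turn.
        turns.append(" ".join(current))
    return turns


for index, turn in enumerate(turns_from_json(json.loads(raw))):
    print(f"turn {index}: {turn}")
```

If the real output nests segments under a different key, pass it via segments_key. The field marks where the speaker changes, not who is speaking.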
Checklist:
--diarize flag
Some terminology context for the last two points: this is technically not complete diarization yet, but speaker segmentation https://www.perplexity.ai/search/d01e6743-d2dc-4f5e-b5c2-2bf2212068f7?s=u (which can be thought of as local diarization).
Also, technically the stereo audio input used by the current --diarize flag is already diarized (as it is separated into individual channels), so the naming isn't strictly consistent here either?
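To illustrate why channel-separated stereo counts as "already diarized", here is a rough Python sketch of one simple way a speaker label can be read off two channels: compare the signal energy of each channel over a segment. It is an illustration under stated assumptions, not a description of how the existing --diarize flag is implemented; the function name and the 1.1 margin are hypothetical.

```python
def label_by_channel_energy(left, right, margin=1.1):
    """Return 0 or 1 for the clearly louder channel, or None if neither dominates.

    `left` and `right` are sequences of samples covering the same time span.
    `margin` (a hypothetical threshold) is how much louder one channel must be.
    """
    energy_left = sum(abs(sample) for sample in left)
    energy_right = sum(abs(sample) for sample in right)
    if energy_left > margin * energy_right:
        return 0
    if energy_right > margin * energy_left:
        return 1
    return None  # ambiguous: overlapping speech, silence, or channel bleed


# Speaker on the left channel is talking; the right channel is near silent.
print(label_by_channel_energy([0.5, -0.4, 0.3], [0.01, 0.0, -0.02]))
```

Mono audio has no such per-channel cue, which is why the speaker turn markers in this PR have to come from the model itself.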