
[Not for merge] Diarization workflow with SpeechBrain #1031

Open · wants to merge 6 commits into master

Conversation

desh2608 (Collaborator)
This workflow shows how we can use SpeechBrain x-vectors + sklearn agglomerative clustering to perform crude speaker diarization. It can be used on top of the Whisper workflow to obtain speaker-attributed transcripts.
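For context, here is a minimal sketch of the kind of pipeline this describes (illustrative only, not the code in this branch): embed sliding windows with a pretrained SpeechBrain ECAPA-TDNN model, then cluster the embeddings with sklearn. The input file name, window length, and distance threshold are assumptions.

```python
# Minimal sketch, not the PR's actual code: window the recording, embed each
# window with a pretrained SpeechBrain speaker model, cluster the embeddings.
# All file names and hyperparameters here are illustrative.
import torch
import torchaudio
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

wav, sr = torchaudio.load("recording.wav")  # hypothetical input file
win, hop = int(1.5 * sr), int(0.75 * sr)  # assumed 1.5 s windows, 50% overlap

# Embed each window (no batching, for brevity).
embeddings = torch.cat(
    [
        classifier.encode_batch(wav[:, start : start + win]).squeeze(1)
        for start in range(0, wav.shape[1] - win, hop)
    ]
)

# Cluster with a distance threshold instead of a fixed number of speakers.
# `metric` is called `affinity` in sklearn < 1.2; the 0.7 threshold is a
# guess that would need tuning on real data.
labels = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.7,
    metric="cosine",
    linkage="average",
).fit_predict(embeddings.detach().cpu().numpy())
print(labels)  # one speaker label per window
```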

@pzelasko (Collaborator) left a comment:

This is cool, what is the reason you don't want to merge it?

@desh2608 (Collaborator, Author)

> This is cool, what is the reason you don't want to merge it?

Mainly because this approach isn't really benchmarked on anything, and I am not sure how well the ECAPA-TDNN embeddings would work with agglomerative clustering.

@flyingleafe (Contributor)

@desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well.
(https://github.com/pyannote/pyannote-audio)
Why not use it directly?

@desh2608 (Collaborator, Author)

> @desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well. (https://github.com/pyannote/pyannote-audio) Why not use it directly?

I think that was in the older Pyannote, if I'm not mistaken? Pyannote 2.0 uses end-to-end segmentation, which performs much better. In any case, this was just a quick DIY workflow. It should be relatively easy for folks to use Pyannote to create RTTMs and then call SupervisionSet.from_rttm() to create Lhotse manifests.
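For anyone who prefers that route, a hedged sketch of the Pyannote + Lhotse path (the model name and file paths are assumptions, not from this PR; recent pyannote versions may also require a Hugging Face auth token):

```python
# Sketch only: run a pretrained pyannote pipeline, write its output as RTTM,
# and load the RTTM into a Lhotse SupervisionSet.
from pyannote.audio import Pipeline
from lhotse import SupervisionSet

# May need use_auth_token=... depending on the pyannote version.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("recording.wav")  # hypothetical input file

# pyannote annotations serialize directly to RTTM.
with open("recording.rttm", "w") as f:
    diarization.write_rttm(f)

# One supervision segment per diarized speaker turn, with speaker labels set.
supervisions = SupervisionSet.from_rttm("recording.rttm")
```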

@flyingleafe (Contributor)

>> @desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well. (https://github.com/pyannote/pyannote-audio) Why not use it directly?

> I think that was in the older Pyannote, if I'm not mistaken? Pyannote 2.0 uses end-to-end segmentation, which performs much better. In any case, this was just a quick DIY workflow. It should be relatively easy for folks to use Pyannote to create RTTMs and then call SupervisionSet.from_rttm() to create Lhotse manifests.

Well, not quite: the segmentation model in Pyannote 2.0 is only the first step; the assignment of speakers to the segments is still done with ECAPA-TDNN + clustering. But whatever.
