- A Review of Speaker Diarization: Recent Advances with Deep Learning, 2021
- A review on speaker diarization systems and approaches, 2012
- Speaker diarization: A review of recent research, 2010
- Online Speaker Diarization with Relation Network
- An End-to-End Speaker Diarization Service for improving Multimedia Content Access
- Spot the conversation: speaker diarisation in the wild
- Speaker Diarization with Region Proposal Network
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
- Supervised online diarization with sample mean loss for multi-domain data
- Discriminative Neural Clustering for Speaker Diarisation
- End-to-End Neural Speaker Diarization with Permutation-Free Objectives
- Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection
- Speaker diarization using latent space clustering in generative adversarial network
- A study of semi-supervised speaker diarization system using GAN mixture model
- Learning deep representations by multilayer bootstrap networks for speaker diarization
- Enhancements for Audio-only Diarization Systems
- LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
- Meeting Transcription Using Virtual Microphone Arrays
- Speaker diarisation using 2D self-attentive combination of embeddings
- Speaker Diarization with Lexical Information
- Fully Supervised Speaker Diarization
- Neural speech turn segmentation and affinity propagation for speaker diarization
- Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
- Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks
- Speaker Diarization with LSTM
- Speaker diarization using deep neural network embeddings
- Speaker diarization using convolutional neural network for statistics accumulation refinement
- pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
- Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks
- Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
- Self-supervised learning for audio-visual speaker diarization (Ding Y, Xu Y, Zhang S X, et al. ICASSP, 2020) An audio-visual method for speaker diarization based on contrastive learning.
- Joint Speech Recognition and Speaker Diarization via Sequence Transduction (Shafey L E, Soltau H, Shafran I) Based on the RNN-T architecture; combines textual content with acoustic speaker information.
- Look Who's Not Talking (Kwon Y, Heo H S, Huh J, et al. (VGG)) Since speaker embeddings can discriminate one person's speech from another's, they might also be able to discriminate speech from non-speech.
- Self-Supervised Learning of Audio-Visual Objects from Video (Afouras T, Owens A, et al.) Applies cross-modal attention to contrastive learning.
- Multiple Sound Sources Localization from Coarse to Fine (Qian R, Hu D, Dinkel H, et al.) Uses class activation maps (CAM) for sound source localization.
- Dual Attention Matching for Audio-Visual Event Localization (Wu Y, Zhu L, Yan Y, et al. IEEE, 2019) Combines local and global features to estimate relevant frames.
- Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
- Audio-Visual Event Localization in Unconstrained Videos (Tian Y, Shi J, Li B, et al. ECCV, 2018) Cross-modal attention for localization; features of temporally aligned frames are pulled closer together.
- Learning to Localize Sound Sources in Visual Scenes (Senocak A, Oh T H, Kim J, et al. IEEE, 2018) Cross-modal attention and contrastive learning.
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
- Visual speech enhancement (Gabbay A, Shamir A, Peleg S) Combines video frames with the mixed audio to generate clean audio.
- Spot the conversation: speaker diarisation in the wild (Chung J S, Huh J, Nagrani A, et al. (VGG)) A free speaker diarization dataset: large, with overlapping speech and background noise.
- VoxConverse: an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos.
- pyannote.audio: neural building blocks for speaker diarization by Hervé Bredin
- Google's Diarization System: Speaker Diarization with LSTM by Google
- Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
- Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
- Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
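- [机器之心 (Synced) & 博文视点 (Broadview)] Introduction to Voice Identity Techniques, Lecture 2: Speaker Diarization and Other Applications (Chinese) by Quan Wang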
- Voice Identity Techniques: From core algorithms to engineering practice (Chinese) by Quan Wang, 2020
Quan Wang's repo inspires us a lot. Many thanks!