The human brain can associate an unknown person's voice with their face by leveraging the general relationship between the two modalities, a task referred to as cross-modal speaker verification. The task is challenging because of the complex relationship between the modalities. In this paper, we propose a Multi-stage Face-voice Association Learning with Keynote Speaker Diarization (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end that effectively handles noisy speech inputs. To balance and enhance intra-modal feature learning and inter-modal correlation understanding, MFV-KSD adopts a novel three-stage training strategy. Our experimental results demonstrate robust performance: MFV-KSD ranked first in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%.
- The original MAV-Celeb dataset can be found here; the FAME 2024 challenge page can be found here.
- Our system contains multiple stages. We do not include the code for the keynote speaker diarization front-end, but the cleansed speech segments can be downloaded here. A hedged sketch of the diarization step is shown below this list.
- For the three-stage training described in our paper, we do not include the code for intra-modal recognition learning. The face recognition training process and model can be found here, and the speaker recognition training process and model can be found here. The pre-trained face encoder and speaker encoders (seen, English-unseen, and Urdu-unseen) can be found here. A sketch of the embedding-extraction step is shown below this list.
- The inter-modal verification learning shares similar code with the FAME adaptation; the training lists can be found here. Vox_FAME.txt is used for stage 2 and FAME.txt for stage 3. Unseen lists can be generated in the same format (see the list-writing sketch below).
- We used a validation set during the challenge. This code does not include the validation list/code and uses the testing code directly; validation support can easily be added (see the EER sketch below).
- Run training with `bash run_train.sh`.
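The keynote speaker diarization code is not released here. As a rough illustration only, the sketch below uses an off-the-shelf pyannote.audio pipeline and a simple heuristic that treats the speaker with the most total speech as the keynote speaker; the pipeline name, the token handling, and the heuristic are assumptions, not necessarily the exact front-end used in the paper.

```python
"""Sketch: keep only the dominant ("keynote") speaker's speech."""
from collections import defaultdict

import numpy as np
import soundfile as sf
from pyannote.audio import Pipeline


def keep_keynote_speaker(wav_path: str, out_path: str, hf_token: str) -> None:
    # Off-the-shelf diarization front-end (needs a HuggingFace access token).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(wav_path)

    # Total speaking time per diarized speaker label.
    durations = defaultdict(float)
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        durations[speaker] += turn.end - turn.start

    # Assumed heuristic: the keynote speaker dominates the recording.
    keynote = max(durations, key=durations.get)

    # Concatenate only the keynote speaker's segments into a cleansed file.
    audio, sr = sf.read(wav_path)
    segments = [
        audio[int(turn.start * sr): int(turn.end * sr)]
        for turn, _, speaker in diarization.itertracks(yield_label=True)
        if speaker == keynote
    ]
    sf.write(out_path, np.concatenate(segments), sr)
```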
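For the intra-modal stage, the linked pre-trained encoders serve as feature extractors. The sketch below shows only the generic extraction step; the encoder objects are whatever face/speaker recognition models you load from the checkpoints above, and the L2-normalization is a common convention rather than a confirmed detail of our pipeline.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def extract_embeddings(face_encoder, speaker_encoder, faces, speech):
    # faces: (B, 3, H, W) image batch; speech: (B, T) raw-waveform batch.
    face_encoder.eval()
    speaker_encoder.eval()
    # L2-normalize so cosine similarity reduces to a dot product.
    face_emb = F.normalize(face_encoder(faces), dim=-1)
    voice_emb = F.normalize(speaker_encoder(speech), dim=-1)
    return face_emb, voice_emb
```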
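The released lists define the exact file format, which is not reproduced here. Purely for illustration, the sketch below assumes a `label face_path audio_path` layout; mirror the actual column order of Vox_FAME.txt / FAME.txt when generating unseen lists.

```python
def write_pair_list(pairs, out_path="FAME_unseen.txt"):
    # pairs: iterable of (label, face_path, audio_path) tuples, where the
    # (assumed) label is 1 for a matched face-voice pair and 0 otherwise.
    with open(out_path, "w") as f:
        for label, face_path, audio_path in pairs:
            f.write(f"{label} {face_path} {audio_path}\n")
```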
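If you re-add validation, the natural metric is the EER reported above. A minimal computation, assuming `labels` marks matched pairs with 1 and `scores` are similarity scores:

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    # The EER is the operating point where the false positive rate
    # equals the false negative rate (1 - TPR).
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```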