Take a grain of SALT to that voice!
TL;DR: A speaker anonymization and interpolation tool based on WavLM hidden space transformation.
Official code implementation for the ASRU 2023 paper SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation.
Try it out interactively at colab:
- Install dependencies: SALT has the same dependencies as kNN-VC (torch, torchaudio, numpy), plus pandas for data processing and gradio for the web demo.
- Download prebuilt speaker packs (optional):
cd assets
wget https://github.com/BakerBunker/SALT/releases/download/1.0.0/librispeech-pack.zip
unzip librispeech-pack.zip
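Before loading the model, it can help to confirm that the packages named in the install step are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of SALT.

```python
import importlib.util

# Packages named in the install step above.
REQUIRED = ["torch", "torchaudio", "numpy", "pandas", "gradio"]

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```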
- Load model:
import torch
anon = torch.hub.load('BakerBunker/SALT','salt', trust_repo=True, pretrained=True, base=True, device='cuda')
# Set base=True to use WavLM-Base as the feature extractor
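The call above hard-codes `device='cuda'`. On a machine without a GPU, one option is to choose the device at runtime. This is only a sketch: `pick_device` is a hypothetical helper, and it assumes the `device` argument also accepts `'cpu'`, which the text above does not confirm.

```python
def pick_device():
    """Return 'cuda' when a CUDA-capable GPU is visible to torch, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # torch is not installed; fall back to a safe default string.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
# Hypothetical usage with the hub call shown above:
# anon = torch.hub.load('BakerBunker/SALT', 'salt', trust_repo=True,
#                       pretrained=True, base=True, device=device)
```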
- Make speaker packs (optional):
path=anon.make_speaker_pack(['tensor_or_path_to_wav',...],speaker_name)
- Add speakers:
anon.add_speaker('example',wavs=['tensor_or_path_to_wav',...])
# OR add a preprocessed .pack file:
anon.add_speaker('example',preprocessed_file='example.pack')
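When registering many speakers, the list of wav paths for each one can be gathered from disk first. The sketch below assumes a hypothetical layout of one sub-directory per speaker (`root/<speaker_name>/*.wav`); neither that layout nor the helper `collect_speaker_wavs` comes from SALT.

```python
from pathlib import Path

def collect_speaker_wavs(root):
    """Map each immediate sub-directory of `root` to its sorted .wav paths."""
    speakers = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        wavs = sorted(str(p) for p in speaker_dir.glob("*.wav"))
        if wavs:  # skip directories without any wav files
            speakers[speaker_dir.name] = wavs
    return speakers

# Hypothetical usage with the calls shown above:
# for name, wavs in collect_speaker_wavs("my_speakers").items():
#     anon.add_speaker(name, wavs=wavs)
```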
- Mix speakers:
wav=anon.interpolate(
    'tensor_or_path_to_wav',
    # pandas DataFrame with columns 'speaker' and 'weight'
    # OR
    # dict of the form {'speaker': weight},
    topk=4,  # K for k-NN
    # OR use chunked mode for long wavs
    chunksize=5,  # 5 seconds per chunk
    padding=0.5,  # pad 0.5 seconds at the head and tail of each chunk
)
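The speaker weights can be prepared as a plain dict before the call. The sketch below rescales raw weights so they sum to 1 and also produces a column layout that `pandas.DataFrame(...)` accepts. Both helper names are hypothetical, and whether SALT expects normalized weights or normalizes them itself is an assumption not confirmed above.

```python
def normalize_weights(raw):
    """Scale non-negative speaker weights so that they sum to 1."""
    if any(w < 0 for w in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {speaker: w / total for speaker, w in raw.items()}

def to_columns(weights):
    """Column layout with 'speaker' and 'weight' keys, e.g. for pandas.DataFrame(...)."""
    return {"speaker": list(weights), "weight": list(weights.values())}

weights = normalize_weights({"example": 1, "another": 3})
print(weights)  # {'example': 0.25, 'another': 0.75}
columns = to_columns(weights)
# Hypothetical usage with the call shown above ('input.wav' is a placeholder):
# wav = anon.interpolate('input.wav', weights, topk=4)
```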
WavLM-Large and its corresponding vocoder are available from kNN-VC.
WavLM-Base and its corresponding vocoder are available on the release page.
The training process is the same as for kNN-VC.
Huge THANKS to kNN-VC and its authors; our code is largely based on that repository.
kNN-VC: https://github.com/bshall/knn-vc
Part of the code is based on:
HiFiGAN: https://github.com/jik876/hifi-gan
WavLM: https://github.com/microsoft/unilm/tree/master/wavlm
@inproceedings{Lv2023SALTDS,
title={SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation},
author={Yuanjun Lv and Jixun Yao and Peikun Chen and Hongbin Zhou and Heng Lu and Lei Xie},
year={2023},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding (ASRU)},
}