This script aims to identify speakers with stable voice timbre.
The evaluation is based on the Institute for Intelligent Computing's ERes2NetV2 speaker recognition model.
audio_generator.py: Script to generate random speakers and create test audio
consistency_evaluator.py: Script to evaluate stability
test_data.yaml: Text data for generating audio
Please ensure that the ChatTTS project is installed and running correctly (see the ChatTTS installation steps).
Clone this project and copy the speaker_consistency folder to the root directory of the ChatTTS project.
The directory should look like this:
├── ChatTTS
│   ├── __init__.py
│   ├── core.py
│   └── ...
└── speaker_consistency
    ├── audio_generator.py
    ├── consistency_evaluator.py
    ├── requirements.txt
    └── test_data.yaml
Install dependencies:
pip install -r ./speaker_consistency/requirements.txt
Run the audio_generator.py script to generate test audio:
python speaker_consistency/audio_generator.py
Run the consistency_evaluator.py script to perform the evaluation:
python speaker_consistency/consistency_evaluator.py
The evaluation results will be saved in the evaluation_results.csv file.
You can edit the test_data.yaml file to modify the test text.
python audio_generator.py --dir <output_directory> --num <number_of_speakers> --ds <dataset_yaml_file>
Parameter Description:
--dir: Directory to store the generated test audio files. Default is ./test_audio.
--num: Number of random speakers to generate. Default is 10.
--ds: Path to the dataset YAML file. Default is test_data.yaml.
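As an illustration of the flags and defaults listed above, here is a minimal sketch of how such a command line could be declared with argparse. The function name build_parser is hypothetical, and the actual audio_generator.py may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Generate test audio for random speakers."
    )
    parser.add_argument("--dir", default="./test_audio",
                        help="Directory to store the generated test audio files.")
    parser.add_argument("--num", type=int, default=10,
                        help="Number of random speakers to generate.")
    parser.add_argument("--ds", default="test_data.yaml",
                        help="Path to the dataset YAML file.")
    return parser


# Parse an explicit argument list; unspecified flags fall back to the defaults.
args = build_parser().parse_args(["--num", "20"])
print(args.dir, args.num, args.ds)
```

Omitting a flag keeps its documented default, so running the script with no arguments would generate 10 speakers into ./test_audio from test_data.yaml.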
python consistency_evaluator.py --dir <output_directory>
Parameter Description:
--dir: Directory containing the audio files generated in the previous step. Default is ./test_audio.
The cosine similarity between the embedding vectors of each pair of audio segments is calculated to obtain the mean and standard deviation of the similarity. After normalizing the standard deviation, the rank metric assigns a weight of 70% to the mean and 30% to the standard deviation. Generally, the higher the rank metric, the better the consistency of the audio segments.
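The scoring described above can be sketched in plain Python. This is not the evaluator's actual code: the function names are hypothetical, and the normalization step is an assumption (min-max scaling of the standard deviation across speakers, inverted so that a lower spread scores higher), since the exact formula is not stated here.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_stats(embeddings):
    """Mean and standard deviation of cosine similarity over all pairs."""
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return mean(sims), pstdev(sims)


def rank_scores(stats):
    """Combine (mean, std) pairs, one per speaker, into rank scores.

    Assumption: std is min-max normalized across speakers and inverted,
    then weighted 30% against 70% for the mean.
    """
    stds = [s for _, s in stats]
    lowest, span = min(stds), max(stds) - min(stds)
    scores = []
    for m, s in stats:
        norm_std = (s - lowest) / span if span > 0 else 0.0
        scores.append(0.7 * m + 0.3 * (1.0 - norm_std))
    return scores


# Toy embeddings: the first speaker is perfectly consistent, the second is not.
stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
unstable = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_scores([similarity_stats(stable), similarity_stats(unstable)]))
```

Under these assumptions, a speaker whose segments all map to nearly the same embedding gets a high mean and a low spread, and therefore a higher rank than a speaker whose embeddings drift between segments.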
id | ... | rank_TestA | rank_TestB |
---|---|---|---|
0000 | ... | 0.802779 | 0.809263 |
0001 | ... | 0.858448 | 0.773149 |
0002 | ... | 0.763376 | 0.779981 |
Speakers with consistently high scores across all sample sets have relatively stable voice timbre. You can adjust the test set according to your specific use case.
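One way to pick out speakers that score well on every test set is to filter on each speaker's worst rank column. The sketch below assumes that evaluation_results.csv is comma-separated with an id column and one rank_* column per test set; the function name stable_speakers and the threshold value are hypothetical and should be tuned to your data.

```python
import csv
import io


def stable_speakers(csv_file, threshold):
    """Return (id, worst_rank) for speakers whose every rank_* is >= threshold."""
    reader = csv.DictReader(csv_file)
    rank_cols = [c for c in (reader.fieldnames or []) if c.startswith("rank_")]
    if not rank_cols:
        raise ValueError("no rank_* columns found")
    selected = []
    for row in reader:
        worst = min(float(row[c]) for c in rank_cols)
        if worst >= threshold:
            selected.append((row["id"], worst))
    # Best worst-case score first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# The three example rows from the table above, with the elided columns left out.
sample = io.StringIO(
    "id,rank_TestA,rank_TestB\n"
    "0000,0.802779,0.809263\n"
    "0001,0.858448,0.773149\n"
    "0002,0.763376,0.779981\n"
)
print(stable_speakers(sample, threshold=0.78))  # [('0000', 0.802779)]
```

Filtering on the minimum rather than the average keeps out speakers such as 0001, which scores high on one test set but noticeably lower on the other. For a real run, pass an open file handle for evaluation_results.csv instead of the in-memory sample.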
By default, each speaker's timbre is saved with the filename speaker.pt in the corresponding speaker folder inside the generated audio directory.
You can load the speaker timbre with the following code:
import torch

# Replace <PT-FILE-PATH> with the path to a saved speaker.pt file.
spk = torch.load("<PT-FILE-PATH>", map_location=torch.device("cpu"))
params_infer_code = {
    'spk_emb': spk,
}
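To locate the saved timbre files for several speakers at once, a small helper like the following may be convenient. It is only a sketch: the function name find_speaker_files is hypothetical, and it assumes that every speaker has its own folder somewhere under the output directory containing a file named speaker.pt.

```python
from pathlib import Path


def find_speaker_files(audio_dir="./test_audio"):
    """Map each folder name to the speaker.pt file found inside it."""
    root = Path(audio_dir)
    # rglob searches all subfolders; sorting keeps the order deterministic.
    return {p.parent.name: p for p in sorted(root.rglob("speaker.pt"))}


for folder, path in find_speaker_files().items():
    print(folder, path)
```

Each returned path can then be passed to torch.load as shown above. If the directory does not exist or holds no speaker.pt files, the helper returns an empty dictionary.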
The evaluation results may be limited by the quantity and diversity of the samples.