feat:
support TTS mode:
text + ref audio waveform -> tokenizer -> text + audio token ids -> Step1 LM -> audio token ids (wav_code) -> flow (CFM) -> mel -> vocoder (HiFT) -> waveform
src + ref audio waveform -> speech tokenizer -> audio token ids (wav_code) -> flow (CFM) -> mel -> vocoder (HiFT) -> waveform in the ref audio's voice (voice clone)
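The two data flows above can be sketched as plain function composition. This is a minimal illustration, not the code in this PR: `tts`, `clone`, and every `toy_*` stage are hypothetical stand-ins, and passing the ref waveform into the flow stage as conditioning is an assumption about how the reference audio is used.

```python
from typing import Callable, List

Waveform = List[float]
TokenIds = List[int]
Mel = List[List[float]]


def tts(text: str, ref_wav: Waveform,
        tokenize: Callable[[str, Waveform], TokenIds],
        lm: Callable[[TokenIds], TokenIds],
        flow: Callable[[TokenIds, Waveform], Mel],
        vocoder: Callable[[Mel], Waveform]) -> Waveform:
    """text + ref audio -> token ids -> LM -> wav_code -> mel -> waveform."""
    prompt_ids = tokenize(text, ref_wav)   # text + audio token ids
    wav_code = lm(prompt_ids)              # audio token ids
    mel = flow(wav_code, ref_wav)          # assumption: flow is conditioned on ref audio
    return vocoder(mel)


def clone(src_wav: Waveform, ref_wav: Waveform,
          speech_tokenize: Callable[[Waveform], TokenIds],
          flow: Callable[[TokenIds, Waveform], Mel],
          vocoder: Callable[[Mel], Waveform]) -> Waveform:
    """src audio -> wav_code -> mel -> waveform; the LM is skipped entirely."""
    wav_code = speech_tokenize(src_wav)
    mel = flow(wav_code, ref_wav)
    return vocoder(mel)


# Toy stages (hypothetical) so the sketch runs without any model weights.
def toy_speech_tokenize(wav: Waveform) -> TokenIds:
    return [round(abs(x) * 10) for x in wav]


def toy_tokenize(text: str, ref_wav: Waveform) -> TokenIds:
    return [ord(c) % 256 for c in text] + toy_speech_tokenize(ref_wav)


def toy_lm(ids: TokenIds) -> TokenIds:
    return [i % 16 for i in ids]


def toy_flow(code: TokenIds, ref_wav: Waveform) -> Mel:
    # One 2-bin "mel" frame per token; the ref audio is ignored in this toy.
    return [[float(c)] * 2 for c in code]


def toy_vocoder(mel: Mel) -> Waveform:
    return [sum(frame) / len(frame) for frame in mel]


tts_out = tts("hi", [0.1, 0.2], toy_tokenize, toy_lm, toy_flow, toy_vocoder)
clone_out = clone([0.3, 0.4], [0.1], toy_speech_tokenize, toy_flow, toy_vocoder)
```

The point of the split is that voice cloning reuses the flow and vocoder stages unchanged and only swaps the source of `wav_code`.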
Colab notebook:
Step-Audio TTS, from Step-Audio (speech decoder)
Step1 LM 3B + flow (code from CosyVoice) + HiFT (code from CosyVoice)
speech tokenizer
a dual-codebook speech tokenizer framework, similar to ARCON (from stepfun team);
the linguistic tokenizer uses the FunASR Paraformer (NAR) model;
the semantic tokenizer uses the CosyVoice speech tokenizer (from SenseVoice)
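The description does not say how the two codebooks are merged into one token stream. One plausible scheme is chunked interleaving, sketched below; the function name `interleave_codebooks`, the 2:3 group sizes, and the 1024 id offset are all assumptions chosen for illustration, not values taken from this PR.

```python
from typing import List


def interleave_codebooks(linguistic: List[int], semantic: List[int],
                         n_ling: int = 2, n_sem: int = 3,
                         sem_offset: int = 1024) -> List[int]:
    """Merge two token streams by alternating fixed-size groups.

    Semantic ids are shifted by `sem_offset` so the two codebooks occupy
    disjoint id ranges in the merged stream (assumed layout).
    """
    merged: List[int] = []
    li = si = 0
    while li < len(linguistic) or si < len(semantic):
        merged.extend(linguistic[li:li + n_ling])
        li += n_ling
        merged.extend(t + sem_offset for t in semantic[si:si + n_sem])
        si += n_sem
    return merged


merged = interleave_codebooks([1, 2, 3, 4], [10, 11, 12, 13, 14, 15])
```

Here `merged` is `[1, 2, 1034, 1035, 1036, 3, 4, 1037, 1038, 1039]`: two linguistic tokens, then three offset semantic tokens, repeated.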
Step1 LM 3B, distilled from Step-Audio 130B
flow (CFM)
see:
HiFT vocoder
see: