
WIP: Support for Wavenet vocoder #21

Open · wants to merge 12 commits into master

Conversation

r9y9 (Owner) commented Jan 6, 2018

  • Add script to generate training data for the WaveNet vocoder
  • Train DeepVoice3 for the WaveNet vocoder
  • Train the WaveNet vocoder
  • Add an option to synthesis.py to use the WaveNet vocoder
  • Improve quality

ref #11, r9y9/wavenet_vocoder#1

@nikitos9000

I'm just wondering, what kind of data should I pass to generate_aligned_predictions.py to produce aligned mel-predictions for WaveNet? Should these audio files be preprocessed somehow (as well as mel-spectrograms)?

r9y9 (Owner, Author) commented Mar 3, 2018

This is very much WIP so it may change in the future, but for now I use the following command:

python generate_aligned_predictions.py \
    ./checkpoints_deepvoice3_wavenet/checkpoint_step000770000.pth \
    ~/Dropbox/sp/wavenet_vocoder/data/ljspeech/ \
    --preset=presets/deepvoice3_ljspeech_wavenet.json \
    ~/Dropbox/sp/wavenet_vocoder/data/ljspeech_deepvoice3

You need to pass:

  • A model checkpoint of DeepVoice3 (or a similar model)
  • Mel-spectrograms used to generate the aligned predictions (inside ~/Dropbox/sp/wavenet_vocoder/data/ljspeech/ in my case). Raw audio is not used to generate the predictions, but it is used to make sure we have the correct time resolution (see the sketch below):
    # Make sure we have correct lengths
    assert mel_output.shape[0] * hparams.hop_size == len(wav)
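
For reference, here is a minimal sketch of the kind of trimming that makes that assertion hold. adjust_lengths is a hypothetical helper for illustration, not a function from the script:

    def adjust_lengths(mel_output, wav, hop_size):
        """Trim the predicted mel and the raw audio so that
        mel frames * hop_size == number of audio samples."""
        max_frames = len(wav) // hop_size
        mel_output = mel_output[:max_frames]        # drop extra predicted frames
        wav = wav[:mel_output.shape[0] * hop_size]  # drop the trailing partial hop
        assert mel_output.shape[0] * hop_size == len(wav)
        return mel_output, wav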

r9y9 (Owner, Author) commented Mar 3, 2018

Okay, it's still quite alpha, but it seems to have started working.

DeepVoice3_wavenet_quite_alpha_770k_for_deepvoice3_6k_for_wavenet.zip

EDIT: Trained WaveNet for 60k steps, starting from pre-trained model r9y9/wavenet_vocoder#19 (comment)

@nikitos9000

@r9y9 Yes, thanks. I ran generate_aligned_predictions.py on the DeepVoice3 ljspeech data, not on the WaveNet data, so I ran into some problems there. Now it's clear.

BTW, do you need any help with the DeepVoice3 + WaveNet experiment? I reproduced your steps, but for now it doesn't sound as good as the Baidu or Google demos (while WaveNet itself sounds very good on mels). So I'm wondering what the reason is and what we should try in order to improve it. Do you have any ideas?

r9y9 (Owner, Author) commented Mar 8, 2018

@nsmetanin Yes, I'd be happy if you could help. I also haven't gotten results as good as the Google demos. Currently I'm getting very coarse mel-spectrogram predictions with DeepVoice3, but I think we need sufficiently precise mel-spectrograms, otherwise we may end up with noisy speech. I want to try outputs_per_step=1 as mentioned in Tacotron 2, but I have an issue with that configuration (#24). Attention-based encoder/decoder models are tricky to train...

I am planning to try increasing kernel_size and the encoder/decoder channels of DeepVoice3 to make the model more expressive.
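
For reference, changes like that would go into a preset JSON of the kind passed via --preset above. The key names below follow the usual DeepVoice3 hyperparameter names but should be double-checked against hparams.py, and the values are only illustrative:

    {
        "outputs_per_step": 1,
        "kernel_size": 5,
        "encoder_channels": 512,
        "decoder_channels": 512
    }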

@nikitos9000

Also, there are parameters that should match between the DeepVoice3 output and the WaveNet input, like the preemphasis value, rescaling, and others. It wasn't clearly stated in those papers what we should use, so I just want to try some combinations.

For example, if you trained WaveNet with rescaling=True and feed it predictions from a DeepVoice3 that was trained with rescaling=False, it will sound awful. Disabling preemphasis makes DeepVoice3 itself sound much worse, so that could be a problem too. I want to try enabling preemphasis on the mels for both DV3 and WV, and train WV to produce raw audio from mels with preemphasis.
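
For what it's worth, a quick consistency check between the two preset files can catch these mismatches early. This is just a sketch assuming both repos use JSON presets; the key names are illustrative and should be checked against each repo's hparams:

    import json

    # Audio parameters that should agree between the DeepVoice3 and WaveNet presets.
    SHARED_KEYS = ["sample_rate", "num_mels", "hop_size", "preemphasis",
                   "rescaling", "rescaling_max"]

    def check_presets(dv3_preset_path, wavenet_preset_path):
        with open(dv3_preset_path) as f:
            dv3 = json.load(f)
        with open(wavenet_preset_path) as f:
            wn = json.load(f)
        for key in SHARED_KEYS:
            if dv3.get(key) != wn.get(key):
                print("Mismatch on %s: DeepVoice3=%r vs WaveNet=%r"
                      % (key, dv3.get(key), wn.get(key)))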

ilyalasy (Contributor) commented May 7, 2020

Sorry, I can't quite work out what generate_aligned_predictions.py does. Can you clarify a bit?
Do I need to train WaveNet on the original mels generated by the WaveNet preprocessing?
Or can I use the mels generated by the DeepVoice3 preprocessing?
If I need to use WaveNet's preprocessing, which parameters should I copy so that they match DeepVoice3's?
P.S. I'm trying to train both models on my own dataset (not English).
P.P.S. Sorry for the silly questions :D
