Question regarding pitch models (Reflow vs DDPM) #193

Open
ariikamusic opened this issue Jun 4, 2024 · 3 comments

Comments

@ariikamusic

Hello,

I have a question regarding the current pitch models, specifically the differences between Reflow and DDPM. With the latest update, it seems like Reflow has become the new default and recommended setting for training acoustic and variance models. While Reflow is very fast (faster than DDPM), the speed appears to come at the cost of quality.

I've conducted multiple experiments with my dataset of three speakers (a soprano, a mezzo-soprano, and a tenor), each with approximately three hours of Japanese singing data, using the multispeaker method. Unfortunately, the experiments using Reflow for the pitch models have been inconsistent in my experience. The speakers are all very expressive and stylized in their singing, which is rarely reflected in the results. I've tried different batch sizes, maximum steps, step sizes, and switched between L1 and L2 loss functions, but none of these adjustments have produced the desired results. Specifically, I find that Reflow does not accurately replicate the singers' styles. The resulting F0 is relatively flat, with little variation or randomness, and the singing style feels "safe" with minimal vibrato, even when the singer uses vibrato frequently.

On the other hand, experiments using DDPM have yielded much clearer and more accurate results, better replicating the singers' styles. It seems to me that DDPM trains more carefully compared to Reflow.

My question is: What could be the reason for this difference in results between these two diffusion types? Might DDPM be more suited for highly stylized and random singing, especially when using L2 loss for bigger outliers? Is Reflow more suited for singing that is less random?

Thank you in advance.

@yqzhishen
Member

In our experiments, Reflow outperforms DDPM by a wide margin on all types of datasets, especially expressive ones. Furthermore, Reflow tolerates lower-quality (automatic) labels and handles more data and speakers. Your case is therefore unexpected, and other causes should be ruled out before blaming Reflow itself.

Many factors influence pitch performance, such as your training steps, your labels, your combination of variance modules, your choice of speedup/steps, or even your method of testing. For research purposes, I recommend reading the accuracy metrics and validation plots on TensorBoard, or using the CLI inference script in this repository. (There have been cases where someone loaded a multi-speaker pitch model into OpenUTAU with a misconfigured YAML file, and the software produced wrong results without reporting any error.)

Therefore, if you still cannot figure out the reason, please provide more details, for example:

  • Your configuration file
  • The accuracy metrics and validation plots on TensorBoard for each experiment
  • Whether you really controlled all other variables
  • How you ran the tests above
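To make an impression like "flat F0 with minimal vibrato" comparable across runs, it can help to reduce each predicted pitch curve to a single number. The following is only a minimal sketch and not part of this repository: the function names, the moving-average window of 25 frames, and the 440 Hz reference are illustrative assumptions. It measures how far the voiced frames of an F0 curve deviate from their local average, in cents.

```python
import math
from statistics import mean, pstdev


def hz_to_cents(f0_hz, ref_hz=440.0):
    """Convert F0 values in Hz to cents relative to ref_hz.

    Frames with F0 <= 0 are treated as unvoiced and mapped to None.
    """
    return [1200.0 * math.log2(f / ref_hz) if f > 0 else None for f in f0_hz]


def vibrato_extent_cents(f0_hz, window=25):
    """Standard deviation (in cents) of F0 around its centered moving average.

    A value near zero means the curve follows its local average closely
    (a "flat" rendition); larger values mean more vibrato or pitch wobble.
    The window length is a hypothetical default and should be chosen
    relative to the frame rate of the curves being compared.
    """
    cents = hz_to_cents(f0_hz)
    half = window // 2
    residuals = []
    for i, value in enumerate(cents):
        if value is None:
            continue  # skip unvoiced frames
        start = max(0, i - half)
        neighbours = [c for c in cents[start:i + half + 1] if c is not None]
        residuals.append(value - mean(neighbours))
    return pstdev(residuals) if residuals else 0.0
```

Computing this value for the ground-truth curve, the Reflow prediction, and the DDPM prediction of the same validation phrase gives three numbers that can be compared directly, provided all three curves use the same frame rate and window.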

@ariikamusic
Author

ariikamusic commented Jun 5, 2024

Hello.

Thank you for your response, it is much appreciated.

After running more experiments and comparing the results with command-line inference, I found that Reflow outperforms DDPM by a wide margin. For some reason, the results are very different when they are generated in OpenUTAU, even though I made sure the OpenUTAU config was set up correctly. I wonder why that is. My apologies for blaming Reflow at first, when the issue is most likely in OpenUTAU or in the ONNX export.

Thank you in advance.

@yqzhishen
Member

yqzhishen commented Jun 5, 2024

One possible debugging method is to freeze a single speaker into the model and test it in OpenUTAU. OpenUTAU has run into problems with multi-speaker models many times before. Some of those bugs produced results that seemed okay even though the model was not running correctly at all.

Also, an ONNX bug is unlikely if you exported the model successfully with PyTorch 1.13, because other people are using multi-speaker pitch models in OpenUTAU and getting reasonable results.

You may still need to check the configuration carefully. OpenUTAU has many undefined behaviors that can break the results without reporting any error, and you get the right outputs only if you do everything exactly as it expects.
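Because such failures are silent, one way to spot them is to compare the pitch curve from command-line inference with the curve obtained from OpenUTAU for the same phrase. The sketch below assumes both curves have already been extracted as lists of F0 values in Hz on the same frame grid. How to obtain them is not covered here, and the function names and the 10-cent tolerance are hypothetical choices rather than anything defined by this repository or by OpenUTAU.

```python
import math


def f0_diff_cents(f0_a, f0_b):
    """Per-frame difference in cents for frames that are voiced in both curves."""
    if len(f0_a) != len(f0_b):
        raise ValueError("curves must be sampled on the same frame grid")
    return [
        1200.0 * math.log2(a / b)
        for a, b in zip(f0_a, f0_b)
        if a > 0 and b > 0  # frames with F0 <= 0 are treated as unvoiced
    ]


def summarize_divergence(f0_a, f0_b, tolerance_cents=10.0):
    """Summarize how far two F0 curves diverge.

    tolerance_cents is an illustrative threshold, not a recommended value.
    """
    diffs = [abs(d) for d in f0_diff_cents(f0_a, f0_b)]
    if not diffs:
        return {
            "frames": 0,
            "mean_abs_cents": 0.0,
            "max_abs_cents": 0.0,
            "frac_over_tolerance": 0.0,
        }
    over = sum(1 for d in diffs if d > tolerance_cents)
    return {
        "frames": len(diffs),
        "mean_abs_cents": sum(diffs) / len(diffs),
        "max_abs_cents": max(diffs),
        "frac_over_tolerance": over / len(diffs),
    }
```

Some difference between two sampling runs of a diffusion-style model is to be expected, so a small divergence is not by itself a sign of trouble. A large and systematic divergence for one speaker but not another would point toward the speaker configuration rather than the model.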
