
RVC (Retrieval‐based Voice Conversion)

erew123 edited this page Dec 21, 2024 · 9 revisions

RVC enhances TTS by replicating voice characteristics for characters or narrators, adding depth to synthesized speech. It functions as a TTS-to-TTS pipeline and can be used with any TTS engine/model. For optimal performance, it's recommended to use a voice cloning TTS engine like Coqui XTTS with voice samples.

Setup

You will first need to enable RVC in the Global Settings > RVC Settings tab and click the Update RVC Settings button. AllTalk will then create the necessary folders and download any missing model files required for RVC to work.


Voice Model Files

  • Store voice models in the /models/rvc_voices/{subfolder} directory in their own individual subfolder. The rvc_voices folder is created when RVC is enabled in the Gradio interface.
  • A voice model typically includes a PTH (.pth) file and potentially an index (.index) file.
  • If an index file is present, AllTalk will automatically select and use it.
  • If multiple index files are found, none will be used, and a message will be output to the console.
  • You can find 100,000+ pre-generated RVC voice models on sites like voice-models.com and Hugging Face.
  • There is currently no RVC voice model creation within AllTalk; it is on the TODO list and is expected in the next release.

📁 models
└── 📁 rvc_voices
    ├── 📁 voice_model_1
    │   ├── model.pth
    │   └── model.index
    └── 📁 voice_model_2
        ├── model.pth
        └── model.index
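The selection rules above (use the folder's .pth model, use a lone index file automatically, and skip index files entirely when more than one is present) can be sketched as follows. This is an illustrative sketch, not AllTalk's actual code:

```python
from pathlib import Path

def discover_rvc_voice(voice_dir: str):
    """Sketch of the index-selection rules: use the folder's .pth model,
    use a single .index file if exactly one exists, and ignore index
    files altogether if several are found."""
    folder = Path(voice_dir)
    pth_files = sorted(folder.glob("*.pth"))
    index_files = sorted(folder.glob("*.index"))
    if not pth_files:
        raise FileNotFoundError(f"No .pth model found in {folder}")
    model = pth_files[0]
    if len(index_files) == 1:
        index = index_files[0]          # exactly one index: use it automatically
    else:
        if len(index_files) > 1:        # multiple indexes: use none, warn on console
            print(f"Multiple index files found in {folder}; none will be used.")
        index = None
    return model, index
```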

Purpose of the Index File

The index file helps improve the quality of the generated audio by providing a reference during the conversion process. The FAISS index enables faster and more accurate retrieval of voice characteristics, leading to more natural and high-quality voice synthesis.
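Conceptually, the index stores feature vectors extracted from the training audio, and each frame of the input is matched against its nearest stored neighbour during conversion. A brute-force NumPy search can stand in for FAISS to illustrate the idea (real RVC searches HuBERT/ContentVec features via FAISS; the shapes here are arbitrary):

```python
import numpy as np

# "Index": feature vectors captured from the target voice's training audio.
rng = np.random.default_rng(0)
stored_features = rng.normal(size=(1000, 8))

# One frame of features from the audio being converted.
query = rng.normal(size=(8,))

# Nearest-neighbour lookup -- the operation FAISS accelerates.
distances = np.linalg.norm(stored_features - query, axis=1)
nearest = stored_features[np.argmin(distances)]  # best-matching training frame
```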

Model Caching

AllTalk implements LRU (Least Recently Used) caching for RVC models and embedders to optimize performance. The system caches up to 3 voice models in memory, automatically removing the least recently used model when loading a new one. This means:

  • Frequently used voice models stay in memory, reducing load times
  • When a fourth model is loaded, the least recently used model is unloaded
  • Embedder models (hubert/contentvec) are cached separately

This caching system helps balance memory usage with performance, particularly beneficial when using the same voices repeatedly in a session.
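The eviction behaviour described above can be sketched with an ordered dictionary. This is illustrative only, not AllTalk's actual implementation:

```python
from collections import OrderedDict

class LRUModelCache:
    """Keep at most `capacity` models in memory, evicting the least
    recently used entry when a new model is loaded."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, name, loader):
        if name in self._cache:
            self._cache.move_to_end(name)   # cache hit: mark most recently used
            return self._cache[name]
        model = loader(name)                # cache miss: load from disk
        self._cache[name] = model
        if len(self._cache) > self.capacity:
            evicted, _ = self._cache.popitem(last=False)  # drop the LRU entry
            print(f"Unloaded least recently used model: {evicted}")
        return model
```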

RVC Settings

Default Character Voice Model

  • Selects the voice model used for character conversion.
  • If "Disabled" is selected, RVC will not be applied to character voices.
  • This option is used only if RVC is enabled and no other voice is specified in the API request.

Default Narrator Voice Model

  • Selects the voice model used for narrator conversion.
  • If "Disabled" is selected, RVC will not be applied to the narrator voice.
  • This option is used only if RVC is enabled and no other voice is specified in the API request.
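Because the defaults only apply when no voice is specified in the API request, a per-call override might look like the payload below. The field names (`rvccharacter_voice_gen`, `rvcnarrator_voice_gen`) and the endpoint are assumptions based on AllTalk's `/api/tts-generate` API; check the API documentation for your install before relying on them:

```python
# Hypothetical payload overriding the default RVC voices for one request.
payload = {
    "text_input": "Hello there, traveller.",
    "rvccharacter_voice_gen": "voice_model_1/model.pth",  # assumed field name
    "rvcnarrator_voice_gen": "voice_model_2/model.pth",   # assumed field name
}

# Sending it (commented out so the sketch stays self-contained):
# import requests
# requests.post("http://127.0.0.1:7851/api/tts-generate", data=payload)
```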

Index Influence Ratio

  • Sets the influence exerted by the index file on the final output.
  • A higher value increases the impact of the index, potentially enhancing detail but also increasing the risk of artifacts.
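In upstream RVC this setting (`index_rate`) acts as a linear interpolation weight between the input audio's own features and the features retrieved from the index; a minimal sketch, treating that formula as an assumption about AllTalk's behaviour:

```python
import numpy as np

def blend_features(original, retrieved, index_rate):
    """Linear blend: 0.0 ignores the index entirely,
    1.0 uses only the retrieved features."""
    return index_rate * retrieved + (1.0 - index_rate) * original

original = np.array([1.0, 1.0, 1.0])   # features of the input audio
retrieved = np.array([3.0, 3.0, 3.0])  # nearest features from the index
blend_features(original, retrieved, 0.5)   # midway between the two
```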

Pitch

  • Sets the pitch of the audio output.
  • Increasing the value raises the pitch, while decreasing the value lowers it.
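Assuming the setting is expressed in semitones, as in upstream RVC, each step corresponds to a fixed frequency multiplier:

```python
def semitone_factor(semitones: float) -> float:
    """Frequency multiplier for a pitch shift of n semitones
    (12 semitones = one octave = doubling the frequency)."""
    return 2.0 ** (semitones / 12.0)

semitone_factor(12)    # one octave up   -> 2.0
semitone_factor(-12)   # one octave down -> 0.5
```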

Volume Envelope

  • Blends the volume envelope of the input audio with that of the converted output.
  • A ratio closer to 1 means the converted output's own envelope is used more heavily.
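Upstream RVC implements this by scaling the converted audio by the ratio of input to output loudness raised to `1 - rate`; a conceptual sketch, treating the exact formula as an assumption about AllTalk:

```python
import numpy as np

def mix_envelope(output_audio, rms_in, rms_out, rate):
    """rate=1.0 keeps the converted output's own envelope unchanged;
    rate=0.0 fully imposes the input's loudness contour."""
    return output_audio * (rms_in / np.maximum(rms_out, 1e-8)) ** (1.0 - rate)
```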

Protect Voiceless Consonants/Breath Sounds

  • Prevents artifacts in voiceless consonants and breath sounds.
  • Higher values (up to 0.5) provide stronger protection but might affect indexing.

AutoTune

  • Enables or disables auto-tune for the generated audio.
  • Recommended for singing conversions to ensure the output remains in tune.

Filter Radius

  • If set to 3 or higher, median filtering is applied to the extracted pitch (F0) values; this smooths the pitch contour and can reduce breathy artifacts in the output.
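A sliding median over the pitch track shows the effect: isolated spikes (a common source of breathy artifacts) get smoothed away. A pure-Python sketch:

```python
from statistics import median

def median_filter_f0(f0, radius):
    """Replace each pitch value with the median of its neighbourhood
    when radius >= 3; smaller radii leave the track unchanged."""
    if radius < 3:
        return list(f0)
    half = radius // 2
    return [median(f0[max(0, i - half): i + half + 1]) for i in range(len(f0))]

median_filter_f0([100, 100, 300, 100, 100], 3)   # the spike at index 2 is smoothed out
```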

Training Data Size (AllTalk Specific)

  • Determines the number of training data points used to train the FAISS index.
  • Increasing the size may improve the quality of the output but can also increase computation time.
  • Different index files have different sizes. This setting limits the maximum amount of the index used.

Embedder Model

  • Select between different models for learning speaker embedding.
  • Options:
    • hubert: Focuses on capturing phonetic and linguistic content.
    • contentvec: Captures more detailed voice characteristics and nuances.

Split Audio

  • Splits the audio into chunks for inference to obtain better results in some cases.
  • Can improve the quality of conversion, especially for longer audio inputs.
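The idea can be sketched as fixed-length chunking: convert each piece independently, then concatenate. (Real implementations typically split on silences instead, to avoid cutting through words; this simplified version is for illustration only.)

```python
def split_audio(samples, chunk_seconds=10, sample_rate=16000):
    """Split a long input into fixed-length chunks for independent inference."""
    size = chunk_seconds * sample_rate
    return [samples[i:i + size] for i in range(0, len(samples), size)]

chunks = split_audio(list(range(35 * 16000)), chunk_seconds=10)
[len(c) // 16000 for c in chunks]   # -> [10, 10, 10, 5] seconds per chunk
```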

Pitch Extraction Algorithm

  • Choose the algorithm used for extracting the pitch (F0) during audio conversion.
  • Options include:
    • crepe: High accuracy, robust against noise.
    • crepe-tiny: Smaller, faster version of crepe with slightly reduced accuracy.
    • dio: Fast, less accurate, suitable for real-time applications.
    • fcpe: Focuses on precise pitch extraction.
    • harvest: Produces smooth and natural pitch contours.
    • hybrid[rmvpe+fcpe]: Combines strengths of rmvpe and fcpe.
    • pm: Robust algorithm with a balance of speed and accuracy.
    • rmvpe: Recommended for most cases, especially in TTS applications.