A neural adapter that converts LLaMA embeddings directly into speech using F5-TTS. This project enables text-to-speech generation by bridging large language models with voice synthesis.
This project implements an adapter architecture that does the following (see the code sketch after this list):
- Takes embeddings from the LLaMA language model
- Processes them through a specialized transformer-based adapter
- Generates mel spectrograms compatible with F5-TTS
- Synthesizes high-quality speech output
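In code, that pipeline reduces to a module that projects LLaMA hidden states into a working dimension, runs them through a transformer, and emits mel frames. The sketch below is illustrative only: the name `EmbeddingAdapter` and all dimensions are assumptions, and the actual adapter in `adapter/adapter.py` uses DiT-style blocks rather than a plain `nn.TransformerEncoder`.

```python
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Minimal sketch: LLaMA hidden states -> F5-TTS-style mel frames."""

    def __init__(self, llama_dim=4096, model_dim=512, mel_dim=100, num_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(llama_dim, model_dim)   # project LLaMA embeddings down
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mel_head = nn.Linear(model_dim, mel_dim)    # one mel frame per position

    def forward(self, llama_embeddings):                 # (batch, seq, llama_dim)
        x = self.in_proj(llama_embeddings)
        x = self.backbone(x)
        return self.mel_head(x)                          # (batch, seq, mel_dim)
```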
Key features (a usage illustration follows the list):
- Direct conversion from LLaMA embeddings to speech
- Enhanced transformer architecture with DiT-style blocks
- Multi-scale feature matching and temporal consistency
- Voice cloning capabilities through F5-TTS integration
- Continuous learning support for voice profiles
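As a usage illustration of the voice-profile features, here is a hypothetical snippet. Every name in it (`F5TTSService`, `register_voice`, `synthesize`) is a placeholder, not the actual interface exposed by `f5/f5tts.py`:

```python
# Hypothetical API -- every name below is a placeholder, not the repo's real interface.
from f5.f5tts import F5TTSService

tts = F5TTSService(checkpoint="weights/f5tts.pt")             # assumed constructor
tts.register_voice("alice", reference_audio="alice_ref.wav")  # assumed cloning call
audio = tts.synthesize("Hello from LLaMA!", voice="alice")    # assumed synthesis call
```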
Requirements:
- Python 3.8+
- PyTorch 2.0+
- Transformers
- torchaudio
- sounddevice
- pydub
Clone the repository:

```bash
git clone https://github.com/peytontolbert/llama-to-speech-adapter
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Download the model weights (instructions for obtaining the F5-TTS weights and the LLaMA model).
To train the adapter, run:

```bash
python train.py
```
This will:
- Generate a dataset of LLaMA embeddings paired with mel spectrograms (see the sketch after this list)
- Train the adapter model
- Save checkpoints during training
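For the dataset step, the idea is to pair the LLaMA hidden states for a transcript with the mel spectrogram of the matching audio. A minimal sketch, in which the model id, sample rate, and mel settings are assumptions (24 kHz and 100 mel bins are common F5-TTS defaults), not necessarily what `scripts/generate_dataset.py` uses:

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
llm = AutoModel.from_pretrained(MODEL_ID).eval()
mel_fn = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_mels=100)

def make_pair(text, wav_path):
    """Return (LLaMA embeddings, target mel) for one transcript/audio pair."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        emb = llm(**ids).last_hidden_state        # (1, seq, hidden)
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 24000)
    mel = mel_fn(wav)                             # (channels, n_mels, frames)
    return emb.squeeze(0), mel.mean(dim=0).T      # (seq, hidden), (frames, n_mels)
```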
To generate speech with the trained adapter, run:

```bash
python run.py
```
This will:
- Load the trained adapter
- Generate speech from text using the LLaMA-to-Speech pipeline (sketched below)
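Put together, inference follows a load-embed-adapt-vocode pattern. A hedged sketch reusing the `EmbeddingAdapter` class from the earlier sketch; the checkpoint path, model id, and final vocoder call are assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
llm = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

adapter = EmbeddingAdapter()  # sketch class from above
adapter.load_state_dict(torch.load("checkpoints/adapter.pt", map_location="cpu"))
adapter.eval()

ids = tok("Text to speak aloud.", return_tensors="pt")
with torch.no_grad():
    emb = llm(**ids).last_hidden_state  # LLaMA embeddings, (1, seq, hidden)
    mel = adapter(emb)                  # mel frames, (1, frames, n_mels)
# audio = vocoder(mel)                  # hand mels to the F5-TTS vocoder (interface assumed)
```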
Project structure:

```
├── adapter/
│   └── adapter.py           # Enhanced embedding adapter implementation
├── f5/
│   ├── f5tts.py             # F5-TTS service and voice profile management
│   ├── dit.py               # DiT model implementation
│   ├── cfm.py               # CFM model implementation
│   ├── modules.py           # F5-TTS modules
│   ├── utils_infer.py       # TTS utilities for inference
│   └── utils.py             # TTS utilities
├── scripts/
│   └── generate_dataset.py  # Script to generate the training dataset
├── train.py                 # Training script
├── run.py                   # Inference script
├── requirements.txt         # Dependencies
└── README.md
```
The adapter uses a transformer-based architecture with the following components; a DiT block sketch follows the list:
- DiT-style blocks for processing embeddings
- Relative positional embeddings
- Multi-scale feature matching
- Temporal consistency preservation
- Mel spectrogram range normalization
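The defining trick of a DiT-style block is adaptive layer norm: a conditioning vector produces per-block scale, shift, and gate terms that modulate the attention and MLP sublayers. A minimal sketch of that mechanism; dimensions and the conditioning source are assumptions, and the real implementation in `f5/dit.py` will differ in detail:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaptive LayerNorm (AdaLN) modulation -- sketch only."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)  # conditioning -> scale/shift/gate per sublayer

    def forward(self, x, cond):  # x: (batch, seq, dim), cond: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1  # AdaLN before self-attention
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2  # AdaLN before the MLP
        return x + g2 * self.mlp(h)
```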
The model is trained with two loss components, combined as sketched below:
- Duration Loss
- Mel Loss
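A minimal sketch of how the two terms might be combined; the L1/MSE choices and the 0.1 weighting are assumptions, not necessarily the values in `train.py`:

```python
import torch.nn.functional as F

def adapter_loss(pred_mel, target_mel, pred_dur, target_dur, dur_weight=0.1):
    """Combined training objective -- sketch, weighting is an assumption."""
    mel_loss = F.l1_loss(pred_mel, target_mel)   # frame-level mel reconstruction
    dur_loss = F.mse_loss(pred_dur, target_dur)  # predicted vs. reference durations
    return mel_loss + dur_weight * dur_loss
```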
Acknowledgments:
- F5-TTS for the text-to-speech foundation
- Meta AI for the LLaMA language model
Contributions are welcome! Please feel free to submit a Pull Request.