A neural adapter that converts LLaMA embeddings directly into speech using F5-TTS. This project enables text-to-speech generation by bridging large language models with voice synthesis.
This project implements an adapter architecture that does the following (see the code sketch after this list):
- Takes embeddings from the LLaMA language model
- Processes them through a specialized transformer-based adapter
- Generates mel spectrograms compatible with F5-TTS
- Synthesizes high-quality speech output
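In code, that pipeline reduces to a module that projects LLaMA hidden states into a working dimension, runs them through a transformer, and emits mel frames. The sketch below is illustrative only: the name `EmbeddingAdapter` and all dimensions are assumptions, and the actual adapter in `adapter/adapter.py` uses DiT-style blocks rather than a plain `nn.TransformerEncoder`.

```python
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Minimal sketch: LLaMA hidden states -> F5-TTS-style mel frames."""

    def __init__(self, llama_dim=4096, model_dim=512, mel_dim=100, num_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(llama_dim, model_dim)   # project LLaMA embeddings down
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mel_head = nn.Linear(model_dim, mel_dim)    # one mel frame per position

    def forward(self, llama_embeddings):                 # (batch, seq, llama_dim)
        x = self.in_proj(llama_embeddings)
        x = self.backbone(x)
        return self.mel_head(x)                          # (batch, seq, mel_dim)
```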
Key features (a usage illustration follows the list):
- Direct conversion from LLaMA embeddings to speech
- Enhanced transformer architecture with DiT-style blocks
- Multi-scale feature matching and temporal consistency
- Voice cloning capabilities through F5-TTS integration
- Continuous learning support for voice profiles
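As a usage illustration of the voice-profile features, here is a hypothetical snippet. Every name in it (`F5TTSService`, `register_voice`, `synthesize`) is a placeholder, not the actual interface exposed by `f5/f5tts.py`:

```python
# Hypothetical API -- every name below is a placeholder, not the repo's real interface.
from f5.f5tts import F5TTSService

tts = F5TTSService(checkpoint="weights/f5tts.pt")             # assumed constructor
tts.register_voice("alice", reference_audio="alice_ref.wav")  # assumed cloning call
audio = tts.synthesize("Hello from LLaMA!", voice="alice")    # assumed synthesis call
```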
Requirements:
- Python 3.8+
- PyTorch 2.0+
- Transformers
- torchaudio
- sounddevice
- pydub
Clone the repository:

```bash
git clone https://github.com/peytontolbert/llama-to-speech-adapter
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Download the model weights (instructions for obtaining the F5-TTS weights and the LLaMA model).
To train the adapter, run:

```bash
python train.py
```
This will:
- Generate a dataset of LLaMA embeddings paired with mel spectrograms (see the sketch after this list)
- Train the adapter model
- Save checkpoints during training
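For the dataset step, the idea is to pair the LLaMA hidden states for a transcript with the mel spectrogram of the matching audio. A minimal sketch, in which the model id, sample rate, and mel settings are assumptions (24 kHz and 100 mel bins are common F5-TTS defaults), not necessarily what `scripts/generate_dataset.py` uses:

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
llm = AutoModel.from_pretrained(MODEL_ID).eval()
mel_fn = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_mels=100)

def make_pair(text, wav_path):
    """Return (LLaMA embeddings, target mel) for one transcript/audio pair."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        emb = llm(**ids).last_hidden_state        # (1, seq, hidden)
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 24000)
    mel = mel_fn(wav)                             # (channels, n_mels, frames)
    return emb.squeeze(0), mel.mean(dim=0).T      # (seq, hidden), (frames, n_mels)
```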
To generate speech with the trained adapter, run:

```bash
python run.py
```
This will:
- Load the trained adapter
- Generate speech from text using the LLaMA-to-Speech pipeline (sketched below)
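Put together, inference follows a load-embed-adapt-vocode pattern. A hedged sketch reusing the `EmbeddingAdapter` class from the earlier sketch; the checkpoint path, model id, and final vocoder call are assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
llm = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

adapter = EmbeddingAdapter()  # sketch class from above
adapter.load_state_dict(torch.load("checkpoints/adapter.pt", map_location="cpu"))
adapter.eval()

ids = tok("Text to speak aloud.", return_tensors="pt")
with torch.no_grad():
    emb = llm(**ids).last_hidden_state  # LLaMA embeddings, (1, seq, hidden)
    mel = adapter(emb)                  # mel frames, (1, frames, n_mels)
# audio = vocoder(mel)                  # hand mels to the F5-TTS vocoder (interface assumed)
```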
Project structure:

```
├── adapter/
│   └── adapter.py           # Enhanced embedding adapter implementation
├── f5/
│   ├── f5tts.py             # F5-TTS service and voice profile management
│   ├── dit.py               # DiT model implementation
│   ├── cfm.py               # CFM model implementation
│   ├── modules.py           # F5-TTS modules
│   ├── utils_infer.py       # TTS utilities for inference
│   └── utils.py             # TTS utilities
├── scripts/
│   └── generate_dataset.py  # Script to generate the training dataset
├── train.py                 # Training script
├── run.py                   # Inference script
├── requirements.txt         # Dependencies
└── README.md
```
The adapter uses a transformer-based architecture with the following components; a DiT block sketch follows the list:
- DiT-style blocks for processing embeddings
- Relative positional embeddings
- Multi-scale feature matching
- Temporal consistency preservation
- Mel spectrogram range normalization
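The defining trick of a DiT-style block is adaptive layer norm: a conditioning vector produces per-block scale, shift, and gate terms that modulate the attention and MLP sublayers. A minimal sketch of that mechanism; dimensions and the conditioning source are assumptions, and the real implementation in `f5/dit.py` will differ in detail:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaptive LayerNorm (AdaLN) modulation -- sketch only."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)  # conditioning -> scale/shift/gate per sublayer

    def forward(self, x, cond):  # x: (batch, seq, dim), cond: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1  # AdaLN before self-attention
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2  # AdaLN before the MLP
        return x + g2 * self.mlp(h)
```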
The model is trained with two loss components, combined as sketched below:
- Duration Loss
- Mel Loss
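A minimal sketch of how the two terms might be combined; the L1/MSE choices and the 0.1 weighting are assumptions, not necessarily the values in `train.py`:

```python
import torch.nn.functional as F

def adapter_loss(pred_mel, target_mel, pred_dur, target_dur, dur_weight=0.1):
    """Combined training objective -- sketch, weighting is an assumption."""
    mel_loss = F.l1_loss(pred_mel, target_mel)   # frame-level mel reconstruction
    dur_loss = F.mse_loss(pred_dur, target_dur)  # predicted vs. reference durations
    return mel_loss + dur_weight * dur_loss
```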
Acknowledgments:
- F5-TTS for the text-to-speech foundation
- Meta AI for the LLaMA language model
Contributions are welcome! Please feel free to submit a Pull Request.