This repository implements a diffusion-based generative model for synthesizing natural speech from ECoG recordings of human brain activity. It supports pretraining of unconditional or class-conditional speech generators with subsequent fine-tuning on brain recordings, or fully end-to-end training on brain recordings. The diffusion model used is DiffWave ("DiffWave: A Versatile Diffusion Model for Audio Synthesis"), combined with different encoder models for encoding brain activity inputs or class labels. Originally, this repository started from albertfgu's implementation of DiffWave.
Speech samples generated for all models are provided here.
The VariaNTS dataset that was used as the speech dataset for this research can be downloaded here.
You have to rename the incorrectly labelled 'föhn.wav' file for speaker p01 (found in `p01/p01_words/`) to 'fohn.wav', matching the naming used for the other speakers, to ensure that the data processing functions correctly.
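If you prefer to script the rename, a small helper along these lines would do it. This is a sketch, not part of the repository; the directory layout is taken from the description above, and the function name is hypothetical:

```python
from pathlib import Path

def fix_fohn_label(dataset_root: str) -> bool:
    """Rename the mislabeled 'föhn.wav' for speaker p01 to 'fohn.wav'.

    Returns True if the file was found and renamed, False otherwise
    (e.g. if it was already renamed or the root path is wrong).
    """
    src = Path(dataset_root) / "p01" / "p01_words" / "föhn.wav"
    if not src.exists():
        return False
    src.rename(src.with_name("fohn.wav"))
    return True
```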
All training functionality is implemented in the `Learner` class (see `learner.py`). To train a model, call the `train.py` script, which loads the configurations and runs the `Learner`, and allows for single-GPU or distributed training.
There are five important configuration paradigms: the model, the dataset, the diffusion process, the training controls, and the generation controls.
The first three are collected in experiments for easy reproducibility. The respective configurations can be found in the `configs/experiment/` directory. Each experiment config needs to link to a model and a dataset config (defined separately in the `configs/model/` and `configs/dataset/` directories, respectively) as well as define the parameters for the diffusion process. It is best to think of an experiment as a recipe and of the dataset, model, and diffusion parameters as its ingredients: ingredients are reused between experiments, but combined differently. If you want to define a new experiment, you can reuse existing ingredients or define new ones.
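As an illustration, a hypothetical experiment config could look as follows. The file name, the linked model and dataset configs, and all parameter names and values below are assumptions for illustration, not actual configs from this repository:

```yaml
# configs/experiment/my_exp_config.yaml (hypothetical sketch)
defaults:
  - /model: my_model_config      # reuses a config from configs/model/
  - /dataset: my_dataset_config  # reuses a config from configs/dataset/

diffusion:
  T: 200          # number of diffusion steps (assumed value)
  beta_0: 0.0001  # start of the noise schedule (assumed value)
  beta_T: 0.02    # end of the noise schedule (assumed value)
```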
All of the default config values that are loaded when calling the `train.py` script are defined in `configs/config.yaml`, with explanations about their effect. To overwrite values, simply pass them as command line arguments. For example, to change the experiment, pass `experiment=my_exp_config`; to change the name of the training run, pass `train.name=Another-Run-v3`.
Note: Hydra uses a hierarchical configuration system. This means that you need to prepend the command line argument with the appropriate section name, e.g. `train.ckpt_epoch=300` or `generate.n_samples=6`. But since an 'experiment' is just a collection of configs, you must not write e.g. `experiment.diffusion.T=20`, but only `diffusion.T=20`. The same goes for model and dataset configs.
The repository implements logging via Weights & Biases, as well as the storage of model artefacts and model-created audio.
Before you start logging, you need to:
- set up a project on W&B,
- log in to W&B in your terminal, and
- change the `project` and `entity` entries in the `wandb` section of the `configs/config.yaml` file accordingly. This can be done once in your local copy of the repository, so that you don't need to pass them every time you call the train script.
There are additional configuration options to control the W&B logger:
- `wandb.mode`: `disabled` by default. Pass `wandb.mode=online` to activate logging to W&B. This will prompt you to log in to W&B if you haven't done so.
- To continue logging to a previously tracked run, obtain the run's ID from your project page on W&B (it is part of the URL, not the run's name) and pass `wandb.id=<run-id>` as well as `+wandb.resume=true` (the leading '+' sign is required because `resume` is not part of the default config).
If you want to quickly test code, you can run smaller versions of a dataset for debugging:
- When using VariaNTS words as audio data, pass `dataset.splits_path=datasplits/VariaNTS/tiny_subset/` to load only a few audio samples each epoch (assuming you have downloaded the provided datasplits; otherwise you may create your own small subset).
- Pass `diffusion.T=5` to reduce the number of diffusion steps in generation.
There is a dedicated class `Sampler` in `sampler.py` that handles generation. It needs to be provided with a diffusion and dataset config and the appropriate generation parameters on initialization, and can then be used to run the full diffusion process to generate samples from noise (both conditionally and unconditionally).
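To illustrate what "running the full diffusion process" means, here is a minimal, self-contained sketch of DDPM-style ancestral sampling on a scalar signal. This is not the repository's `Sampler` implementation; the function names and the noise schedule are assumptions for illustration only:

```python
import math
import random

def reverse_diffusion(x_T, denoise_fn, betas):
    """Generic DDPM ancestral sampling sketch (scalar case for clarity).

    x_T: the initial pure-noise sample
    denoise_fn(x, t): predicts the noise component eps at step t
    betas: noise schedule, betas[t] for t = 0..T-1
    """
    alphas = [1.0 - b for b in betas]
    # Cumulative products alpha_bar_t = prod_{s<=t} alpha_s
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = x_T
    for t in reversed(range(len(betas))):
        eps = denoise_fn(x, t)
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / math.sqrt(alphas[t])
        if t > 0:
            # Add fresh Gaussian noise at every step except the last
            x = mean + math.sqrt(betas[t]) * random.gauss(0.0, 1.0)
        else:
            x = mean
    return x
```

In the conditional case, `denoise_fn` would additionally receive the encoded class label or brain recording; passing `diffusion.T=5` for debugging simply shortens this loop.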
In this repository, there are two places where generation takes place:
- During training: the `Sampler` is initialized at the beginning of training and repeatedly runs on the updated model every `epochs_per_ckpt` epochs.
- Using the `generate.py` script: when training is finished and a model has been obtained, this script can be run on its own to obtain more outputs from the model.
Generation during training happens automatically (see `learner.py` for the implementation). Below is a description of how to run the `generate.py` script:
- Navigate to the directory in which your `exp` output directory resides, as the script uses relative paths for model loading and output saving.
- Call the script with the appropriate configuration options (see below). As for training, config defaults are loaded from the `configs/config.yaml` file and can be overwritten with command line arguments.
All configuration options can be found in the `generate` section of the `configs/config.yaml` file, but the important ones (i.e. the ones changed most frequently) are listed below:
- `experiment`: name of the experiment config (as found in `configs/experiment`) that was used during the training run. This will load the required diffusion, dataset, and model configs.
- `name`: name of the training run that created the trained model to be used for generating, which is the folder name in the `exp/` directory.
- `conditional_type`: either `class` for class-conditional sampling or `brain` for brain-conditional sampling. If the model is unconditional, this can be null (it will be ignored in any case).
- `conditional_signal`: the actual conditional input signal to use in case of conditional sampling. For a class-conditional model, it suffices to simply state the word to be generated here (e.g. `dag`). For a brain-conditional model, this should be a full (absolute) file path to the ECoG recording file on disk (e.g. `/home/user/path/to/recording/dag1.npy`).
Note: Since the `experiment` config is a top-level config, it suffices to append `experiment=...` as an argument when calling the script. All other options mentioned above are specific to generation, so `generate` has to be prepended to the argument name, e.g. `generate.name=...`, `generate.conditional_type=...`, et cetera.
```shell
python src/generate.py experiment=my-custom-experiment generate.name=Example-Model-v2 generate.conditional_type=class generate.conditional_signal=dag
```
This codebase was originally forked from the DiffWave-SaShiMi repository by albertfgu, and some inspiration was taken from LMNT's implementation of DiffWave, specifically how the model code was structured.
Below are examples of how experiments can be run. The given parameters need to be changed to suit your setup.
In case only a subset of all available GPUs should be used, prepend `CUDA_VISIBLE_DEVICES=<id>,<id>` to the call.
If debugging, add `diffusion.T=5` as a flag, which will reduce the number of diffusion steps during generation. For VariaNTS-based models, you can also append `dataset.splits_path=datasplits/VariaNTS/tiny_subset` to use only a few audio files in total.
Note: You also have to specify `generate.conditional_signal=null` for this experiment, so that the script does not load conditional input, as the default is set for brain input.
```shell
python src/train.py \
    train.name=delete-me \
    experiment=SG-U \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1 \
    generate.conditional_signal=null
```
Note: You also have to specify `generate.conditional_type=class` and `generate.conditional_signal=<word>` for this experiment to determine which word to load for intermediate generations, as the default is set for brain input.
```shell
python src/train.py \
    train.name=delete-me \
    experiment=SG-C \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1 \
    generate.conditional_type=class \
    generate.conditional_signal=dag
```
Harry Potter speech data:
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Ur \
    model.freeze_generator=false \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
VariaNTS speech data:
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Uv \
    model.freeze_generator=true \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Cv \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
| Experiment | Model | Conditional Input | Speech Data | Splits |
|---|---|---|---|---|
| Uncond. Pretraining | DiffWave | - | VariaNTS | full VariaNTS |
| Classcond. Pretraining | DiffWave + Class Encoder | Class vector | VariaNTS | full VariaNTS |
| Brainclasscond. Finetuning | DiffWave + Class Encoder + Brain Classifier | HP ECoG Data | VariaNTS | reduced VariaNTS* |
| Braincond. Finetuning (VNTS) | DiffWave + Brain Encoder | HP ECoG Data | VariaNTS | reduced VariaNTS* |
| Braincond. Finetuning (HP) | DiffWave + Brain Encoder | HP ECoG Data | HP Speech | HP splits |
* reduced VariaNTS means that words which are not present in the Harry Potter ECoG data were removed, in order to correctly map ECoG to speech data.