This repository implements a diffusion-based generative model for synthesizing natural speech from ECoG recordings of human brain activity. It supports pretraining of unconditional or class-conditional speech generators with subsequent fine-tuning on brain recordings, or fully end-to-end training on brain recordings. The diffusion model used is DiffWave ("DiffWave: A Versatile Diffusion Model for Audio Synthesis"), combined with different encoder models for encoding brain activity inputs or class labels. Originally, this repository started from albertfgu's implementation of DiffWave.
Speech samples generated for all models are provided here.
The VariaNTS dataset that was used as the speech dataset for this research can be downloaded here.
You have to rename the incorrectly labelled 'föhn.wav' file for speaker p01 (found in `p01/p01_words/`) to 'fohn.wav', matching the naming used for the other speakers, to ensure that the data processing functions correctly.
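If you prefer to script the rename, a small helper along these lines would do it. This is a sketch, not part of the repository; the directory layout is taken from the description above, and the function name is hypothetical:

```python
from pathlib import Path

def fix_fohn_label(dataset_root: str) -> bool:
    """Rename the mislabeled 'föhn.wav' for speaker p01 to 'fohn.wav'.

    Returns True if the file was found and renamed, False otherwise
    (e.g. if it was already renamed or the root path is wrong).
    """
    src = Path(dataset_root) / "p01" / "p01_words" / "föhn.wav"
    if not src.exists():
        return False
    src.rename(src.with_name("fohn.wav"))
    return True
```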
All training functionality is implemented in the `Learner` class (see `learner.py`). To train a model, call the `train.py` script, which loads the configurations and runs the `Learner`, and allows for single-GPU or distributed training.
There are five important configuration paradigms: the model, the dataset, the diffusion process, the training controls, and the generation controls.
The first three are collected in experiments for easy reproducibility. The respective configurations can be found in the `configs/experiment/` directory. Each experiment config needs to link to a model and a dataset config (defined separately in the `configs/model/` and `configs/dataset/` directories, respectively) as well as define the parameters for the diffusion process. It is best to think of an experiment as a recipe and of the dataset, model, and diffusion parameters as its ingredients: ingredients are reused between experiments, but combined differently. If you want to define a new experiment, you can reuse existing ingredients or define new ones.
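As an illustration, a hypothetical experiment config could look as follows. The file name, the linked model and dataset configs, and all parameter names and values below are assumptions for illustration, not actual configs from this repository:

```yaml
# configs/experiment/my_exp_config.yaml (hypothetical sketch)
defaults:
  - /model: my_model_config      # reuses a config from configs/model/
  - /dataset: my_dataset_config  # reuses a config from configs/dataset/

diffusion:
  T: 200          # number of diffusion steps (assumed value)
  beta_0: 0.0001  # start of the noise schedule (assumed value)
  beta_T: 0.02    # end of the noise schedule (assumed value)
```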
All of the default config values that are loaded when calling the `train.py` script are defined in `configs/config.yaml`, with explanations about their effect. To overwrite values, simply pass them as command line arguments. For example, to change the experiment, pass `experiment=my_exp_config`; to change the name of the training run, pass `train.name=Another-Run-v3`.
Note: Hydra uses a hierarchical configuration system. This means that you need to prepend the command line argument with the appropriate section name, e.g. `train.ckpt_epoch=300` or `generate.n_samples=6`. But since an 'experiment' is just a collection of configs, you must not write e.g. `experiment.diffusion.T=20`, but only `diffusion.T=20`. The same goes for model and dataset configs.
The repository implements logging via Weights & Biases, as well as the storage of model artefacts and model-created audio.
Before you start logging, you need to:
- set up a project on W&B,
- log in to W&B in your terminal, and
- change the `project` and `entity` entries in the `wandb` section of the `configs/config.yaml` file accordingly. This can be done once in your local copy of the repository, so that you don't need to pass them every time you call the train script.
There are additional configuration options to control the W&B logger:
- `wandb.mode`: `disabled` by default. Pass `wandb.mode=online` to activate logging to W&B. This will prompt you to log in to W&B if you haven't done so.
- To continue logging to a previously tracked run, obtain the run's ID from your project page on W&B (it is part of the URL, not the run's name) and pass `wandb.id=<run-id>` as well as `+wandb.resume=true` (the leading '+' sign is required because `resume` is not part of the default config).
If you want to quickly test code, you can run smaller versions of a dataset for debugging:
- When using VariaNTS words as audio data, pass `dataset.splits_path=datasplits/VariaNTS/tiny_subset/` to load only a few audio samples each epoch (assuming you have downloaded the provided datasplits; otherwise you may create your own small subset).
- Pass `diffusion.T=5` to reduce the number of diffusion steps in generation.
There is a dedicated class `Sampler` in `sampler.py` that handles generation. It needs to be provided with a diffusion and dataset config and the appropriate generation parameters on initialization, and can then be used to run the full diffusion process to generate samples from noise (both conditionally and unconditionally).
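To illustrate what "running the full diffusion process" means, here is a minimal, self-contained sketch of DDPM-style ancestral sampling on a scalar signal. This is not the repository's `Sampler` implementation; the function names and the noise schedule are assumptions for illustration only:

```python
import math
import random

def reverse_diffusion(x_T, denoise_fn, betas):
    """Generic DDPM ancestral sampling sketch (scalar case for clarity).

    x_T: the initial pure-noise sample
    denoise_fn(x, t): predicts the noise component eps at step t
    betas: noise schedule, betas[t] for t = 0..T-1
    """
    alphas = [1.0 - b for b in betas]
    # Cumulative products alpha_bar_t = prod_{s<=t} alpha_s
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = x_T
    for t in reversed(range(len(betas))):
        eps = denoise_fn(x, t)
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / math.sqrt(alphas[t])
        if t > 0:
            # Add fresh Gaussian noise at every step except the last
            x = mean + math.sqrt(betas[t]) * random.gauss(0.0, 1.0)
        else:
            x = mean
    return x
```

In the conditional case, `denoise_fn` would additionally receive the encoded class label or brain recording; passing `diffusion.T=5` for debugging simply shortens this loop.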
In this repository, there are two places where generation takes place:
- During training: the `Sampler` is initialized at the beginning of training and repeatedly runs on the updated model every `epochs_per_ckpt` epochs.
- Using the `generate.py` script: when training is finished and a model has been obtained, this script can be run on its own to obtain more outputs from the model.
Generation during training happens automatically (see `learner.py` for the implementation). Below is a description of how to run the `generate.py` script:
- Navigate to the directory in which your `exp` output directory resides, as the script uses relative paths for model loading and output saving.
- Call the script with the appropriate configuration options (see below). As for training, config defaults are loaded from the `configs/config.yaml` file and can be overwritten with command line arguments.
All configuration options can be found in the `generate` section of the `configs/config.yaml` file, but the important ones (i.e. the ones changed most frequently) are listed below:
- `experiment`: name of the experiment config (as found in `configs/experiment`) that was used during the training run. This will load the required diffusion, dataset, and model configs.
- `name`: name of the training run that created the trained model to be used for generating, which is the folder name in the `exp/` directory.
- `conditional_type`: either `class` for class-conditional sampling or `brain` for brain-conditional sampling. If the model is unconditional, this can be null (it will be ignored in any case).
- `conditional_signal`: the actual conditional input signal to use in case of conditional sampling. For a class-conditional model, it suffices to simply state the word to be generated here (e.g. `dag`). For a brain-conditional model, this should be a full (absolute) file path to the ECoG recording file on disk (e.g. `/home/user/path/to/recording/dag1.npy`).
Note: Since the `experiment` config is a top-level config, it suffices to append `experiment=...` as an argument when calling the script. All other options mentioned above are specific to generation, so `generate` has to be prepended to the argument name, e.g. `generate.name=...`, `generate.conditional_type=...`, et cetera.
```shell
python src/generate.py experiment=my-custom-experiment generate.name=Example-Model-v2 generate.conditional_type=class generate.conditional_signal=dag
```
This codebase was originally forked from the DiffWave-SaShiMi repository by albertfgu, and some inspiration was taken from LMNT's implementation of DiffWave, specifically how the model code was structured.
Below are examples of how experiments can be run. The given parameters need to be changed to suit your setup.
In case only a subset of all available GPUs should be used, prepend `CUDA_VISIBLE_DEVICES=<id>,<id>` to the call.
If debugging, add `diffusion.T=5` as a flag, which will reduce the number of diffusion steps during generation. For VariaNTS-based models, you can also append `dataset.splits_path=datasplits/VariaNTS/tiny_subset` to use only a few audio files in total.
Note: You also have to specify `generate.conditional_signal=null` for this experiment, so that the script does not load conditional input, as the default is set for brain input.
```shell
python src/train.py \
    train.name=delete-me \
    experiment=SG-U \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1 \
    generate.conditional_signal=null
```
Note: You also have to specify `generate.conditional_type=class` and `generate.conditional_signal=<word>` for this experiment to determine which word to load for intermediate generations, as the default is set for brain input.
```shell
python src/train.py \
    train.name=delete-me \
    experiment=SG-C \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1 \
    generate.conditional_type=class \
    generate.conditional_signal=dag
```
Harry Potter speech data:
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Ur \
    model.freeze_generator=false \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
VariaNTS speech data:
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Uv \
    model.freeze_generator=true \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
```shell
python src/train.py \
    train.name=delete-me \
    experiment=B2S-Cv \
    train.n_epochs=2 \
    train.epochs_per_ckpt=1 \
    train.iters_per_logging=1
```
| Experiment | Model | Conditional Input | Speech Data | Splits |
|---|---|---|---|---|
| Uncond. Pretraining | DiffWave | - | VariaNTS | full VariaNTS |
| Classcond. Pretraining | DiffWave + Class Encoder | Class vector | VariaNTS | full VariaNTS |
| Brainclasscond. Finetuning | DiffWave + Class Encoder + Brain Classifier | HP ECoG Data | VariaNTS | reduced VariaNTS* |
| Braincond. Finetuning (VNTS) | DiffWave + Brain Encoder | HP ECoG Data | VariaNTS | reduced VariaNTS* |
| Braincond. Finetuning (HP) | DiffWave + Brain Encoder | HP ECoG Data | HP Speech | HP splits |
* reduced VariaNTS means that words which are not present in the Harry Potter ECoG data were removed, in order to correctly map ECoG to speech data.