This is a solution for the background music generation task in Zalo AI Challenge 2023. By applying latent diffusion in the audio domain, we can generate audio or music from a text prompt. We started from an AudioLDM2 checkpoint and fine-tuned it on the challenge dataset. Training and evaluation scripts are provided for reproduction. You can try our submitted model now on HuggingFace or in a Colab notebook.
pip install --upgrade diffusers transformers accelerate scipy
from diffusers import AudioLDM2Pipeline
import torch
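import scipy.io.wavfile  # import the submodule explicitly; a bare "import scipy" does not expose it on older SciPy versions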
repo_id = "vtrungnhan9/audioldm2-music-zac2023"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "This music is instrumental. The tempo is medium with synthesiser arrangements, digital drums and electronic music. The music is upbeat, pulsating, youthful, buoyant, exciting, punchy, psychedelic and has propulsive beats with a dance groove. This music is Techno Pop/EDM."
neg_prompt = "bad quality"
audio = pipe(prompt, negative_prompt=neg_prompt, num_inference_steps=200, audio_length_in_s=10.0, guidance_scale=10).audios[0]
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
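The call above writes the samples exactly as the pipeline returns them, as floating-point values. Some players and editors handle 16-bit PCM files more reliably, so here is a minimal sketch of an alternative writer that uses only the Python standard library. The helper name `write_pcm16_wav` and the test tone are illustrative assumptions and are not part of this repository; the 16000 Hz rate matches the one used above.

```python
import math
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    frames = bytearray()
    for s in samples:
        s = float(s)
        if math.isnan(s):
            s = 0.0  # replace NaN with silence
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in for the pipeline output: one second of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
n_written = write_pcm16_wav("tone_pcm16.wav", tone)
```

With the generated clip, the equivalent call would be `write_pcm16_wav("techno_pcm16.wav", audio)`. The per-sample loop is slow for long clips; it is kept here only to avoid extra dependencies.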
export CUDA_VISIBLE_DEVICES=0
accelerate launch train.py \
--train_file="/code/data/train_main.json" \
--validation_file="/code/data/val_main.json" \
--freeze_text_encoder \
--gradient_accumulation_steps 1 --per_device_train_batch_size=32 --per_device_eval_batch_size=4 \
--learning_rate=3e-5 --num_train_epochs 200 --snr_gamma 5 \
--text_column captions --audio_column location --checkpointing_steps="best" \
--seed 123 \
--save_every 25
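The command above points `--text_column` at a `captions` field and `--audio_column` at a `location` field, so each training record needs at least those two keys. The sketch below shows one way to write such a file. It assumes that `train.py` reads the file as JSON Lines (one JSON object per line); that layout is an assumption and should be checked against the script. The helper name, the example captions, and the audio paths are hypothetical.

```python
import json


def write_manifest(records, path):
    """Write (caption, audio_path) pairs as JSON Lines.

    Each line carries the `captions` and `location` keys named in the
    training command. Hypothetical helper; returns the number of rows written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for caption, audio_path in records:
            caption = caption.strip()
            if not caption:
                raise ValueError(f"empty caption for {audio_path}")
            row = {"captions": caption, "location": audio_path}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical rows; real captions and paths come from the challenge data.
example_records = [
    ("Upbeat instrumental techno with digital drums.", "/code/data/audio/0001.mp3"),
    ("Slow, calm piano piece.", "/code/data/audio/0002.mp3"),
]
n_rows = write_manifest(example_records, "train_example.json")
```

The resulting file could then be passed as `--train_file`, provided the JSON Lines assumption holds for the training script.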