FunAudioLLM/InspireMusic

Demo | Code | Model | Paper

InspireMusic is a fundamental AIGC toolkit with models for music, song, and audio generation, built on PyTorch.

Please support our community project 💖 by starring it on GitHub 🙏


Highlights

InspireMusic focuses on music generation, song generation and audio generation.

  • A unified framework for music/song/audio generation. Controllable with text prompts, music genres, music structures, etc.
  • Supports text-to-music, music continuation, audio super-resolution, and audio reconstruction tasks with high audio quality, at sampling rates of 24kHz and 48kHz.
  • Supports long-form audio generation in multiple output audio formats, i.e., wav, flac, mp3, m4a.
  • Convenient fine-tuning and inference: supports mixed-precision training (FP16, FP32) and provides fine-tuning and inference scripts and strategies, allowing users to easily fine-tune their music generation models.

What's New 🔥

  • 2025/01: Open-sourced the InspireMusic-Base, InspireMusic-Base-24kHz, InspireMusic-1.5B, InspireMusic-1.5B-24kHz, and InspireMusic-1.5B-Long models for music generation. Models are available on both ModelScope and HuggingFace.
  • 2024/12: Added support for generating 48kHz audio with super-resolution flow matching.
  • 2024/11: Welcome to preview 👉🏻 InspireMusic Demos 👈🏻. We're excited to share this with you and are working hard to bring even more features and models soon. Your support and feedback mean a lot to us!
  • 2024/11: We are thrilled to announce the open-sourcing of the InspireMusic code repository and demos. InspireMusic is a unified framework for music, song, and audio generation, featuring capabilities such as text-to-music generation, music structure and genre control, and timestamp management. InspireMusic stands out for its exceptional music generation and instruction-following abilities.

Introduction

Note

This repo contains the algorithm infrastructure and some simple examples. Currently, only English text prompts are supported.

Tip

To explore the model's performance, please refer to the InspireMusic Demo Page. We will open-source better and larger models and a demo space soon.

InspireMusic is a unified framework for music, song, and audio generation that couples audio tokenization and detokenization with a large autoregressive transformer. The original motivation behind this toolkit is to empower everyday users and researchers to create soundscapes and craft music, songs, and audio. The toolkit provides both inference and training code for AI generative models that create high-quality music. InspireMusic combines an autoregressive transformer with conditional flow-matching modeling (CFM) and neural audio tokenizers, allowing controllable generation of music, songs, and audio conditioned on both text and musical structure. Currently, the toolkit supports text-to-music generation and plans to expand its capabilities to text-to-song and text-to-audio generation in the future.
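
As a rough mental model of the data flow, here is a minimal sketch of the two-stage pipeline. The function names are hypothetical stand-ins for illustration only, not the toolkit's API; the real entry points are the scripts under inspiremusic/bin/.

# Conceptual sketch only: hypothetical stand-in functions, not InspireMusic's API.
def encode_text(prompt):
    # Stand-in for the text tokenizer: text -> text token ids.
    return [ord(ch) for ch in prompt]

def autoregressive_transformer(text_tokens):
    # Stand-in for the large autoregressive transformer that emits discrete
    # audio tokens (e.g., codec codes) conditioned on the text tokens.
    return [t % 75 for t in text_tokens]

def flow_matching_detokenize(audio_tokens):
    # Stand-in for the conditional flow-matching model plus the neural audio
    # tokenizer that reconstruct a waveform from the audio tokens.
    return [t / 75.0 for t in audio_tokens]

waveform = flow_matching_detokenize(
    autoregressive_transformer(encode_text("soothing instrumental jazz")))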

Installation

Clone

  • Clone the repo
git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
# If the submodule clone fails due to network issues, run the following commands until they succeed
cd InspireMusic
git submodule update --init --recursive

Install

InspireMusic requires Python 3.8 and PyTorch 2.0.1. To install InspireMusic, run one of the following:

conda create -n inspiremusic python=3.8
conda activate inspiremusic
cd InspireMusic
# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# install flash-attn to speed up training
pip install flash-attn --no-build-isolation

CUDA 11.x is currently supported.
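
A quick, optional check that the environment is usable before moving on (standard PyTorch calls):

# Optional environment check after installation.
import torch

print("torch version:", torch.__version__)          # expect 2.0.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))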

  • Alternatively, install from source as a package:
cd InspireMusic
# Run the following to install the package
python setup.py install
pip install flash-attn --no-build-isolation

We also recommend having sox or ffmpeg installed, either through your system or Anaconda:

# Install sox
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel

# Install ffmpeg
# ubuntu
sudo apt-get install ffmpeg
# centos
sudo yum install ffmpeg
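
To confirm that sox and ffmpeg are visible on your PATH, you can run a quick check, for example:

# Verify that sox and ffmpeg are installed and on the PATH.
import shutil

for tool in ("sox", "ffmpeg"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path else 'NOT FOUND'}")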

Quick Start

Here is a quick example inference script for music generation.

cd InspireMusic
mkdir -p pretrained_models

# Download models
# ModelScope
git clone https://www.modelscope.cn/iic/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic-1.5B-Long
# HuggingFace
git clone https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic-1.5B-Long

cd examples/music_generation
# run a quick inference example
bash infer_1.5b_long.sh

Here is a quick-start script that runs the music generation task end to end, including the data preparation pipeline, model training, and inference.

cd InspireMusic/examples/music_generation/
bash run.sh

One-line Inference Commands

Text-to-music Task

cd examples/music_generation
# with flow matching
python inspiremusic/bin/cli_inference.py --gpu 0 --text "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance."
# without flow matching
python inspiremusic/bin/cli_inference.py --gpu 0 --text "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast 

Music Continuation Task

cd examples/music_generation
# with flow matching
python inspiremusic/bin/cli_inference.py --task continuation --gpu 0 --audio_prompt audio_prompt.wav
# without flow matching
python inspiremusic/bin/cli_inference.py --task continuation --gpu 0 --audio_prompt audio_prompt.wav --fast

Models

Download Model

We strongly recommend that you download our pretrained InspireMusic model.

If you are an expert in this field and only interested in training your own InspireMusic model from scratch, you can skip this step.

# Download the model via git; make sure git lfs is installed
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic
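
If you prefer downloading programmatically instead of via git lfs, the hub SDKs can be used as well; a minimal sketch using the repo IDs from the URLs above (assumes huggingface_hub and modelscope are installed):

# Alternative programmatic download (assumes: pip install huggingface_hub modelscope).
from huggingface_hub import snapshot_download

hf_dir = snapshot_download(
    repo_id="FunAudioLLM/InspireMusic-1.5B-Long",
    local_dir="pretrained_models/InspireMusic-1.5B-Long",
)
print("HuggingFace model downloaded to:", hf_dir)

# ModelScope provides a similar helper; it downloads into its cache and
# returns the local path.
from modelscope.hub.snapshot_download import snapshot_download as ms_download

print("ModelScope model downloaded to:", ms_download("iic/InspireMusic-1.5B-Long"))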

Available Models

Currently, we open-source music generation models that support 24kHz mono and 48kHz stereo audio. The table below lists the links to the ModelScope and HuggingFace model hubs. More models will be available soon.

Model name | Model Links | Remarks
InspireMusic-Base-24kHz | ModelScope, HuggingFace | Pre-trained music generation model, 24kHz mono, 30s
InspireMusic-Base | ModelScope, HuggingFace | Pre-trained music generation model, 48kHz, 30s
InspireMusic-1.5B-24kHz | ModelScope, HuggingFace | Pre-trained 1.5B music generation model, 24kHz mono, 30s
InspireMusic-1.5B | ModelScope, HuggingFace | Pre-trained 1.5B music generation model, 48kHz, 30s
InspireMusic-1.5B-Long ⭐ | ModelScope, HuggingFace | Pre-trained 1.5B music generation model, 48kHz, supports long-form music generation
InspireSong-1.5B | ModelScope, HuggingFace | Pre-trained 1.5B song generation model, 48kHz stereo
InspireAudio-1.5B | ModelScope, HuggingFace | Pre-trained 1.5B audio generation model, 48kHz stereo
Wavtokenizer[1] (75Hz) | ModelScope, HuggingFace | An extremely low-bitrate audio tokenizer for music with a single codebook, for 24kHz audio
Music_tokenizer (75Hz) | ModelScope, HuggingFace | A music tokenizer based on HifiCodec[2], for 24kHz audio
Music_tokenizer (150Hz) | ModelScope, HuggingFace | A music tokenizer based on HifiCodec, for 48kHz audio
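
After downloading, a model directory is expected to contain the checkpoints and tokenizer folders that the inference scripts below reference (llm.pt, flow.pt, music_tokenizer, wavtokenizer). A small sanity check, assuming the Quick Start download path:

# Sanity-check a downloaded model directory (path follows the Quick Start above).
import os

model_dir = "pretrained_models/InspireMusic-1.5B-Long"
for name in ("llm.pt", "flow.pt", "music_tokenizer", "wavtokenizer"):
    status = "ok" if os.path.exists(os.path.join(model_dir, name)) else "missing"
    print(f"{name}: {status}")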

Basic Usage

At the moment, InspireMusic contains the training and inference code for music generation. More tasks, such as song generation and audio generation, will be supported in the future.

Training

Here is an example of training the LLM model; FP16 training is supported.

torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \
    --train_engine "torch_ddp" \
    --config conf/inspiremusic.yaml \
    --train_data data/train.data.list \
    --cv_data data/dev.data.list \
    --model llm \
    --model_dir `pwd`/exp/music_generation/llm/ \
    --tensorboard_dir `pwd`/tensorboard/music_generation/llm/ \
    --ddp.dist_backend "nccl" \
    --num_workers 8 \
    --prefetch 100 \
    --pin_memory \
    --deepspeed_config ./conf/ds_stage2.json \
    --deepspeed.save_states model+optimizer \
    --fp16

Here is an example of training the flow-matching model; FP16 training is not supported.

torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    inspiremusic/bin/train.py \
    --train_engine "torch_ddp" \
    --config conf/inspiremusic.yaml \
    --train_data data/train.data.list \
    --cv_data data/dev.data.list \
    --model flow \
    --model_dir `pwd`/exp/music_generation/flow/ \
    --tensorboard_dir `pwd`/tensorboard/music_generation/flow/ \
    --ddp.dist_backend "nccl" \
    --num_workers 8 \
    --prefetch 100 \
    --pin_memory \
    --deepspeed_config ./conf/ds_stage2.json \
    --deepspeed.save_states model+optimizer

Inference

Here is an example script to quickly run model inference.

cd InspireMusic/examples/music_generation/
bash infer.sh

Here is an example of running inference in normal mode, i.e., with the flow-matching model, for the text-to-music and music continuation tasks.

pretrained_model_dir="./pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --config conf/inspiremusic.yaml \
      --prompt_data data/test/parquet/data.list \
      --flow_model $pretrained_model_dir/flow.pt \
      --llm_model $pretrained_model_dir/llm.pt \
      --music_tokenizer $pretrained_model_dir/music_tokenizer \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse \
      --min_generate_audio_seconds 8 \
      --max_generate_audio_seconds 30 
done

Here is an example of running inference in fast mode, i.e., without the flow-matching model, for the text-to-music and music continuation tasks.

pretrained_model_dir="./pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
  python inspiremusic/bin/inference.py --task $task \
      --gpu 0 \
      --config conf/inspiremusic.yaml \
      --prompt_data data/test/parquet/data.list \
      --flow_model $pretrained_model_dir/flow.pt \
      --llm_model $pretrained_model_dir/llm.pt \
      --music_tokenizer $pretrained_model_dir/music_tokenizer \
      --wavtokenizer $pretrained_model_dir/wavtokenizer \
      --result_dir `pwd`/exp/inspiremusic/${task}_test \
      --chorus verse \
      --fast \
      --min_generate_audio_seconds 8 \
      --max_generate_audio_seconds 30 
done
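
Generated audio is written under the directory passed via --result_dir. As a small sketch, you can inspect the outputs with torchaudio; the exact file names depend on your prompt data, so the glob pattern here is only an assumption:

# Inspect generated audio in the result directory (file naming is an assumption).
import glob

import torchaudio

for path in sorted(glob.glob("exp/inspiremusic/text-to-music_test/*.wav")):
    info = torchaudio.info(path)
    print(f"{path}: {info.sample_rate} Hz, {info.num_frames / info.sample_rate:.1f} s")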

Roadmap

  • 2024/12

    • 75Hz InspireMusic-Base model for music generation
  • 2025/01

    • Support for generating 48kHz audio
    • 75Hz InspireMusic-1.5B model for music generation
    • 75Hz InspireMusic-1.5B-Long model for long-form music generation
  • 2025/02

    • Support song generation task
    • 75Hz InspireSong model for song generation
  • 2025/03

    • Support audio generation task
    • 75Hz InspireAudio model for music and audio generation
  • TBD

    • 25Hz InspireMusic model
    • Support 48kHz stereo audio
    • Streaming inference mode support
    • Support more instruction modes and multilingual instructions
    • InspireSong trained with more multi-lingual data
    • More...

Friend Links

Check out some awesome GitHub repositories from Tongyi Lab, Alibaba Group.

Community & Discussion

  • Please support our community project 🌟 by starring it on GitHub 🙏
  • You are welcome to join our DingTalk and WeChat groups to share and discuss algorithms, technology, and user-experience feedback. Scan the QR codes below to join our official chat groups.

[QR codes: FunAudioLLM DingTalk group | InspireMusic WeChat group]

  • GitHub Discussions: best for sharing feedback and asking questions.
  • GitHub Issues: best for reporting bugs you encounter while using InspireMusic and for proposing features.

Acknowledgements

  1. We borrowed a lot of code from CosyVoice[3].
  2. We borrowed a lot of code from WavTokenizer.
  3. We borrowed a lot of code from AcademiCodec.
  4. We borrowed a lot of code from FunASR.
  5. We borrowed a lot of code from FunCodec.
  6. We borrowed a lot of code from Matcha-TTS.
  7. We borrowed a lot of code from WeNet.

References

[1] Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao, WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, The Thirteenth International Conference on Learning Representations, 2025.

[2] Yang, Dongchao, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, and Yuexian Zou, HiFi-Codec: Group-residual vector quantization for high fidelity audio codec, arXiv preprint arXiv:2305.02765, 2023.

[3] Du, Zhihao, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu et al., CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, arXiv preprint arXiv:2407.05407, 2024.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.