Training code has been released in xtuner; more details will be added in the near future. Thank you for your attention!
We introduce LLaST, a framework for building high-performance speech-to-text translation systems based on large language models (LLMs). We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored to LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and shows exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of LLM-based speech translation frameworks.
| Model | Speech Encoder | LLM | HuggingFace | ModelScope |
|---|---|---|---|---|
| LLaST-2B | Whisper-Large | TinyLlama | TBD | TBD |
| LLaST-8B | Whisper-Large | Llama2-7B-Instruct | TBD | TBD |
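As the table suggests, each model pairs a Whisper speech encoder with a chat LLM; a projector bridges the two, and dual-LoRA optimization adapts both sides during training. Below is a minimal PyTorch sketch of this wiring. The checkpoint names, the single-linear projector, and the LoRA rank/target modules are illustrative assumptions, not the exact xtuner implementation.

```python
# Minimal sketch of the LLaST wiring (illustrative, not the exact xtuner code).
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, WhisperModel


class LLaSTSketch(nn.Module):
    def __init__(self,
                 whisper_name="openai/whisper-large-v2",          # assumed checkpoint
                 llm_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):  # assumed checkpoint
        super().__init__()
        # Speech encoder: Whisper's encoder maps log-mel features to a
        # sequence of acoustic embeddings.
        self.speech_encoder = WhisperModel.from_pretrained(whisper_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Projector: bridges encoder states into the LLM embedding space
        # (assumed here to be a single linear layer).
        self.projector = nn.Linear(self.speech_encoder.config.d_model,
                                   self.llm.config.hidden_size)
        # Dual-LoRA: one set of low-rank adapters on the speech encoder,
        # another on the LLM (rank and target modules are assumptions).
        self.speech_encoder = get_peft_model(
            self.speech_encoder, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))
        self.llm = get_peft_model(
            self.llm, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))

    def forward(self, input_features, prompt_embeds):
        # input_features: log-mel spectrogram, shape (batch, n_mels, frames).
        speech = self.speech_encoder(input_features).last_hidden_state
        speech = self.projector(speech)
        # Prepend projected speech tokens to the text prompt embeddings and
        # let the LLM decode the translation autoregressively.
        return self.llm(inputs_embeds=torch.cat([speech, prompt_embeds], dim=1))
```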
- Download the data from CommonVoice.
- Prepare the TSV data as follows (a quick loading check is sketched after the tree):

```
covost2/tsv
├── covost_v2.de_en.dev.tsv
├── covost_v2.de_en.test.tsv
```
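Each TSV row pairs an audio clip with its source transcript and target translation. A quick sanity check with pandas; the column names follow the official CoVoST-2 release, so verify them against your download:

```python
import pandas as pd

# CoVoST-2 TSVs are tab-separated; quoting is disabled because transcripts
# may contain unescaped quote characters (csv.QUOTE_NONE == 3).
df = pd.read_csv("covost2/tsv/covost_v2.de_en.dev.tsv", sep="\t", quoting=3)
# Expected columns per the official CoVoST-2 release (verify against your copy):
# path, sentence (source transcript), translation (target text), client_id.
print(df.columns.tolist())
print(df.iloc[0]["path"], "->", df.iloc[0]["translation"])
```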
- Prepare the multilingual data as follows:

```
covost2/audio
├── de
├── en
├── es
├── fr
├── it
├── ja
└── zh-CN
```
- Prepare the audio data as 16 kHz WAV clips, as follows (a conversion sketch follows the tree):

```
covost2/audio/fr/clips_16k
├── common_voice_fr_20241860.wav
├── common_voice_fr_20241864.wav
├── common_voice_fr_20241868.wav
├── common_voice_fr_20241872.wav
└── common_voice_fr_20241875.wav
```
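CommonVoice ships MP3 clips at varying sample rates, while the `clips_16k` layout above implies 16 kHz WAV input. A minimal conversion sketch with torchaudio, assuming an MP3-capable torchaudio backend and that the raw clips live in a sibling `clips` directory (an assumption, not part of the original layout):

```python
import pathlib
import torchaudio

SRC = pathlib.Path("covost2/audio/fr/clips")      # assumed: raw CommonVoice mp3 clips
DST = pathlib.Path("covost2/audio/fr/clips_16k")  # 16 kHz wav layout shown above
DST.mkdir(parents=True, exist_ok=True)

for mp3 in SRC.glob("*.mp3"):
    waveform, sr = torchaudio.load(str(mp3))
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    torchaudio.save(str(DST / (mp3.stem + ".wav")), waveform, 16000)
```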
- Install xtuner (the LLaST branch):

```bash
git clone git@github.com:ChenX17/xtuner.git
cd xtuner
git checkout add_llast
pip install -e '.[all]'  # standard xtuner editable install; adjust extras as needed
```
- Training:

```bash
export XTUNER_DATASET_TIMEOUT=120
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python xtuner/tools/train.py workspace/configs/llast_2b_tinyllama_chat.py --deepspeed deepspeed_zero2
```
- Evaluation:

```bash
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python xtuner/tools/test.py workspace/configs/llast_2b_tinyllama_chat.py --checkpoint work_dir/xxxx/epoch_1.pth/mp_rank_00_model_states.pt --launcher slurm
```
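CoVoST-2 results are conventionally reported as corpus BLEU. If you need to score decoded outputs yourself, a minimal sacrebleu sketch is below; the file names are placeholders, not outputs of the test script:

```python
import sacrebleu

# One hypothesis/reference per line, aligned by line number (paths are placeholders).
hyps = open("hyp.de_en.txt", encoding="utf-8").read().splitlines()
refs = open("ref.de_en.txt", encoding="utf-8").read().splitlines()

# Default "13a" tokenization suits X->En; pass tokenize="char" for zh/ja targets.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```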
```bibtex
@inproceedings{chen2024llast,
  title     = {LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models},
  author    = {Chen, Xi and Zhang, Songyang and Bai, Qibing and Chen, Kai and Nakamura, Satoshi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year      = {2024}
}
```