This repository is a curated collection of research papers on the development, implementation, and evaluation of language models for audio data (AudioLLMs). It aims to give researchers and practitioners a comprehensive resource for exploring the latest advances in the field. Contributions and suggestions for new papers are welcome!

Date | Model | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | SPIRIT LM | Meta | SPIRIT LM: Interleaved Spoken and Written Language Model | Paper / Code / Project |
2024-10 | DiVA | Georgia Tech, Stanford | Distilling an End-to-End Voice Assistant Without Instruction Training Data | Paper / Project |
2024-09 | Moshi | Kyutai | Moshi: a speech-text foundation model for real-time dialogue | Paper / Code |
2024-09 | LLaMA-Omni | CAS | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Paper / Code |
2024-08 | Ultravox | fixie-ai | Ultravox (open-source release; no paper) | Code |
2024-08 | Mini-Omni | Tsinghua | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Code |
2024-08 | Typhoon-Audio | Typhoon | Typhoon-Audio Preview Release | Page |
2024-08 | USDM | SNU | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | Paper |
2024-08 | MooER | Moore Threads | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Paper / Code |
2024-07 | GAMA | UMD | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Paper / Code |
2024-07 | LLaST | CUHK-SZ | LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | Paper / Code |
2024-07 | CompA | UMD | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Paper / Code / Project |
2024-07 | Qwen2-Audio | Alibaba | Qwen2-Audio Technical Report | Paper / Code |
2024-07 | FunAudioLLM | Alibaba | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Paper / Code / Demo |
2024-06 | BESTOW | NVIDIA | BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Paper |
2024-06 | DeSTA | NTU-Taiwan, NVIDIA | DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Paper / Code |
2024-05 | AudioChatLlama | Meta | AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs | Paper |
2024-05 | Audio Flamingo | NVIDIA | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Paper / Code |
2024-05 | SpeechVerse | AWS | SpeechVerse: A Large-scale Generalizable Audio Language Model | Paper |
2024-04 | SALMONN | Tsinghua | SALMONN: Towards Generic Hearing Abilities for Large Language Models | Paper / Code / Demo |
2024-03 | WavLLM | CUHK | WavLLM: Towards Robust and Adaptive Speech Large Language Model | Paper / Code |
2024-02 | LTU | MIT | Listen, Think, and Understand | Paper / Code |
2024-02 | SLAM-LLM | SJTU | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Paper / Code |
2024-01 | Pengi | Microsoft | Pengi: An Audio Language Model for Audio Tasks | Paper / Code |
2023-12 | Qwen-Audio | Alibaba | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Paper / Code / Demo |
2023-12 | LTU-AS | MIT | Joint Audio and Speech Understanding | Paper / Code / Demo |
2023-10 | Speech-LLaMA | Microsoft | On decoder-only architecture for speech-to-text and large language model integration | Paper |
2023-10 | UniAudio | CUHK | An Audio Foundation Model Toward Universal Audio Generation | Paper / Code / Demo |
2023-09 | LLaSM | LinkSoul.AI | LLaSM: Large Language and Speech Model | Paper / Code |
2023-06 | AudioPaLM | Google | AudioPaLM: A Large Language Model That Can Speak and Listen | Paper / Demo |
2023-05 | VioLA | Microsoft | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Paper |
2023-05 | SpeechGPT | Fudan | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Paper / Code / Demo |
2023-04 | AudioGPT | Zhejiang University | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper / Code |
2022-09 | AudioLM | Google | AudioLM: a Language Modeling Approach to Audio Generation | Paper / Demo |

Multimodal models that handle audio alongside vision and text:

Date | Model | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |

Approaches for adapting LLMs to speech recognition, translation, and related audio tasks:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | SpeechEmotionLlama | MIT, Meta | Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech | Paper |
2024-09 | AudioBERT | POSTECH | AudioBERT: Audio Knowledge Augmented Language Model | Paper / Code |
2024-09 | MoWE-Audio | A*STAR | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | Paper |
2024-09 | - | Tsinghua SIGS | Comparing Discrete and Continuous Space LLMs for Speech Recognition | Paper |
2024-07 | - | NTU-Taiwan, Meta | Investigating Decoder-only Large Language Models for Speech-to-text Translation | Paper |
2024-06 | Speech ReaLLM | Meta | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time | Paper |
2023-09 | Segment-level Q-Former | Tsinghua | Connecting Speech Encoder and Large Language Model for ASR | Paper |
2023-07 | - | Meta | Prompting Large Language Models with Speech Recognition Abilities | Paper |

Safety studies of voice-enabled LLMs:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |

Benchmarks and evaluation studies for AudioLLMs:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | VoiceBench | NUS | VoiceBench: Benchmarking LLM-Based Voice Assistants | Paper / Code |
2024-08 | MuChoMusic | UPF, QMUL, UMG | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Paper / Code |
2024-07 | AudioEntailment | CMU, Microsoft | Audio Entailment: Assessing Deductive Reasoning for Audio Understanding | Paper / Code |
2024-06 | Audio Hallucination | NTU-Taiwan | Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models | Paper / Code |
2024-06 | AudioBench | A*STAR, Singapore | AudioBench: A Universal Benchmark for Audio Large Language Models | Paper / Code / LeaderBoard |
2024-05 | AIR-Bench | ZJU, Alibaba | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Paper / Code |
2023-09 | Dynamic-SUPERB | NTU-Taiwan, etc. | Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech | Paper / Code |

Audio models are distinct from audio large language models; the benchmarks below evaluate the former.

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-09 | Salmon | Hebrew University of Jerusalem | A Suite for Acoustic Language Model Evaluation | Paper / Code |