This repository is a curated collection of research papers on the development, implementation, and evaluation of language models for audio data (AudioLLMs). It aims to give researchers and practitioners a comprehensive resource for exploring the latest advances in the field. Contributions and suggestions for new papers are welcome!

Date | Model | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | SPIRIT LM | Meta | SPIRIT LM: Interleaved Spoken and Written Language Model | Paper / Code / Project |
2024-10 | DiVA | Georgia Tech, Stanford | Distilling an End-to-End Voice Assistant Without Instruction Training Data | Paper / Project |
2024-09 | Moshi | Kyutai | Moshi: a speech-text foundation model for real-time dialogue | Paper / Code |
2024-09 | LLaMA-Omni | CAS | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Paper / Code |
2024-08 | Ultravox | fixie-ai | Ultravox (open-source release; no paper) | Code |
2024-08 | Mini-Omni | Tsinghua | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Code |
2024-08 | Typhoon-Audio | Typhoon | Typhoon-Audio Preview Release | Page |
2024-08 | USDM | SNU | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | Paper |
2024-08 | MooER | Moore Threads | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Paper / Code |
2024-07 | GAMA | UMD | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Paper / Code |
2024-07 | LLaST | CUHK-SZ | LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | Paper / Code |
2024-07 | CompA | UMD | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Paper / Code / Project |
2024-07 | Qwen2-Audio | Alibaba | Qwen2-Audio Technical Report | Paper / Code |
2024-07 | FunAudioLLM | Alibaba | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Paper / Code / Demo |
2024-06 | BESTOW | NVIDIA | BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 | Paper |
2024-06 | DeSTA | NTU-Taiwan, NVIDIA | DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Paper / Code |
2024-05 | AudioChatLlama | Meta | AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs | Paper |
2024-05 | Audio Flamingo | NVIDIA | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Paper / Code |
2024-05 | SpeechVerse | AWS | SpeechVerse: A Large-scale Generalizable Audio Language Model | Paper |
2024-04 | SALMONN | Tsinghua | SALMONN: Towards Generic Hearing Abilities for Large Language Models | Paper / Code / Demo |
2024-03 | WavLLM | CUHK | WavLLM: Towards Robust and Adaptive Speech Large Language Model | Paper / Code |
2024-02 | LTU | MIT | Listen, Think, and Understand | Paper / Code |
2024-02 | SLAM-LLM | SJTU | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Paper / Code |
2024-01 | Pengi | Microsoft | Pengi: An Audio Language Model for Audio Tasks | Paper / Code |
2023-12 | Qwen-Audio | Alibaba | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Paper / Code / Demo |
2023-12 | LTU-AS | MIT | Joint Audio and Speech Understanding | Paper / Code / Demo |
2023-10 | Speech-LLaMA | Microsoft | On decoder-only architecture for speech-to-text and large language model integration | Paper |
2023-10 | UniAudio | CUHK | An Audio Foundation Model Toward Universal Audio Generation | Paper / Code / Demo |
2023-09 | LLaSM | LinkSoul.AI | LLaSM: Large Language and Speech Model | Paper / Code |
2023-06 | AudioPaLM | Google | AudioPaLM: A Large Language Model That Can Speak and Listen | Paper / Demo |
2023-05 | VioLA | Microsoft | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Paper |
2023-05 | SpeechGPT | Fudan | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Paper / Code / Demo |
2023-04 | AudioGPT | Zhejiang University | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper / Code |
2022-09 | AudioLM | Google | AudioLM: a Language Modeling Approach to Audio Generation | Paper / Demo |

Multimodal models that handle audio alongside vision and text:

Date | Model | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |

Approaches for adapting LLMs to speech recognition, translation, and related audio tasks:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | SpeechEmotionLlama | MIT, Meta | Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech | Paper |
2024-09 | AudioBERT | POSTECH | AudioBERT: Audio Knowledge Augmented Language Model | Paper / Code |
2024-09 | MoWE-Audio | A*STAR | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | Paper |
2024-09 | - | Tsinghua SIGS | Comparing Discrete and Continuous Space LLMs for Speech Recognition | Paper |
2024-07 | - | NTU-Taiwan, Meta | Investigating Decoder-only Large Language Models for Speech-to-text Translation | Paper |
2024-06 | Speech ReaLLM | Meta | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time | Paper |
2023-09 | Segment-level Q-Former | Tsinghua | Connecting Speech Encoder and Large Language Model for ASR | Paper |
2023-07 | - | Meta | Prompting Large Language Models with Speech Recognition Abilities | Paper |

Safety studies of voice-enabled LLMs:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |

Benchmarks and evaluation studies for AudioLLMs:

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-10 | VoiceBench | NUS | VoiceBench: Benchmarking LLM-Based Voice Assistants | Paper / Code |
2024-08 | MuChoMusic | UPF, QMUL, UMG | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Paper / Code |
2024-07 | AudioEntailment | CMU, Microsoft | Audio Entailment: Assessing Deductive Reasoning for Audio Understanding | Paper / Code |
2024-06 | Audio Hallucination | NTU-Taiwan | Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models | Paper / Code |
2024-06 | AudioBench | A*STAR, Singapore | AudioBench: A Universal Benchmark for Audio Large Language Models | Paper / Code / LeaderBoard |
2024-05 | AIR-Bench | ZJU, Alibaba | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Paper / Code |
2023-09 | Dynamic-SUPERB | NTU-Taiwan, etc. | Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech | Paper / Code |

Audio models are distinct from audio large language models; the benchmarks below evaluate the former.

Date | Name | Key Affiliations | Paper | Link |
---|---|---|---|---|
2024-09 | Salmon | Hebrew University of Jerusalem | A Suite for Acoustic Language Model Evaluation | Paper / Code |