This repository supports the SIQ test for voice-understanding LLMs, covering two main types of systems: cascaded (ASR + LLM) and multimodal LLMs. SIQ involves three levels of tests: remember, understand, and apply; the final SIQ is computed with normalization across the three levels.
As a demo, we use `whisper-large-v3` on the `medasr` dataset. For level 1 (remember), run:
```bash
python asr_inference.py --dataset "medasr" --model_name "whisper-large-v3"
```
This will save the ASR results to `ASR_results/subset`, and we use WER as the score.
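For reference, level-1 scoring boils down to corpus-level WER between the reference transcripts and the ASR hypotheses. A minimal sketch using the `jiwer` package; the JSON file name and layout below are illustrative assumptions, not the repository's actual output format:

```python
import json

import jiwer

# Hypothetical result file; assumed layout: [{"ref": "...", "hyp": "..."}, ...]
with open("ASR_results/subset/whisper-large-v3.json") as f:
    results = json.load(f)

refs = [r["ref"] for r in results]
hyps = [r["hyp"] for r in results]

# Corpus-level word error rate over the whole subset; lower is better.
print(f"WER: {jiwer.wer(refs, hyps):.4f}")
```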
For level 2 (understand), we prompt `meta-llama/Llama-3.1-8B-Instruct` to respond to both the ground-truth transcriptions and the ASR results, take the last-layer hidden states from the two runs, and compute the cosine similarity between them as the score:
```bash
python llm_respond.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best"
```
This will save the LLM response results to `llm_respond_results/subset`.
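For intuition, here is a sketch of the similarity computation with Hugging Face `transformers`. Note that `llm_respond.py` obtains hidden states from the model's responses, whereas this sketch simplifies by mean-pooling the last-layer hidden states of the transcripts themselves; the pooling and the example strings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

def pooled_last_hidden(text: str) -> torch.Tensor:
    """Mean-pool the last-layer hidden states for a given input text."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)  # (hidden_size,)

ground_truth = "patient reports mild chest pain"     # illustrative transcript
asr_hypothesis = "patient reports mild chest pains"  # illustrative ASR output

# Cosine similarity between the two pooled representations is the score.
score = torch.nn.functional.cosine_similarity(
    pooled_last_hidden(ground_truth), pooled_last_hidden(asr_hypothesis), dim=0
).item()
print(f"understand score: {score:.4f}")
```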
The apply level measures how well models can answer questions about the input speech. We have already generated question-answer (QA) pairs for each input audio in `QA_data`. For cascaded ASR + LLM, we use `Qwen2-7B-Instruct` to answer the questions based on the ASR results:
```bash
python answer_qa.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best" --answer_model "qwen2-7b" "Qwen/Qwen2-7B-Instruct"
```
This will save the QA results to `QA_results/subset`.
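A hedged sketch of this level-3 step for a single example: answering one question given the ASR transcript as context. The chat-template usage is standard `transformers`; the system prompt and example strings are assumptions, not necessarily what `answer_qa.py` does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-7B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

transcript = "patient reports mild chest pain"      # illustrative ASR output
question = "What symptom does the patient report?"  # illustrative QA item

messages = [
    {"role": "system", "content": "Answer using only the given transcript."},
    {"role": "user", "content": f"Transcript: {transcript}\nQuestion: {question}"},
]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, i.e. the model's answer.
print(tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True))
```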
For multimodal LLMs, we show a demo with `Qwen2-Audio-Instruct` on the `medasr` dataset.
Unlike the cascaded ASR + LLM pipeline, levels 1 and 3 are handled by a single end-to-end run:
```bash
python end2end_asr.py --dataset "medasr" --asr_model "qwen2-audio"
```
This will save both the ASR results and the QA results.
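Under the hood, end-to-end transcription with Qwen2-Audio follows the standard `transformers` recipe. A sketch assuming the `Qwen/Qwen2-Audio-7B-Instruct` checkpoint and a local `sample.wav`; the actual script's prompts and checkpoint may differ:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(name, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe the speech."},
    ]},
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
# Resample the audio to the feature extractor's expected sampling rate.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(
    text=text, audios=[audio], return_tensors="pt", padding=True
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the generated transcript.
transcript = processor.batch_decode(
    out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True
)[0]
print(transcript)
```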
For level 2, the code is the same as in the cascaded setting:

```bash
python llm_respond.py --dataset "medasr" --asr_model "qwen2-audio" --asr_input "qwen2-audio"
```
The final SIQ is computed with normalization both among levels and among models (e.g., the difficulty of each example is determined by the models' performance on it).
To compute the final SIQ, we first preprocess the results of all levels by running:
```bash
python score_stat.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best" --llm "qwen2-7b" --llm_name "Qwen/Qwen2-7B-Instruct"
```
for `whisper-large-v3`, and
```bash
python score_stat.py --dataset "medasr" --asr_model "qwen2-audio" --asr_input "qwen2-audio" --llm "qwen2-audio" --llm_name "qwen2-audio"
```
for `qwen2-audio`.
After we preprocess all models' results, we can run:
```bash
python compute_IQ.py
```
to derive the final SIQ. Note that for a single model, the normalization among models is skipped.
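As a rough illustration of the aggregation idea only: z-normalize each level's scores across models, then average the three levels. The exact weighting in `compute_IQ.py`, including the example-difficulty term, is not reproduced here, and all numbers below are made up:

```python
import numpy as np

# Illustrative per-level scores, oriented so higher is better
# (e.g., 1 - WER for level 1). Values are invented for the example.
scores = {
    "remember":   {"whisper-large-v3+qwen2-7b": 0.92, "qwen2-audio": 0.88},
    "understand": {"whisper-large-v3+qwen2-7b": 0.81, "qwen2-audio": 0.84},
    "apply":      {"whisper-large-v3+qwen2-7b": 0.70, "qwen2-audio": 0.75},
}

models = list(next(iter(scores.values())))
siq = dict.fromkeys(models, 0.0)
for level, per_model in scores.items():
    vals = np.array([per_model[m] for m in models])
    # Normalization among models; with a single model the repository
    # skips this step entirely, as noted above.
    z = (vals - vals.mean()) / (vals.std() + 1e-8)
    for m, v in zip(models, z):
        siq[m] += v / len(scores)  # equal weight across the three levels

print(siq)
```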
Here are SIQ results for some popular models.