This repository supports the SIQ test for voice-understanding LLMs, covering two main types of systems: cascaded (ASR + LLM) and multimodal LLMs. SIQ involves three levels of tests: remember, understand, and apply; the final SIQ is computed with normalization across the three levels.
As a demo, we use `whisper-large-v3` on the `medasr` dataset. For level 1 (remember), run:
```bash
python asr_inference.py --dataset "medasr" --model_name "whisper-large-v3"
```
This will save the ASR results to `ASR_results/subset`, and we use WER as the score.
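For reference, level-1 scoring boils down to corpus-level WER between the reference transcripts and the ASR hypotheses. A minimal sketch using the `jiwer` package; the JSON file name and layout below are illustrative assumptions, not the repository's actual output format:

```python
import json

import jiwer

# Hypothetical result file; assumed layout: [{"ref": "...", "hyp": "..."}, ...]
with open("ASR_results/subset/whisper-large-v3.json") as f:
    results = json.load(f)

refs = [r["ref"] for r in results]
hyps = [r["hyp"] for r in results]

# Corpus-level word error rate over the whole subset; lower is better.
print(f"WER: {jiwer.wer(refs, hyps):.4f}")
```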
For level 2 (understand), we prompt `meta-llama/Llama-3.1-8B-Instruct` to respond to both the ground-truth transcriptions and the ASR results, take the last-layer hidden states from the two runs, and compute the cosine similarity between them as the score:
```bash
python llm_respond.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best"
```
This will save the LLM response results to `llm_respond_results/subset`.
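For intuition, here is a sketch of the similarity computation with Hugging Face `transformers`. Note that `llm_respond.py` obtains hidden states from the model's responses, whereas this sketch simplifies by mean-pooling the last-layer hidden states of the transcripts themselves; the pooling and the example strings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

def pooled_last_hidden(text: str) -> torch.Tensor:
    """Mean-pool the last-layer hidden states for a given input text."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)  # (hidden_size,)

ground_truth = "patient reports mild chest pain"     # illustrative transcript
asr_hypothesis = "patient reports mild chest pains"  # illustrative ASR output

# Cosine similarity between the two pooled representations is the score.
score = torch.nn.functional.cosine_similarity(
    pooled_last_hidden(ground_truth), pooled_last_hidden(asr_hypothesis), dim=0
).item()
print(f"understand score: {score:.4f}")
```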
The apply level measures how well models can answer questions about the input speech. We have already generated question-answer (QA) pairs for each input audio in `QA_data`. For cascaded ASR + LLM, we use `Qwen2-7B-Instruct` to answer the questions based on the ASR results:
```bash
python answer_qa.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best" --answer_model "qwen2-7b" "Qwen/Qwen2-7B-Instruct"
```
This will save the QA results to `QA_results/subset`.
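A hedged sketch of this level-3 step for a single example: answering one question given the ASR transcript as context. The chat-template usage is standard `transformers`; the system prompt and example strings are assumptions, not necessarily what `answer_qa.py` does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-7B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

transcript = "patient reports mild chest pain"      # illustrative ASR output
question = "What symptom does the patient report?"  # illustrative QA item

messages = [
    {"role": "system", "content": "Answer using only the given transcript."},
    {"role": "user", "content": f"Transcript: {transcript}\nQuestion: {question}"},
]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, i.e. the model's answer.
print(tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True))
```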
For multimodal LLMs, we show a demo with `Qwen2-Audio-Instruct` on the `medasr` dataset.
Unlike the cascaded ASR + LLM pipeline, levels 1 and 3 are handled by a single end-to-end run:
```bash
python end2end_asr.py --dataset "medasr" --asr_model "qwen2-audio"
```
This will save both the ASR results and the QA results.
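Under the hood, end-to-end transcription with Qwen2-Audio follows the standard `transformers` recipe. A sketch assuming the `Qwen/Qwen2-Audio-7B-Instruct` checkpoint and a local `sample.wav`; the actual script's prompts and checkpoint may differ:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(name, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe the speech."},
    ]},
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
# Resample the audio to the feature extractor's expected sampling rate.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(
    text=text, audios=[audio], return_tensors="pt", padding=True
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the generated transcript.
transcript = processor.batch_decode(
    out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True
)[0]
print(transcript)
```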
For level 2, the code is the same as in the cascaded setting:

```bash
python llm_respond.py --dataset "medasr" --asr_model "qwen2-audio" --asr_input "qwen2-audio"
```
The final SIQ is computed with normalization both among levels and among models (e.g., the difficulty of each example is determined by the models' performance on it).
To compute the final SIQ, we first preprocess the results of all levels by running:
```bash
python score_stat.py --dataset "medasr" --asr_model "whisper-large-v3" --asr_input "whisper_v3_1best" --llm "qwen2-7b" --llm_name "Qwen/Qwen2-7B-Instruct"
```
for `whisper-large-v3`, and
```bash
python score_stat.py --dataset "medasr" --asr_model "qwen2-audio" --asr_input "qwen2-audio" --llm "qwen2-audio" --llm_name "qwen2-audio"
```
for `qwen2-audio`.
After we preprocess all models' results, we can run:
```bash
python compute_IQ.py
```
to derive the final SIQ. Note that for a single model, the normalization among models is skipped.
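As a rough illustration of the aggregation idea only: z-normalize each level's scores across models, then average the three levels. The exact weighting in `compute_IQ.py`, including the example-difficulty term, is not reproduced here, and all numbers below are made up:

```python
import numpy as np

# Illustrative per-level scores, oriented so higher is better
# (e.g., 1 - WER for level 1). Values are invented for the example.
scores = {
    "remember":   {"whisper-large-v3+qwen2-7b": 0.92, "qwen2-audio": 0.88},
    "understand": {"whisper-large-v3+qwen2-7b": 0.81, "qwen2-audio": 0.84},
    "apply":      {"whisper-large-v3+qwen2-7b": 0.70, "qwen2-audio": 0.75},
}

models = list(next(iter(scores.values())))
siq = dict.fromkeys(models, 0.0)
for level, per_model in scores.items():
    vals = np.array([per_model[m] for m in models])
    # Normalization among models; with a single model the repository
    # skips this step entirely, as noted above.
    z = (vals - vals.mean()) / (vals.std() + 1e-8)
    for m, v in zip(models, z):
        siq[m] += v / len(scores)  # equal weight across the three levels

print(siq)
```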
Here are SIQ results for some popular models.