Chuangtao Chen1,
Grace Li Zhang2,
Xunzhao Yin3,
Cheng Zhuo3,
Ulf Schlichtmann1,
Bing Li1
1Technical University of Munich
2Technical University of Darmstadt
3Zhejiang University
(a) LiveMind inference with Llama-3-70B model; (b) LiveMind collaborative inference with Llama-3-70B and Llama-3-8B models; (c) Conventional CoT inference.
preview.mp4
A demo with Gradio showing conventional Chain-of-Thought inference (left) and LiveMind simultaneous inference (right) with streaming input. See the `Playground` section for more information.
Install required packages:
pip install datasets alive_progress nltk
Before running the scripts, you need to change the following configurations in `live_mind/config.py` to set the LLMs and datasets:
- `MMLU_PRO_PATH`: path to the MMLU-Pro dataset; the path should contain the `.parquet` dataset files.
- Implement the `get_model` method: you can use your own model here as long as it has the required methods (see `live_mind/config.py`). You can also use the `get_model_vllm_example` implementation. A sketch of a custom wrapper is shown after this list.
- To use the `get_model_vllm_example` function, you need to specify the paths `LLAMA_3_8B_PATH` and `LLAMA_3_70B_PATH`. A `config.json` file and a `tokenizer.json` file should be found in these paths. Besides, make sure the packages `vllm` and `transformers` are installed:
pip install vllm transformers
- Models used in the paper: Llama-3-70B, Llama-3-8B
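If you implement `get_model` with your own model instead of `get_model_vllm_example`, the sketch below shows one possible vLLM-backed wrapper. It is only illustrative: the class and method names (`VLLMChatModel`, `chat_complete`) and the name-to-path mapping are placeholders, not the interface the framework requires, so check `live_mind/config.py` for the methods your model actually needs to provide.

```python
# Illustrative sketch only; the required model interface is defined in live_mind/config.py.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

LLAMA_3_8B_PATH = "/path/to/Meta-Llama-3-8B-Instruct"  # must contain config.json and tokenizer.json

class VLLMChatModel:
    """Thin chat wrapper around a local model served by vLLM (hypothetical name)."""

    def __init__(self, model_path: str, max_tokens: int = 1024):
        self.llm = LLM(model=model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    def chat_complete(self, messages: list[dict]) -> str:
        # Render the chat messages with the model's chat template, then generate once.
        prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        outputs = self.llm.generate([prompt], self.sampling_params)
        return outputs[0].outputs[0].text

def get_model(name: str) -> VLLMChatModel:
    # Hypothetical mapping from the command-line model name to a local path.
    paths = {"llama-3-8b": LLAMA_3_8B_PATH}
    return VLLMChatModel(paths[name])
```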
Run the following commands to reproduce the results of real-time estimation:
python run_solver.py --model llama-3-70b --use_lm --output_file ./output/mmlu_pro/time_info/llama_3_70b_lm/all.json
python run_solver.py --model llama-3-70b --output_file ./output/mmlu_pro/time_info/llama_3_70b_base/all.json
python run_solver.py --model llama-3-70b --assist_model llama-3-8b --use_lm --action_set SAS --output_file ./output/mmlu_pro/time_info/llama_3_70b_w_8b_lm/all.json
python run_solver.py --model llama-3-8b --output_file ./output/mmlu_pro/time_info/llama_3_8b_base/all.json
Run the following commands to reproduce the results of batched inference:
python run_batch_solver.py --model llama-3-70b --use_lm --output_file ./output/mmlu_pro/batched/llama_3_70b_lm/all.json
python run_batch_solver.py --model llama-3-70b --output_file ./output/mmlu_pro/batched/llama_3_70b_base/all.json
python run_batch_solver.py --model llama-3-70b --use_lm --assist_model llama-3-8b --action_set SAS --output_file ./output/mmlu_pro/batched/llama_3_70b_w_8b_lm/all.json
python run_batch_solver.py --model llama-3-8b --output_file ./output/mmlu_pro/batched/llama_3_8b_base/all.json
Run the following commands to analyze the output files and reproduce the experiment results:
python analyze_time_info.py ./output/mmlu_pro/time_info/llama_3_70b_lm/all.json
python analyze_time_info.py ./output/mmlu_pro/time_info/llama_3_70b_base/all.json
python analyze_time_info.py ./output/mmlu_pro/time_info/llama_3_70b_w_8b_lm/all.json
python analyze_time_info.py ./output/mmlu_pro/time_info/llama_3_8b_base/all.json
This step creates two CSV files, `timeinfo_by_category.csv` and `timeinfo_by_len.csv`, in each folder that contains an `all.json` file.
python analyze_batched.py ./output/mmlu_pro/batched/llama_3_70b_lm/all.json
python analyze_batched.py ./output/mmlu_pro/batched/llama_3_70b_base/all.json
python analyze_batched.py ./output/mmlu_pro/batched/llama_3_70b_w_8b_lm/all.json
python analyze_batched.py ./output/mmlu_pro/batched/llama_3_8b_base/all.json
This step creates two CSV files, `timeinfo_by_category.csv` and `timeinfo_by_len.csv`, in each folder that contains an `all.json` file.
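If you want a quick look at the aggregated numbers without opening the files by hand, a small pandas snippet is enough (pandas is not required by the repository; the path below is just one of the outputs generated above):

```python
# Convenience only: print one of the per-category summaries produced by the analysis scripts.
import pandas as pd

df = pd.read_csv("./output/mmlu_pro/batched/llama_3_70b_lm/timeinfo_by_category.csv")
print(df.to_string(index=False))
```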
Run the following commands to reproduce the results presented in Sec. 4.4 of the paper:
python analyze_actions.py ./output/mmlu_pro/batched/llama_3_70b_lm/all.json
python analyze_actions.py ./output/mmlu_pro/batched/llama_3_70b_w_8b_lm/all.json
This step creates two CSV files, `actions_per_step` and `actions_per_len`, in these two folders, corresponding to the data presented in Fig. 8.
To reproduce the results in Table 2, first run the batched inference with the following configurations:
python run_batch_solver.py --model llama-3-8b --use_lm --action_set CAS --output_file ./output/mmlu_pro/ablation/llama_3_8b_lm_comp/all.json
python run_batch_solver.py --model llama-3-8b --use_lm --action_set SAS --output_file ./output/mmlu_pro/ablation/llama_3_8b_lm_simp/all.json
python run_batch_solver.py --model llama-3-8b --use_lm --assist_model llama-3-70b --action_set CAS --output_file ./output/mmlu_pro/ablation/llama_3_8b_w_70b_lm_comp/all.json
python run_batch_solver.py --model llama-3-8b --use_lm --assist_model llama-3-70b --action_set SAS --output_file ./output/mmlu_pro/ablation/llama_3_8b_w_70b_lm_simp/all.json
python run_batch_solver.py --model llama-3-70b --use_lm --action_set SAS --output_file ./output/mmlu_pro/ablation/llama_3_70b_lm_simp/all.json
python run_batch_solver.py --model llama-3-70b --use_lm --assist_model llama-3-8b --action_set CAS --output_file ./output/mmlu_pro/ablation/llama_3_70b_w_8b_lm_simp/all.json
Then run `python analyze_batched.py **/all.json` to report the results.
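Note that the `**` glob only expands recursively in shells with globstar enabled (e.g., `shopt -s globstar` in bash). If it does not expand in your shell, you can loop over the ablation outputs generated above explicitly, for example:
for f in ./output/mmlu_pro/ablation/*/all.json; do python analyze_batched.py "$f"; done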
We implemented demos with `gradio` and `textual`. In the demos, you can interact with LLMs through the LiveMind framework, allowing the LLM to take actions as you type in the text box!
To run the demos, you need to install `vllm` and `transformers`:
pip install vllm transformers
To run the demo in `gradio`, you need to install `gradio`; to run the demo in `textual`, you need to install `textual` (you can select either):
pip install textual
pip install gradio
Then, set the model paths in `playground/config.py` to your own model paths. Each path should contain a `config.json` file and a `tokenizer.json` file. You can download the models from Hugging Face, for example Llama-3-70B and Llama-3-8B.
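One way to fetch a model locally, assuming you have the `huggingface_hub` CLI installed and access to the (gated) Llama 3 repositories, and assuming the instruct variant is the one you want:
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /path/to/Meta-Llama-3-8B-Instruct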
To run the demo in Gradio, use:
python run_playground.py --gradio --model llama-3-70b --use_lm
Type your message in the text box and press Enter to send it. You can toggle whether to use the LiveMind (LM) framework by clicking the checkbox.
In LiveMind inference mode, the model can perform inference while you are typing. The actions performed are displayed in the Actions text box. You can also include `--log` when launching the demo; the actions will then be logged to `playground/log.log`.
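To follow the logged actions live from a second terminal while the demo is running (plain shell usage, not a feature of the repository):
tail -f playground/log.log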
You can use `--assist_model [model_name]` to use a different model as the output model, as mentioned in the paper.
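For example, mirroring the collaborative configuration used in the batched-inference commands above (Llama-3-70B with Llama-3-8B as the output model), a command along these lines should work:
python run_playground.py --gradio --model llama-3-70b --assist_model llama-3-8b --use_lm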
You can also run the demo implemented with Textual in your terminal; simply use:
python run_playground.py --textual --model llama-3-70b --use_lm
When using `--use_lm`, the model runs in LiveMind mode, which means it can perform inference while you are typing. Click the send button to send the message.
The actions performed by the LLM are not displayed in the chat window. To see the model actions, include `--log` when launching the demo; the actions will then be logged to `playground/log.log`.
If you do not include `--use_lm`, the chat runs in normal mode without the LiveMind framework.
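For example, to run the Textual demo with conventional inference only:
python run_playground.py --textual --model llama-3-70b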
To cite our work:
@article{chen2024livemind,
title={{LiveMind}: Low-latency Large Language Models with Simultaneous Inference},
author={Chuangtao Chen and Grace Li Zhang and Xunzhao Yin and Cheng Zhuo and Ulf Schlichtmann and Bing Li},
journal={arXiv preprint arXiv:2406.14319},
year={2024},
}