LLM Math Evaluation Harness

A unified, precise, and extensible toolkit to benchmark LLMs on various mathematical tasks 🧮✨.

🔴🚀 Important Notice: We have observed result discrepancies of more than 5% across different math evaluation frameworks. To ensure fair and standardized comparisons across research, our toolkit strives to harmonize evaluation methods, promoting consistent and reliable math evaluation.

🌟 In Practice: Esteemed projects like ToRA (ICLR'24) and DeepSeek-Coder have leveraged this suite!

🙌 Statement: This project is a fork of math-evaluation-harness, which is no longer actively maintained by the original author. We have forked it to ensure continued updates (e.g., more features, more automation, more benchmarks) and maintenance under GAIR Lab.

Features:

  • Models: Seamless compatibility with models from Hugging Face 🤗 and vLLM.

  • Datasets: An extensive array of datasets including minerva_math, math, math_oai, gsm8k, gsm_hard, svamp, asdiv, mawps, tabmwp, finqa, theorem_qa, bbh, mmlu_stem, sat_math, mathqa, hungarian_exam.

  • Prompts: Diverse prompting paradigms, from Direct to Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (ToRA).

🚀 Getting Started

⚙️ Environment Setup

Option 1: Conda

conda create -n math_eval python=3.10
conda activate math_eval

Option 2: Docker

We suggest using the vLLM Docker image directly:

docker run --network host --cap-add=SYS_ADMIN --privileged -d \
    --entrypoint '' --name vllm \
    --runtime nvidia --gpus all \
    --security-opt apparmor:unconfined \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt:/mnt \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    sleep infinity
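
If you go the Docker route, the Install steps below can be run inside the container. A minimal sketch, assuming the container was started with the command above:

# attach a shell to the running container named "vllm"
docker exec -it vllm bash
# the host's /mnt is available inside via the bind mount above;
# run the Install commands below from a working directory of your choice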

Install

git clone https://github.com/GAIR-NLP/math-evaluation-harness.git
cd math-evaluation-harness
pip install -r requirements.txt
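
Optionally, you can sanity-check the environment before running any evaluation. A minimal check, assuming requirements.txt pulls in vLLM and PyTorch:

# verify that vLLM imports and CUDA is visible (assumes vllm and torch are installed via requirements.txt)
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available())"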

⚖️ Evaluation

  1. Configure model and data settings in scripts/run_math_eval.sh, and set the PROMPT_TYPE variable accordingly:

    • For base models, choose from: direct, cot, pal, or tool-integrated.
    • For SFT models, your options include: tora, wizard_zs, deepseek-math, etc.
      • To add new models, update the construct_prompt function in utils.py to include your new prompt template.
  2. Run the script:

To evaluate a model on specific datasets, pass the dataset names as the second argument (e.g., gsm8k or svamp); to evaluate on all benchmarks, omit the second argument.

# evaluate one model on one or several datasets
bash scripts/run_eval.sh ${your_model_path} "dataset_name1,dataset_name2"
# evaluate one model on all benchmarks
bash scripts/run_eval.sh ${your_model_path}
# for a large model (e.g., 70B), evaluate on all benchmarks and specify the TP (tensor parallel) size
bash scripts/run_eval.sh ${your_model_path} 2
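
For example, a concrete invocation might look like the following (the model path and dataset names here are placeholders; substitute your own):

# hypothetical example: evaluate a local 7B checkpoint on two datasets
bash scripts/run_eval.sh /path/to/my-math-7b "gsm8k,minerva_math"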

If you would like to evaluate all intermediate checkpoints saved during language model training, you can use the following commands:

# evaluate on one or several datasets only
bash auto_dir_run.sh ${your_model_folder_path} "dataset_name1,dataset_name2"
# evaluate on all benchmarks
bash auto_dir_run.sh ${your_model_folder_path}
# for a large model (e.g., 70B), evaluate on all benchmarks and specify the TP (tensor parallel) size
bash auto_dir_run.sh ${your_model_folder_path} 2
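
For example (the folder path here is a placeholder):

# hypothetical example: evaluate every saved checkpoint on GSM8K and MATH
bash auto_dir_run.sh /path/to/train_outputs/my-math-7b-sft "gsm8k,math"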

Note that your_model_folder_path should look like this:

$ ls your_model_folder_path
checkpoint-1000/  checkpoint-2000/  checkpoint-3000/  checkpoint-4000/  checkpoint-5000/  checkpoint-6000/  checkpoint-7000/  checkpoint-8000/  checkpoint-9000/
  3. Summarize the results automatically:
# summarize all results of all intermediate ckpts in your_model_folder_path
python gather_results.py --do_all_ckpts --dir_path outputs/${your_model_folder_path}
# summarize results of one ckpt
python gather_results.py --do_one_ckpt --dir_path outputs/${your_model_folder_path}

This command will automatically generate summary results and plots in your model folder. We provide some example results in the outputs folder.

📊 Results

Base Models (CoT)

PROMPT_TYPE=cot

| Model | Size | Data | Uniq. Token | Train Token | GSM8K | MATH¹ | SVAMP | ASDiv | MAWPS | TAB² | MQA | MMLU STEM | SAT | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **1-2B Base Models** | | | | | | | | | | | | | | |
| Tinyllama | 1.1B | - | - | - | 2.9 | 3.2 | 11.0 | 18.1 | 20.4 | 12.5 | 14.6 | 16.1 | 21.9 | 13.4 |
| Phi-1.5 | 1.3B | - | - | - | 32.4 | 4.2 | 43.4 | 53.1 | 66.2 | 24.4 | 14.3 | 21.8 | 18.8 | 31.0 |
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 48.5 | 63.6 | 79.0 | 29.2 | 25.1 | 31.3 | 40.6 | 40.0 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | 38.0 | 56.6 | 72.5 | 36.9 | 26.8 | 34.4 | 50.0 | 38.4 |
| DeepSeekLLM | 1.3B | OWM | 14B | 150B | 11.5 | 8.9 | - | - | - | - | - | 29.6 | 31.3 | - |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | - | - | - | - | - | 33.1 | 56.3 | - |
| Rho-Math | 1.1B | OWM | 14B | 30B | 36.2 | 15.6 | 52.1 | 67.0 | 83.9 | 29.0 | 32.5 | 23.3 | 28.1 | 40.9 |
| **>= 7B Base Models** | | | | | | | | | | | | | | |
| LLaMA-2 | 7B | - | - | - | 14.0 | 3.6 | 39.5 | 51.7 | 63.5 | 30.9 | 12.4 | 32.7 | 34.4 | 31.4 |
| Mistral | 7B | - | - | - | 41.2 | 11.6 | 64.7 | 68.5 | 87.5 | 52.9 | 33.0 | 49.5 | 59.4 | 52.0 |
| Minerva | 8B | - | 39B | 164B | 16.2 | 14.1 | - | - | - | - | - | 35.6 | - | - |
| Minerva | 62B | - | 39B | 109B | 52.4 | 27.6 | - | - | - | - | - | 53.9 | - | - |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | - | - | - | - | - | 63.9 | - | - |
| LLemma | 7B | PPile | 55B | 200B | 38.8 | 17.2 | 56.1 | 69.1 | 82.4 | 48.7 | 41.0 | 45.4 | 59.4 | 50.9 |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 67.9 | 75.7 | 90.1 | 57.0 | 49.8 | 54.7 | 68.8 | 60.1 |
| Intern-Math | 7B | - | 31B | 125B | 41.8 | 14.4 | 61.6 | 66.8 | 83.7 | 50.0 | 57.3 | 24.8 | 37.5 | 48.7 |
| Intern-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 75.7 | 79.3 | 94.0 | 50.9 | 38.5 | 53.1 | 71.9 | 62.1 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | 34.2 | 74.0 | 83.9 | 92.4 | 63.4 | 62.4 | 56.4 | 84.4 | 68.4 |
| Rho-Math | 7B | OWM | 14B | 10.5B | 66.9 | 31.0 | 77.8 | 79.0 | 93.9 | 49.9 | 58.7 | 54.6 | 84.4 | 66.2 |

SFT Models (Code Interpreter)

PROMPT_TYPE=tora

| Model | Size | SFT Data | GSM8K | MATH | SVAMP | ASDiv | MAWPS | TAB | GSM-Hard | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT4-early (PAL) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | 82.7 | 86.8 | 93.8 | 74.0 | 67.2 | 76.9 |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | 52.0 | 80.1 | 87.1 | 93.8 | 85.8 | 63.1 | 77.4 |
| Rho-Math | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| Rho-Math | 7B | ToRA-69k | 81.3 | 51.8 | 80.8 | 85.5 | 94.5 | 70.1 | 63.1 | 75.3 |

SFT Models (CoT)

PROMPT_TYPE=deepseek-math

| Size | Model | GSM8K | MATH | SVAMP | ASDiv | MAWPS | AVG |
|---|---|---|---|---|---|---|---|
| 7B | DeepSeek-Math-Instruct | 82.4 | 45.8 | 83.5 | 90.1 | 95.7 | 79.5 |
| 7B | DeepSeek-Math-RL | 88.3 | 50.0 | 87.2 | 92.0 | 95.5 | 82.6 |

🍀 Contributing

This project is still under active development. We welcome any contributions, including bug reports, feature requests, and pull requests.


Footnotes

  1. We suggest utilizing the OpenAI test subset for evaluating MATH performance, since the original MATH test set has already been included in public training sets such as PRM800k. We use the minerva_math prompt.

  2. Abbreviations: TAB = tabmwp, MQA = mathqa, SAT = sat_math.