This is the official repository of the paper "Learning to Reason from Feedback at Test-Time".
TL;DR: We introduce Feedback-based Test-Time Training (FTTT), a novel paradigm for exploiting test-time feedback to improve reasoning performance, which formulates feedback utilization as a training problem. We additionally propose a learnable test-time optimizer, OpTune, to make FTTT more effective.
- [2025-02-25] 🔥 We release the code of our paper. Detailed instructions can be found below.
Our implementation is based on `python=3.12`. Follow the commands below to prepare the Python environment (we recommend using Miniconda to set up the environment):
```bash
git clone https://github.com/LaVi-Lab/FTTT.git
cd FTTT
conda install pytorch==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
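If you follow the Miniconda recommendation above, you can create and activate a dedicated environment before running the commands above; a minimal sketch (the environment name `fttt` is only an example):

```bash
# Create and activate a Python 3.12 environment; the name is arbitrary
conda create -n fttt python=3.12 -y
conda activate fttt
```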
We mainly train and evaluate on MATH500, GSM8K, MBPP and HumanEval. You can download the raw data via the links below or download our packed version from Google Drive:
| Dataset | Link |
|---|---|
| MATH500 | |
| GSM8K | |
| MBPP | 🤗 |
| HumanEval | 🤗 |
> [!NOTE]
> All downloaded datasets should be stored in a folder named `datasets`.
Another important asset of this repo is the question indices of the test sets, as we only evaluate hard questions that cannot be solved by the raw LLM initially. You can download these indices from Google Drive and unzip them into the folder `metadata`.

**Reproducing question indices**
If you want to reproduce the question indices by yourself, you can perform greedy decoding on each dataset:
```bash
# Greedy decoding
bash prelim/scripts/greedy.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
```
This command will automatically save a file `{llama|mistral}_{MATH500|GSM8K|MBPP|HumanEval}_correct_cases.json` under the current directory, which contains the indices of questions that can be correctly answered by greedy decoding.
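As a quick sanity check, you can count how many questions greedy decoding solved. This is a minimal sketch, assuming the saved file is a JSON list of question indices; the filename below corresponds to Llama on GSM8K, so adjust it to your model and dataset:

```bash
# Count the question indices saved by greedy decoding (assumed to be a JSON list)
python -c "import json; print(len(json.load(open('llama_GSM8K_correct_cases.json'))))"
```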
The training data of OpTune consists of solutions generated from the raw LLM. We provide the Google Drive link to download our training data.
**Reproducing OpTune training data**
You can generate the training data for OpTune by yourself:
```bash
bash optim/scripts/gen.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP]
```
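For example, to generate OpTune training data with Llama-3.1-8B-Instruct on GSM8K:

```bash
bash optim/scripts/gen.sh Llama-3.1-8B-Instruct GSM8K
```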
Finally, all data should be organized as follows:
```
FTTT
|-- datasets [Raw training & evaluation data here!]
|-- metadata [Test set indices here!]
|-- api
|-- optim
|   |-- cache [OpTune training data here!]
|   |-- data
|   |-- models
|   |-- scripts
|   |-- ...
|-- prelim
|-- .gitignore
|-- LICENSE
|-- README.md
|-- requirements.txt
```
For the code generation task, we deploy a local API service to run the generated code and check if it passes all test cases. We use the following commands to launch the service:
```bash
cd api
uvicorn oj_api:app --host 0.0.0.0 --port 9999 --workers 8 --limit-concurrency 16
```
By default, this codebase sends requests to `http://localhost:9999` for evaluating code completion datasets. If you want to use another host or port, please modify `--host` and `--port` above and add `export OJ_API=YOUR_NEW_URL` to our scripts to make the new URL effective.
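For example, to serve the judge API on port 8000 instead (the port number is arbitrary; this sketch also assumes the scripts pick up the `OJ_API` environment variable from the calling shell, otherwise add the export line inside the scripts as described above):

```bash
# Launch the code-execution API on a different port and point evaluation at it
cd api
uvicorn oj_api:app --host 0.0.0.0 --port 8000 --workers 8 --limit-concurrency 16 &
cd ..
export OJ_API=http://localhost:8000
```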
Below are commands to reproduce all of our experiments on FTTT and other test-time scaling baselines:
```bash
# Revision
bash prelim/scripts/revision.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Beam Search
bash prelim/scripts/beam_search.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Best-of-N
bash prelim/scripts/best_of_n.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Refine
bash prelim/scripts/self_refine.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Consistency
bash prelim/scripts/self_consistency.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# FTTT (w/o or w/ self-reflection)
bash prelim/scripts/fttt.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100] [FTTT|FTTT+]
```
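For example, to run FTTT with self-reflection on GSM8K using Llama-3.1-8B-Instruct and seed 42:

```bash
bash prelim/scripts/fttt.sh Llama-3.1-8B-Instruct GSM8K 42 FTTT+
```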
> [!NOTE]
> All output logs will be stored in the folder `outputs` under the current directory.
> [!IMPORTANT]
> Our codebase uses 🤗 HuggingFace `transformers` to download & load pretrained models, including Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. If you want to load the models from a local directory, please update the model name in our scripts to your local directory, e.g., `meta-llama/Llama-3.1-8B-Instruct` => `/YOUR/PATH/TO/MODEL_DIRECTORY`.
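One way to do this is a simple in-place substitution, assuming the model name appears literally in the shell scripts (if it is set in the Python entry points instead, edit those files accordingly); the target path below is a placeholder:

```bash
# Swap the HuggingFace model name for a local checkpoint directory in the experiment scripts
sed -i 's#meta-llama/Llama-3.1-8B-Instruct#/YOUR/PATH/TO/MODEL_DIRECTORY#g' prelim/scripts/*.sh optim/scripts/*.sh
```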
You can use the following commands to reproduce the training of PEFT baselines as well as OpTune:
```bash
# PEFT baselines
bash optim/scripts/baseline.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [FT|LoRA|Adapter|IA3|LNTuning]
# OpTune
bash optim/scripts/train.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3]
```
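For instance, to train OpTune on GSM8K with Llama-3.1-8B-Instruct (seed 42), or a LoRA baseline with the same setting:

```bash
# OpTune on GSM8K
bash optim/scripts/train.sh GSM8K 42 Llama-3.1-8B-Instruct
# LoRA baseline on GSM8K
bash optim/scripts/baseline.sh GSM8K 42 Llama-3.1-8B-Instruct LoRA
```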
> [!NOTE]
> All checkpoints will be stored in the folder `outputs` under the current directory.
> [!TIP]
> OpTune currently does not support model architectures other than Llama and Mistral, as it relies on modifications to the original 🤗 HuggingFace implementation to inject weight updates during inference. If you want to support other model architectures, please add an implementation to `optim/models` and modify `run.py` and `evaluator.py` to use the correct architecture.
> [!WARNING]
> Although our implementation of OpTune supports distributed data parallelism for multi-GPU training, this feature is not well tested, so we suggest training on a single GPU.
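To keep training on a single GPU, you can restrict the visible devices when launching the script (this assumes a CUDA setup; the device index is just an example):

```bash
# Expose only GPU 0 to the training script
CUDA_VISIBLE_DEVICES=0 bash optim/scripts/train.sh GSM8K 42 Llama-3.1-8B-Instruct
```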
After training the models, you can use the following commands to evaluate the trained PEFT baselines and OpTune:
```bash
# PEFT baselines
bash optim/scripts/baseline_scale.sh ${PEFT_CHECKPOINT_PATH}
# OpTune
bash optim/scripts/scale.sh ${OPTUNE_CHECKPOINT_PATH}
```
Typically, we use the checkpoints in the last epoch for evaluation.
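For example, if OpTune training produced a checkpoint under the `outputs` folder, evaluation might look like the following (the path is a placeholder; substitute the checkpoint directory from your own run or from the downloads below):

```bash
bash optim/scripts/scale.sh outputs/YOUR_OPTUNE_RUN/LAST_EPOCH_CHECKPOINT
```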
To reproduce our OpTune performance, you can download our checkpoints for each dataset from Google Drive and run the evaluation using the commands above:
| Dataset | Llama-3.1-8B-Instruct | Mistral-7B-Instruct-v0.3 |
|---|---|---|
| MATH500 | | |
| GSM8K | | |
| MBPP | | |
Some of our implementations are based on MEND. We would like to thank the authors for sharing their code.
Please cite our paper if you find our work useful:
```bibtex
@misc{li2025learning,
    title={Learning to Reason from Feedback at Test-Time},
    author={Yanyang Li and Michael Lyu and Liwei Wang},
    year={2025},
    eprint={2502.15771},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2502.15771},
}
```