This is the official repository of the paper "Learning to Reason from Feedback at Test-Time".
TL;DR: We introduce Feedback-based Test-Time Training (FTTT), a novel paradigm for exploiting test-time feedback to improve reasoning performance, which formulates feedback utilization as a training problem. We additionally propose a learnable test-time optimizer, OpTune, to make FTTT more effective.
- [2025-02-25] 🔥 We release the code of our paper. Detailed instructions can be found below.
Our implementation is based on `python=3.12`. Follow the commands below to prepare the Python environment (we recommend using Miniconda to set up the environment):
```bash
git clone https://github.com/LaVi-Lab/FTTT.git
cd FTTT
conda install pytorch==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
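If you follow the Miniconda recommendation above, you can create and activate a dedicated environment before running the commands above; a minimal sketch (the environment name `fttt` is only an example):

```bash
# Create and activate a Python 3.12 environment; the name is arbitrary
conda create -n fttt python=3.12 -y
conda activate fttt
```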
We mainly train and evaluate on MATH500, GSM8K, MBPP and HumanEval. You can download the raw data via the links below or download our packed version from Google Drive:
| Dataset | Link |
|---|---|
| MATH500 | |
| GSM8K | |
| MBPP | 🤗 |
| HumanEval | 🤗 |
> [!NOTE]
> All downloaded datasets should be stored in a folder named `datasets`.
Another important asset of this repo is the question indices of the test sets, as we only evaluate hard questions that cannot be solved by the raw LLM initially. You can download these indices from Google Drive and unzip them into the folder `metadata`.

**Reproducing question indices**
If you want to reproduce the question indices by yourself, you can perform greedy decoding on each dataset:
```bash
# Greedy decoding
bash prelim/scripts/greedy.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
```
This command will automatically save a file `{llama|mistral}_{MATH500|GSM8K|MBPP|HumanEval}_correct_cases.json` under the current directory, which contains the indices of questions that can be correctly answered by greedy decoding.
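As a quick sanity check, you can count how many questions greedy decoding solved. This is a minimal sketch, assuming the saved file is a JSON list of question indices; the filename below corresponds to Llama on GSM8K, so adjust it to your model and dataset:

```bash
# Count the question indices saved by greedy decoding (assumed to be a JSON list)
python -c "import json; print(len(json.load(open('llama_GSM8K_correct_cases.json'))))"
```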
The training data of OpTune consists of solutions generated from the raw LLM. We provide the Google Drive link to download our training data.
**Reproducing OpTune training data**
You can generate the training data for OpTune by yourself:
```bash
bash optim/scripts/gen.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP]
```
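For example, to generate OpTune training data with Llama-3.1-8B-Instruct on GSM8K:

```bash
bash optim/scripts/gen.sh Llama-3.1-8B-Instruct GSM8K
```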
Finally, all data should be organized as follows:
```
FTTT
|-- datasets [Raw training & evaluation data here!]
|-- metadata [Test set indices here!]
|-- api
|-- optim
|   |-- cache [OpTune training data here!]
|   |-- data
|   |-- models
|   |-- scripts
|   |-- ...
|-- prelim
|-- .gitignore
|-- LICENSE
|-- README.md
|-- requirements.txt
```
For the code generation task, we deploy a local API service to run the generated code and check if it passes all test cases. We use the following commands to launch the service:
```bash
cd api
uvicorn oj_api:app --host 0.0.0.0 --port 9999 --workers 8 --limit-concurrency 16
```
By default, this codebase sends requests to `http://localhost:9999` for evaluating code completion datasets. If you want to use another host or port, please modify `--host` and `--port` above and add `export OJ_API=YOUR_NEW_URL` to our scripts to make the new URL effective.
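For example, to serve the judge API on port 8000 instead (the port number is arbitrary; this sketch also assumes the scripts pick up the `OJ_API` environment variable from the calling shell, otherwise add the export line inside the scripts as described above):

```bash
# Launch the code-execution API on a different port and point evaluation at it
cd api
uvicorn oj_api:app --host 0.0.0.0 --port 8000 --workers 8 --limit-concurrency 16 &
cd ..
export OJ_API=http://localhost:8000
```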
Below are commands to reproduce all of our experiments on FTTT and other test-time scaling baselines:
```bash
# Revision
bash prelim/scripts/revision.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Beam Search
bash prelim/scripts/beam_search.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Best-of-N
bash prelim/scripts/best_of_n.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Refine
bash prelim/scripts/self_refine.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Consistency
bash prelim/scripts/self_consistency.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# FTTT (w/o or w/ self-reflection)
bash prelim/scripts/fttt.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100] [FTTT|FTTT+]
```
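For example, to run FTTT with self-reflection on GSM8K using Llama-3.1-8B-Instruct and seed 42:

```bash
bash prelim/scripts/fttt.sh Llama-3.1-8B-Instruct GSM8K 42 FTTT+
```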
> [!NOTE]
> All output logs will be stored in the folder `outputs` under the current directory.
> [!IMPORTANT]
> Our codebase uses 🤗 HuggingFace `transformers` to download & load pretrained models, including Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. If you want to load the models from a local directory, please update the model name in our scripts to your local directory, e.g., `meta-llama/Llama-3.1-8B-Instruct` => `/YOUR/PATH/TO/MODEL_DIRECTORY`.
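One way to do this is a simple in-place substitution, assuming the model name appears literally in the shell scripts (if it is set in the Python entry points instead, edit those files accordingly); the target path below is a placeholder:

```bash
# Swap the HuggingFace model name for a local checkpoint directory in the experiment scripts
sed -i 's#meta-llama/Llama-3.1-8B-Instruct#/YOUR/PATH/TO/MODEL_DIRECTORY#g' prelim/scripts/*.sh optim/scripts/*.sh
```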
You can use the following commands to reproduce the training of PEFT baselines as well as OpTune:
```bash
# PEFT baselines
bash optim/scripts/baseline.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [FT|LoRA|Adapter|IA3|LNTuning]
# OpTune
bash optim/scripts/train.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3]
```
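For instance, to train OpTune on GSM8K with Llama-3.1-8B-Instruct (seed 42), or a LoRA baseline with the same setting:

```bash
# OpTune on GSM8K
bash optim/scripts/train.sh GSM8K 42 Llama-3.1-8B-Instruct
# LoRA baseline on GSM8K
bash optim/scripts/baseline.sh GSM8K 42 Llama-3.1-8B-Instruct LoRA
```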
> [!NOTE]
> All checkpoints will be stored in the folder `outputs` under the current directory.
> [!TIP]
> OpTune currently does not support model architectures other than Llama and Mistral, as it relies on modifications to the original 🤗 HuggingFace implementation to inject weight updates during inference. If you want to support other model architectures, please add an implementation to `optim/models` and modify `run.py` and `evaluator.py` to use the correct architecture.
> [!WARNING]
> Although our implementation of OpTune supports distributed data parallelism for multi-GPU training, this feature is not well tested, so we suggest training on a single GPU.
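To keep training on a single GPU, you can restrict the visible devices when launching the script (this assumes a CUDA setup; the device index is just an example):

```bash
# Expose only GPU 0 to the training script
CUDA_VISIBLE_DEVICES=0 bash optim/scripts/train.sh GSM8K 42 Llama-3.1-8B-Instruct
```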
After training the models, you can use the following commands to evaluate the trained PEFT baselines and OpTune:
```bash
# PEFT baselines
bash optim/scripts/baseline_scale.sh ${PEFT_CHECKPOINT_PATH}
# OpTune
bash optim/scripts/scale.sh ${OPTUNE_CHECKPOINT_PATH}
```
Typically, we use the checkpoints in the last epoch for evaluation.
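For example, if OpTune training produced a checkpoint under the `outputs` folder, evaluation might look like the following (the path is a placeholder; substitute the checkpoint directory from your own run or from the downloads below):

```bash
bash optim/scripts/scale.sh outputs/YOUR_OPTUNE_RUN/LAST_EPOCH_CHECKPOINT
```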
To reproduce our OpTune performance, you can download our checkpoints for each dataset from Google Drive and run the evaluation using the commands above:
| Dataset | Llama-3.1-8B-Instruct | Mistral-7B-Instruct-v0.3 |
|---|---|---|
| MATH500 | | |
| GSM8K | | |
| MBPP | | |
Some of our implementations are based on MEND. We would like to thank the authors for sharing their code.
Please cite our paper if you find our work useful:
```bibtex
@misc{li2025learning,
    title={Learning to Reason from Feedback at Test-Time},
    author={Yanyang Li and Michael Lyu and Liwei Wang},
    year={2025},
    eprint={2502.15771},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2502.15771},
}
```