Learning to Reason from Feedback at Test-Time

arXiv | License: MIT

This is the official repository of the paper "Learning to Reason from Feedback at Test-Time".

TL;DR: We introduce Feedback-based Test-Time Training (FTTT), a novel paradigm for exploiting test-time feedback to improve reasoning performance by formulating feedback utilization as a training problem. We additionally propose a learnable test-time optimizer, OpTune, to make FTTT more effective.

🔔 Updates

  • [2025-02-25] 🔥 We release the code of our paper. Detailed instructions can be found below.

🛠️ Installation

Our implementation is based on python=3.12. Follow the commands below to prepare the Python environment (we recommend using Miniconda to set up the environment):

git clone https://github.com/LaVi-Lab/FTTT.git
cd FTTT
conda install pytorch==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
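
If you use Miniconda, you can create and activate a dedicated Python 3.12 environment before running the conda install and pip install commands above. This is only a sketch; the environment name fttt is an arbitrary choice:

# Create and activate a Python 3.12 environment (the name "fttt" is arbitrary)
conda create -n fttt python=3.12 -y
conda activate fttt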

💡 Preparation

โฌ Data

Note

All downloaded datasets should be stored in a folder named datasets.

  • Another important asset of this repo is the question indices of the test sets, as we only evaluate hard questions that the raw LLM cannot solve initially. You can download these indices from Google Drive and unzip them into the folder metadata.

    Reproducing question indices

    If you want to reproduce the question indices by yourself, you can perform greedy decoding on each dataset:

    # Greedy decoding
    bash prelim/scripts/greedy.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]

    This command will automatically save a file {llama|mistral}_{MATH500|GSM8K|MBPP|HumanEval}_correct_cases.json under the current directory, which contains the indices of the questions that greedy decoding answers correctly (a quick sanity check on this file is sketched after this list).

  • The training data of OpTune consists of solutions generated by the raw LLM. We provide a Google Drive link to download our training data.

    Reproducing OpTune training data

    You can also generate the training data for OpTune yourself:

    bash optim/scripts/gen.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP]
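
Whether you download the question indices or regenerate them yourself, a quick sanity check on a generated correct_cases file is to count its entries. The sketch below assumes the file is a flat JSON list of indices, and the filename is just one instantiation of the pattern described above:

# Count the question indices in a generated file (assumes a flat JSON list; filename is illustrative)
python -c "import json; print(len(json.load(open('llama_GSM8K_correct_cases.json'))))"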

Finally, all data should be organized as follows:

FTTT
|-- datasets [Raw training & evaluation data here!]
|-- metadata [Test set indices here!]
|-- api
|-- optim
|   |-- cache [OpTune Training data here!]
|   |-- data
|   |-- models
|   |-- scripts
|   |-- ...
|-- prelim
|-- .gitignore
|-- LICENSE
|-- README.md
|-- requirements.txt
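
Before running any experiments, you can quickly verify that the key data folders are in place. This is a minimal sketch, assuming you run it from the repository root:

# Check that the expected data folders exist (run from the FTTT repository root)
for d in datasets metadata optim/cache; do
  [ -d "$d" ] && echo "OK: $d" || echo "MISSING: $d"
done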

🌐 API

For the code generation task, we deploy a local API service to run the generated code and check if it passes all test cases. We use the following commands to launch the service:

cd api
uvicorn oj_api:app --host 0.0.0.0 --port 9999 --workers 8 --limit-concurrency 16

By default, this codebase sends requests to http://localhost:9999 when evaluating the code completion datasets. If you want to use another port or host, modify --host and --port above and add export OJ_API=YOUR_NEW_URL to our scripts so that the new URL takes effect.
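
For example, to serve the judge on port 8080 instead of 9999 (the port choice here is purely illustrative), you could run:

# Launch the code-execution service on an alternative port
cd api
uvicorn oj_api:app --host 0.0.0.0 --port 8080 --workers 8 --limit-concurrency 16
# Point the evaluation scripts at the new URL (add this export to the scripts)
export OJ_API=http://localhost:8080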

📊 FTTT Experiments

Below are commands to reproduce all of our experiments on FTTT and other test-time scaling baselines:

# Revision
bash prelim/scripts/revision.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Beam Search
bash prelim/scripts/beam_search.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval]
# Best-of-N
bash prelim/scripts/best_of_n.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Refine
bash prelim/scripts/self_refine.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# Self-Consistency
bash prelim/scripts/self_consistency.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100]
# FTTT (w/o or w/ self-reflection)
bash prelim/scripts/fttt.sh [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [MATH500|GSM8K|MBPP|HumanEval] [42|85|100] [FTTT|FTTT+]
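
For instance, picking one value per bracketed option, a concrete FTTT run with self-reflection looks like this:

# FTTT with self-reflection on GSM8K, using Llama-3.1-8B-Instruct and seed 42
bash prelim/scripts/fttt.sh Llama-3.1-8B-Instruct GSM8K 42 FTTT+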

Note

All output logs will be stored in the folder outputs under the current directory.

Important

Our codebase uses 🤗 HuggingFace transformers to download & load pretrained models, including Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. If you want to load the models from a local directory, please update the model name in our scripts to your local directory, e.g., meta-llama/Llama-3.1-8B-Instruct => /YOUR/PATH/TO/MODEL_DIRECTORY.
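
If you prefer to rewrite the scripts in place, the following is one possible sketch; it assumes the Hugging Face model identifier appears literally in the shell scripts under prelim/scripts and optim/scripts, which you should verify for your checkout:

# Replace the Hugging Face model name with a local directory (script locations are an assumption)
sed -i 's|meta-llama/Llama-3.1-8B-Instruct|/YOUR/PATH/TO/MODEL_DIRECTORY|g' prelim/scripts/*.sh optim/scripts/*.sh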

🎯 OpTune Experiments

📌 Training

You can use the following commands to reproduce the training of PEFT baselines as well as OpTune:

# PEFT baselines
bash optim/scripts/baseline.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3] [FT|LoRA|Adapter|IA3|LNTuning]
# OpTune
bash optim/scripts/train.sh [MATH500|GSM8K|MBPP] 42 [Llama-3.1-8B-Instruct|Mistral-7B-Instruct-v0.3]
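
For example, picking one value per bracketed option, an OpTune training run looks like this:

# Train OpTune on MATH500 with Llama-3.1-8B-Instruct and seed 42
bash optim/scripts/train.sh MATH500 42 Llama-3.1-8B-Instruct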

Note

All checkpoints will be stored in the folder outputs under the current directory.

Tip

OpTune currently does not support model architectures other than Llama and Mistral, as it relies on modifications to the original 🤗 HuggingFace implementation to inject weight updates during inference. If you want to support other model architectures, please add an implementation to optim/models and modify run.py and evaluator.py to use the correct architecture.

Warning

Although our implementation of OpTune supports distributed data parallelism for multi-GPU training, this feature is not well tested, and we suggest training on a single GPU.

⚖️ Testing

After training the models, you can use the following commands to evaluate the trained PEFT baselines and OpTune:

# PEFT baselines
bash optim/scripts/baseline_scale.sh ${PEFT_CHECKPOINT_PATH}
# OpTune
bash optim/scripts/scale.sh ${OPTUNE_CHECKPOINT_PATH}

Typically, we use the checkpoint from the last epoch for evaluation.
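
For example, an OpTune evaluation run might look like the following; the checkpoint path is a hypothetical placeholder, so substitute whatever last-epoch checkpoint your training run produced under outputs:

# Evaluate a trained OpTune checkpoint (the path below is a placeholder, not a real artifact of this repo)
bash optim/scripts/scale.sh outputs/OPTUNE_MATH500_LLAMA/last_epoch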

To reproduce our OpTune performance, you can download our checkpoints for each dataset from Google Drive and run the evaluation using the commands above:

Dataset   Llama-3.1-8B-Instruct   Mistral-7B-Instruct-v0.3
MATH500   Google Drive            Google Drive
GSM8K     Google Drive            Google Drive
MBPP      Google Drive            Google Drive

🤝 Acknowledgement

Some of our implementations are based on MEND. We would like to thank the authors for sharing their code.

✒️ Citation

Please cite our paper if you find our work useful:

@misc{li2025learning,
      title={Learning to Reason from Feedback at Test-Time}, 
      author={Yanyang Li and Michael Lyu and Liwei Wang},
      year={2025},
      eprint={2502.15771},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15771}, 
}
