- [August, 2022] Data release: GSM8K and StrategyQA, generated by code-davinci-002.
Each subfolder in the /data folder corresponds to a reasoning benchmark. You can find more details about each benchmark in the README.md file of its subfolder.
Generally, these subfolders contain train.jsonl and test.jsonl files.
Each line of these files shares the same format:
{
  // context: the prompt sequence we provide to the language model.
  // {Qi}/{Ei}/{Ai} represents the question/chain-of-thought/answer of the i-th exemplar.
  // {Q} represents the question for inference.
  "context": "Question:\n{Q1}\n{E1}\n#### {A1}\n\n{...}Question:\n{Qk}\n{Ek}\n#### {Ak}\n\nQuestion:\n{Q}\nAnswer:\n",
  // samples: multiple output sequences sampled from the language model, given the prompt sequence as input.
  "samples": [
    "{E}\n#### {A}\n\n",
    "{E'}\n#### {A'}\n\n",
    "{E''}\n#### {A''}\n\n",
    ...
  ],
  // {E*}/{A*} represents the ground truth chain-of-thought/answer of {Q}.
  // if the dataset doesn't provide ground truth chain-of-thoughts, {E*} will be "No chain-of-thought provided.".
  "metadata": {
    "question": "{Q}",
    "ground_truth": "{E*}#### {A*}"
  }
}
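As a rough illustration of how this format can be consumed, the sketch below reads records from a .jsonl file and tallies the final answers found in `samples`. The helper names (`extract_answer`, `load_records`, `answer_counts`) are hypothetical and are not part of the released code. It assumes each sample ends with a `#### {A}` line as described above, and the demo record is made-up data.

```python
import json
from collections import Counter
from typing import Iterator, Optional


def extract_answer(sample: str) -> Optional[str]:
    """Return the text following the last '####' marker, or None if there is none."""
    if "####" not in sample:
        return None
    tail = sample.split("####")[-1].strip()
    return tail.splitlines()[0].strip() if tail else None


def load_records(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def answer_counts(record: dict) -> Counter:
    """Count how often each final answer appears among the sampled outputs."""
    answers = (extract_answer(s) for s in record["samples"])
    return Counter(a for a in answers if a is not None)


# Made-up record that follows the documented layout (not real data).
record = {
    "context": "Question:\nWhat is 2 + 3?\nAnswer:\n",
    "samples": ["2 + 3 = 5\n#### 5\n\n", "2 + 3 = 6\n#### 6\n\n", "3 + 2 = 5\n#### 5\n\n"],
    "metadata": {"question": "What is 2 + 3?", "ground_truth": "2 + 3 = 5\n#### 5"},
}
print(answer_counts(record).most_common(1))  # [('5', 2)]
```

For a real file, `load_records("train.jsonl")` can be iterated in the same way.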
Currently, all data we release in this repository are generated by the code-davinci-002 model provided by OpenAI.
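The `context` template from the data format can be assembled from exemplars as in the sketch below. This only mirrors the template as written in this README. `build_context` is a hypothetical helper, not the authors' prompt-construction code, and the exemplar contents are placeholders.

```python
from typing import List, Tuple


def build_context(exemplars: List[Tuple[str, str, str]], question: str) -> str:
    """Join (question, chain-of-thought, answer) exemplars and the target question."""
    parts = [f"Question:\n{q}\n{e}\n#### {a}\n\n" for q, e, a in exemplars]
    parts.append(f"Question:\n{question}\nAnswer:\n")
    return "".join(parts)


# Placeholder exemplar, for illustration only.
prompt = build_context([("What is 1 + 1?", "1 + 1 = 2", "2")], "What is 2 + 3?")
print(prompt)
```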
Given the train.jsonl and test.jsonl files generated by large-scale pretrained language models, you can use the code provided in the code folder to reproduce our results. Here we take the gsm8k dataset as an example.
- Install dependencies according to the environment properties of code/verifier_data_prepare.yaml and verifier_train.yaml.
- Register a wandb account and get a wandb API key.
- Create a new folder (denoted as {EXEC_DIR}) and initialize this folder as follows:
$ {EXEC_DIR}
.
├── train_dir
│ └── train.jsonl
├── test_dir
│ └── test.jsonl
├── train_preprocessed // this is an empty folder
├── test_preprocessed // this is an empty folder
└── exec // this is an empty folder
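If you prefer to script this layout, a minimal sketch using only the Python standard library might look as follows. `init_exec_dir` is a hypothetical helper that is not included in this repository, and the demo runs in a throwaway temporary directory with placeholder files.

```python
import shutil
import tempfile
from pathlib import Path

SUBDIRS = ("train_dir", "test_dir", "train_preprocessed", "test_preprocessed", "exec")


def init_exec_dir(exec_dir, train_jsonl, test_jsonl) -> Path:
    """Create the folder layout shown above and copy the two data files into place."""
    root = Path(exec_dir)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    shutil.copy(train_jsonl, root / "train_dir" / "train.jsonl")
    shutil.copy(test_jsonl, root / "test_dir" / "test.jsonl")
    return root


# Demo with placeholder files; point these paths at the real data when using it.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "train.jsonl").write_text("{}\n", encoding="utf-8")
    (tmp_path / "test.jsonl").write_text("{}\n", encoding="utf-8")
    root = init_exec_dir(tmp_path / "run", tmp_path / "train.jsonl", tmp_path / "test.jsonl")
    print(sorted(p.name for p in root.iterdir()))
```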
In the code/src folder, run these two commands:
python verifier_data_prepare.py \
    --generator_result_file {EXEC_DIR}/train_dir \
    --output_dir {EXEC_DIR}/train_preprocessed \
    --split train \
    --random_seed 233 \
    --dataset_name GSM8K

python verifier_data_prepare.py \
    --generator_result_file {EXEC_DIR}/test_dir \
    --output_dir {EXEC_DIR}/test_preprocessed \
    --split dev \
    --random_seed 233 \
    --dataset_name GSM8K
You can find the detailed parameter specifications in code/verifier_data_prepare.yaml.
In the code/src folder, run these commands:
export WANDB_API_KEY={your_wandb_api_key_here}
export WANDB_PROJECT=deberta-verifier
export WANDB_RUN_ID=gsm8k-codedavinci002
export WANDB_TAGS=deberta_verifier
export NCCL_DEBUG=INFO
deepspeed --num_gpus=8 run_ner.py \
    --task_type NER \
    --dataset_name GSM8K \
    --train_data {EXEC_DIR}/train_preprocessed \
    --test_data {EXEC_DIR}/test_preprocessed \
    --output_dir {EXEC_DIR}/exec \
    --max_seq_length 512 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 64 \
    --lr_scheduler_type constant \
    --seed 233 \
    --logging_steps 10 \
    --overwrite_output_dir \
    --alpha 0.1 \
    --deepspeed ds_config.json
You can find the detailed parameter specifications in code/verifier_train.yaml.
All the training/evaluation logs will be uploaded to your wandb account.
Key logged metrics include:
- eval_weighted_voting_top1_accuracy@100: solve rate of DIVERSE (our approach);
- eval_voting_top1_accuracy@100: solve rate of DIVERSE w/o verifier (i.e., each candidate is weighted equally);
- eval_verifier_top1_accuracy@100: solve rate of DIVERSE w/o voting (i.e., selecting the candidate with highest verifier score).
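To make the difference between these three metrics concrete, here is a small sketch of the three selection rules. It assumes that weighted voting sums the verifier scores of all candidates sharing the same final answer. The function names and the candidate scores are illustrative and are not taken from the training code or from real results.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (final answer, verifier score)


def weighted_voting(candidates: List[Candidate]) -> str:
    """Pick the answer whose candidates have the largest total verifier score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


def majority_voting(candidates: List[Candidate]) -> str:
    """Same as weighted voting, but every candidate counts equally."""
    return weighted_voting([(answer, 1.0) for answer, _ in candidates])


def verifier_top1(candidates: List[Candidate]) -> str:
    """Pick the answer of the single highest-scoring candidate."""
    return max(candidates, key=lambda c: c[1])[0]


def solve_rate(predict: Callable[[List[Candidate]], str],
               questions: List[Tuple[List[Candidate], str]]) -> float:
    """Fraction of questions whose selected answer equals the ground truth."""
    return sum(predict(c) == truth for c, truth in questions) / len(questions)


# Illustrative scores only: the three rules disagree on this candidate set.
cands = [("18", 0.45), ("18", 0.45), ("20", 0.8), ("21", 0.1), ("21", 0.1), ("21", 0.1)]
print(weighted_voting(cands), majority_voting(cands), verifier_top1(cands))  # 18 21 20
```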
If our work is useful for you, please consider citing our paper:
@article{li2022advance,
title={On the Advance of Making Language Models Better Reasoners},
author={Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu},
journal={arXiv preprint arXiv:2206.02336},
year={2022}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Please note that this repo is released under the MIT License.