First, cd into the data directory and run the following commands:
- Run the following script to generate the Alpaca dataset:
sh gen_data.sh alpaca_data all.json tatsu-lab/alpaca
Next, run the following Python script to split alpaca_data/all.json into train.json (40k), dev.json (10k), and test.json (2k); a sketch of the split logic appears after this list.
python split_dataset.py all.json 40000 10000 2000
- Run the following script to generate the HumanEval dataset (test-only):
sh gen_data.sh humaneval_data test.json openai_humaneval
- Run the following script to generate the GSM8K dataset (test-only):
sh gen_data.sh gsm8k_test_data test.json gsm8k_test
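The Alpaca split step above amounts to slicing all.json by position. Below is a minimal sketch of that logic; the argument handling, output file locations, and lack of shuffling are assumptions, and the actual split_dataset.py in this repo may differ:

```python
# Minimal sketch of the split step. Assumptions (not taken from the repo):
# split_dataset.py receives the input JSON and the three split sizes as
# positional arguments and writes train.json / dev.json / test.json next to
# the input file, slicing the list in order without shuffling.
import json
import sys
from pathlib import Path

def split_dataset(in_path: str, n_train: int, n_dev: int, n_test: int) -> None:
    data = json.loads(Path(in_path).read_text())
    assert len(data) >= n_train + n_dev + n_test, "not enough examples to split"
    out_dir = Path(in_path).parent
    splits = {
        "train.json": data[:n_train],
        "dev.json": data[n_train:n_train + n_dev],
        "test.json": data[n_train + n_dev:n_train + n_dev + n_test],
    }
    for name, subset in splits.items():
        (out_dir / name).write_text(json.dumps(subset))

if __name__ == "__main__":
    # e.g. python split_dataset.py all.json 40000 10000 2000
    split_dataset(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]), int(sys.argv[4]))
```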
Each JSON file is a list of dicts with the following keys:
- prompt: str, the input prompt.
- prefix: list[int], the tokenized prompt.
- continuation: str, the response generated by the target model.
- tokens: list[int], the tokenized continuation.
- draft: list[int], the next tokens generated by the draft model, conditioned on the target model's generation.
- log_p_7b: list[float], the log probabilities of the draft tokens under the draft model (7B).
- log_p_70b: list[float], the log probabilities of the draft tokens under the target model (70B).
- p_acc: list[float], the acceptance probabilities of the draft tokens.
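As an example of working with these files, the snippet below loads one record, checks that the keys above are present, and reconstructs acceptance probabilities from the stored log-probabilities using the standard speculative-decoding rule min(1, p_target / p_draft). The file path and the assumption that p_acc was computed with this rule are illustrative, not confirmed by the repo:

```python
# Load one record and inspect it. The path below is an example; point it at
# whichever split you generated. The min(1, p_target / p_draft) acceptance
# rule is the standard speculative-decoding formula and is assumed, not
# confirmed, to be how the stored p_acc values were produced.
import json
import math

with open("alpaca_data/test.json") as f:
    record = json.load(f)[0]

expected_keys = {"prompt", "prefix", "continuation", "tokens",
                 "draft", "log_p_7b", "log_p_70b", "p_acc"}
assert expected_keys <= record.keys(), "missing keys in record"

print(record["prompt"][:80])
print("continuation length:", len(record["tokens"]))

for tok, lp_draft, lp_target, p_acc in zip(
    record["draft"], record["log_p_7b"], record["log_p_70b"], record["p_acc"]
):
    reconstructed = min(1.0, math.exp(lp_target - lp_draft))
    print(tok, round(p_acc, 3), round(reconstructed, 3))
```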