Dataset Preparation

First, change into the data directory (cd data), then run the following commands:

  1. Run the following script to generate the Alpaca dataset:
sh gen_data.sh alpaca_data all.json tatsu-lab/alpaca

Then run the following Python script to split the Alpaca all.json into train.json (40k examples), dev.json (10k), and test.json (2k):

python split_dataset.py all.json 40000 10000 2000

  2. Run the following script to generate the HumanEval dataset (test only):
sh gen_data.sh humaneval_data test.json openai_humaneval

  3. Run the following script to generate the GSM8K dataset (test only):
sh gen_data.sh gsm8k_test_data test.json gsm8k_test
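
For orientation, the train/dev/test split in step 1 could look roughly like the sketch below. This is not the contents of split_dataset.py, whose behavior is not documented here. The function name split_records, the fixed-seed shuffle, and the choice to drop leftover examples are all assumptions made for illustration.

```python
import random


def split_records(records, n_train, n_dev, n_test, seed=0):
    """Hypothetical helper: shuffle a copy of `records` and cut three disjoint splits.

    Assumption: the real split_dataset.py may order, shuffle, or truncate
    the data differently.
    """
    total = n_train + n_dev + n_test
    if total > len(records):
        raise ValueError("requested split sizes exceed the dataset size")

    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:total]  # anything past `total` is dropped
    return train, dev, test
```

With the sizes from step 1, this would be called as split_records(records, 40000, 10000, 2000) on the list loaded from all.json, and each returned list would be written out with json.dump.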

Data Format

Each JSON file contains a list of dicts, each with the following keys:

prompt: str, the prompt. 
prefix: list[int], tokenized prompt.
continuation: str, the response generated by the target model. 
tokens: list[int], tokenized continuation.
draft: list[int], next tokens generated by the draft model, conditioned on the target model's generation.
log_p_7b: list[float], the log probabilities of the draft tokens predicted by the draft model (7b).
log_p_70b: list[float], the log probabilities of the draft tokens predicted by the target model (70b).
p_acc: list[float], the acceptance probabilities of the draft tokens.
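
The keys above can be sanity-checked with a short Python sketch. Two things here are assumptions rather than documented facts: that log_p_7b, log_p_70b, and p_acc each hold one entry per draft token, and that p_acc follows the standard speculative-sampling rule min(1, p_target / p_draft). The helper names check_record and acceptance_probs are hypothetical.

```python
import math

REQUIRED_KEYS = {
    "prompt", "prefix", "continuation", "tokens",
    "draft", "log_p_7b", "log_p_70b", "p_acc",
}


def check_record(record):
    """Raise if a record lacks a documented key or has mismatched lengths."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    # Assumption: the per-token lists align one-to-one with the draft tokens.
    n = len(record["draft"])
    for key in ("log_p_7b", "log_p_70b", "p_acc"):
        if len(record[key]) != n:
            raise ValueError(f"{key} has {len(record[key])} entries, expected {n}")


def acceptance_probs(log_p_draft, log_p_target):
    """min(1, p_target / p_draft) per token, computed in log space.

    Assumption: this is the standard speculative-sampling acceptance rule;
    the dataset's p_acc may have been computed differently.
    """
    return [math.exp(min(0.0, t - d)) for d, t in zip(log_p_draft, log_p_target)]
```

After loading a file with json.load, calling check_record on each element and comparing acceptance_probs(r["log_p_7b"], r["log_p_70b"]) with r["p_acc"] is a quick way to see whether the stored values match that rule.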