Dataset Preparation

First, change into the data directory (cd data), then run the commands below.

Run the following script to generate the Alpaca dataset:

sh gen_data.sh alpaca_data all.json tatsu-lab/alpaca

Next, run the following Python script to split the Alpaca all.json into train.json (40k examples), dev.json (10k), and test.json (2k):

python split_dataset.py all.json 40000 10000 2000
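
The implementation of split_dataset.py is not shown here. As a rough sketch only, a split with these sizes could look like the following. It assumes all.json holds a single JSON list and that records are taken in order without shuffling; the function name split_records is hypothetical and is not taken from the repository.

```python
def split_records(records, n_train, n_dev, n_test):
    """Slice a list into consecutive train/dev/test parts.

    Taking records in order (no shuffling) is an assumption of this sketch.
    """
    total = n_train + n_dev + n_test
    if total > len(records):
        raise ValueError(f"requested {total} records, only {len(records)} available")
    train = records[:n_train]
    dev = records[n_train:n_train + n_dev]
    test = records[n_train + n_dev:total]
    return train, dev, test


if __name__ == "__main__":
    # Tiny stand-in for the contents of all.json
    records = [{"prompt": f"p{i}"} for i in range(10)]
    train, dev, test = split_records(records, 6, 3, 1)
    for name, part in (("train", train), ("dev", dev), ("test", test)):
        print(name, len(part))
```

With the sizes used above (40000, 10000, 2000), the three parts add up to 52,000 records, so the split only succeeds if all.json contains at least that many entries.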

Run the following script to generate the HumanEval dataset (test-only):

sh gen_data.sh humaneval_data test.json openai_humaneval

Run the following script to generate the GSM8K dataset (test-only):

sh gen_data.sh gsm8k_test_data test.json gsm8k_test

Data Format

Each JSON file contains a list of dicts, each with the following keys:

prompt: str, the prompt. 
prefix: list[int], tokenized prompt.
continuation: str, the response generated by the target model. 
tokens: list[int], tokenized continuation.
draft: list[int], the next tokens generated by the draft model, conditioned on the target model's generation.
log_p_7b: list[float], the log probabilities of the draft tokens predicted by the draft model (7b).
log_p_70b: list[float], the log probabilities of the draft tokens predicted by the target model (70b).
p_acc: list[float], the acceptance probabilities of the draft tokens.
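
The keys above do not spell out how p_acc relates to the two log-probability lists. Under the standard speculative sampling rule, a draft token is accepted with probability min(1, p_target / p_draft); the sketch below computes that value from log_p_7b and log_p_70b. Whether p_acc in these files was produced exactly this way is an assumption, and both the helper name acceptance_probs and the record shown are made up for illustration.

```python
import math


def acceptance_probs(log_p_draft, log_p_target):
    """Per-token min(1, p_target / p_draft), computed in log space."""
    if len(log_p_draft) != len(log_p_target):
        raise ValueError("log-probability lists must have the same length")
    # exp(min(0, log_target - log_draft)) equals min(1, p_target / p_draft)
    return [math.exp(min(0.0, lt - ld)) for ld, lt in zip(log_p_draft, log_p_target)]


# Hypothetical record containing only the fields used here
record = {
    "log_p_7b": [math.log(0.5), math.log(0.2)],
    "log_p_70b": [math.log(0.25), math.log(0.4)],
}
p_acc = acceptance_probs(record["log_p_7b"], record["log_p_70b"])
print(p_acc)
```

In this example the target model assigns the first token half the probability the draft model does, so it would be accepted about half the time; the second token is more likely under the target model, so its acceptance probability is capped at 1.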
