Fix LLaMA tokenization issue #531

gakada · 2023-05-29T13:42:10Z

As per the regression script:

model	flan-t5-small	gpt2	opt-125m	gpt-neo-125m	pythia-160m	llama-7b
boolq (master)	65.60 ± 1.50	49.90 ± 1.58	54.80 ± 1.57	60.90 ± 1.54	54.00 ± 1.58	71.10 ± 1.43
boolq (llama)	65.60 ± 1.50	49.90 ± 1.58	54.80 ± 1.57	60.90 ± 1.54	54.00 ± 1.58	73.90 ± 1.39
boolq (diff)	0	0	0	0	0	+2.8
lambada_openai (master)	8.80 ± 0.90	31.30 ± 1.47	38.10 ± 1.54	34.70 ± 1.51	32.20 ± 1.48	0
lambada_openai (llama)	8.80 ± 0.90	31.30 ± 1.47	38.10 ± 1.54	34.70 ± 1.51	32.20 ± 1.48	72.20 ± 1.42
lambada_openai (diff)	0	0	0	0	0	+72.2
winogrande (master)	49.30 ± 1.58	53.10 ± 1.58	51.20 ± 1.58	51.40 ± 1.58	52.30 ± 1.58	67.30 ± 1.48
winogrande (llama)	49.30 ± 1.58	53.10 ± 1.58	51.20 ± 1.58	51.40 ± 1.58	52.30 ± 1.58	70.20 ± 1.45
winogrande (diff)	0	0	0	0	0	+2.9
hellaswag (master)	28.70 ± 1.43	31.00 ± 1.46	30.50 ± 1.46	29.60 ± 1.44	29.40 ± 1.44	58.00 ± 1.56
hellaswag (llama)	28.70 ± 1.43	31.00 ± 1.46	30.50 ± 1.46	29.60 ± 1.44	29.40 ± 1.44	59.30 ± 1.55
hellaswag (diff)	0	0	0	0	0	+1.3
openbookqa (master)	12.00 ± 1.45	16.20 ± 1.65	17.20 ± 1.69	15.80 ± 1.63	15.00 ± 1.60	29.00 ± 2.03
openbookqa (llama)	12.00 ± 1.45	16.20 ± 1.65	17.20 ± 1.69	15.80 ± 1.63	15.00 ± 1.60	34.00 ± 2.12
openbookqa (diff)	0	0	0	0	0	+5
piqa (master)	59.30 ± 1.55	62.20 ± 1.53	62.00 ± 1.54	62.10 ± 1.53	60.80 ± 1.54	77.00 ± 1.33
piqa (llama)	59.30 ± 1.55	62.20 ± 1.53	62.00 ± 1.54	62.10 ± 1.53	60.80 ± 1.54	77.80 ± 1.31
piqa (diff)	0	0	0	0	0	+0.8
wikitext (master)	0	3738.14	3206.76	3430.35	3374.91	954.64
wikitext (llama)	0	3738.14	3206.76	3430.35	3374.91	954.64
wikitext (diff)	0	0	0	0	0	0

HaniItani · 2023-06-16T16:00:53Z

Hello @gakada and @haileyschoelkopf ,

I would really appreciate it if you can explain the fix to me and why it only affects LLaMA models.

gakada · 2023-06-17T02:37:14Z

@HaniItani it is because LLaMA tokenizer doesn't satisfy enc(a + b) = enc(a) + enc(b) (which was used in the prediction code, we want logits corresponding to b from model(a + b), so there should be a way to get an encoding for a + b and to separate the b part), so a weaker property should be used, like enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):] (otherwise it is the same, so a + b is encoded and b is separated), e.g.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b', device_map='auto')

enc = lambda s: tokenizer.encode(s, add_special_tokens=False)

def test(question_enc, answer_enc):
    # get b logprobs out of model(a + b)
    logprobs = F.log_softmax(model(torch.tensor(question_enc + answer_enc).unsqueeze(0).cuda())['logits'], dim=-1).cpu().squeeze(0)[len(question_enc) - 1 : -1]
    guess = logprobs.argmax(dim=-1)
    print(f'|{tokenizer.decode(guess)}| |{tokenizer.decode(answer_enc)}|')

# assume enc(a + b) = enc(a) + enc(b)

test(enc('Paris is the capital of'), enc(' France.'))
# |France2,| | France.|

# assume enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]

sentence_enc = enc('Paris is the capital of France.')
question_enc = enc('Paris is the capital of')
answer_enc = sentence_enc[len(question_enc):]

test(question_enc, answer_enc)
# |France.| |France.|

This doesn't affect other tokenizers, though in all cases it is also assumed that a doesn't end with spaces.

HaniItani · 2023-06-17T11:49:03Z

@gakada thank you very much for your elaborate response. It's all clear to me now.

PkuRainBow · 2023-06-18T07:09:57Z

+1

* Fix tokenization issue in BaseLM.loglikelihood * Add a regression script * Use entire non-continuation length as context --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

Fix tokenization issue in BaseLM.loglikelihood

ab0e828

gakada requested review from jon-tow and StellaAthena as code owners May 29, 2023 13:42

gakada force-pushed the llama branch from 094b80b to fa97b12 Compare May 29, 2023 13:51

Add a regression script

cd9b305

gakada force-pushed the llama branch from fa97b12 to cd9b305 Compare May 29, 2023 14:49

lintangsutawika mentioned this pull request Jun 1, 2023

Evalution of Llama on Hellaswag #539

Closed

Use entire non-continuation length as context

7cb56a2

haileyschoelkopf merged commit 23f3092 into EleutherAI:master Jun 1, 2023

gakada mentioned this pull request Jun 2, 2023

Move spaces from context to continuation #546

Merged

gakada mentioned this pull request Jun 13, 2023

MMLU task fix #497

Merged

FranxYao mentioned this pull request Jun 20, 2023

Result difference on MMLU for falcon 7b FranxYao/chain-of-thought-hub#37

Closed

hiyouga mentioned this pull request Oct 20, 2023

使用 evaluate.py 在 mmlu 上 evaluate llama2-7b-base 与llama2-7b-chat 效果远低于公开评测结果 hiyouga/LLaMA-Factory#1232

Closed

HishamAlyahya mentioned this pull request Dec 12, 2023

[FIX] Add _encode_pair() to vllm_causallms.py #1108

Closed

hash2430 mentioned this pull request Aug 23, 2024

llama3 tokenizer compatibility of DPOTrainer huggingface/trl#1960

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LLaMA tokenization issue #531

Fix LLaMA tokenization issue #531

gakada commented May 29, 2023 •

edited

Loading

HaniItani commented Jun 16, 2023

gakada commented Jun 17, 2023

HaniItani commented Jun 17, 2023

PkuRainBow commented Jun 18, 2023

Fix LLaMA tokenization issue #531

Fix LLaMA tokenization issue #531

Conversation

gakada commented May 29, 2023 • edited Loading

HaniItani commented Jun 16, 2023

gakada commented Jun 17, 2023

HaniItani commented Jun 17, 2023

PkuRainBow commented Jun 18, 2023

gakada commented May 29, 2023 •

edited

Loading