[FIX] Add _encode_pair() to vllm_causallms.py #1108
Closed
Most models using the Llama tokenizer currently show drastically different evaluation results when tested with the new vLLM implementation. The culprit is identical to a previously solved Llama tokenization issue with the HF implementation in this PR.
The `_encode_pair()` method is used for the HuggingFace model, where `whole_enc` and `context_enc` are encoded first before setting `continuation_enc` to be `whole_enc[len(context_enc):]`, while basic batch encoding is used for vLLM. The current vLLM implementation encodes the pair by simply running both inputs through the tokenizer, as such:
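A rough sketch of that pattern (illustrative only; the method name and `self.tokenizer` attribute are assumptions, not the exact lines from `vllm_causallms.py`):

```python
# Illustrative sketch of the separate-encoding path described above; the method
# name and `self.tokenizer` attribute are assumptions, not the actual code.
def _encode_pair_naive(self, context: str, continuation: str):
    # Context and continuation are tokenized independently of each other,
    # so the continuation never "sees" the context during tokenization.
    context_enc = self.tokenizer.encode(context, add_special_tokens=False)
    continuation_enc = self.tokenizer.encode(continuation, add_special_tokens=False)
    return context_enc, continuation_enc
```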
This inconsistency strongly affects the MMLU results: the current implementation's continuation encoding contains an extra empty-string token, while the `_encode_pair` approach returns only the token for the answer letter.
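A quick way to see the difference (a hedged illustration; the prompt strings are made up and exact token ids depend on the tokenizer version, but the length difference is the relevant part):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

context, continuation = "Answer:", " A"

# Separate encoding (current vLLM path): the leading space makes the Llama
# tokenizer emit an extra empty/underline piece before "A".
separate = tok.encode(continuation, add_special_tokens=False)

# _encode_pair-style encoding: encode the concatenation once, then slice off
# the context tokens, leaving only the token for the letter itself.
context_enc = tok.encode(context, add_special_tokens=False)
whole_enc = tok.encode(context + continuation, add_special_tokens=False)
sliced = whole_enc[len(context_enc):]

print(separate, sliced)  # lengths differ for LlamaTokenizer-based models
```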
The following are normalized accuracies for 0-shot evaluations on `meta-llama/Llama-2-7b-hf`:

This modification also addresses this issue, since `GeneZC/MiniMA-3B` uses the `LlamaTokenizer`.

From our testing, this discrepancy appears in most models that use `LlamaTokenizer`, such as `meta-llama/Llama-2-7b` and `openlm-research/open_llama_7b_v2`. However, some models, like `01-ai/Yi-6B`, do not show this issue. The only notable difference in their tokenizer config is that they set `add_bos_token=False`, which is not the default.
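For reference, a minimal sketch of an `_encode_pair()`-style helper in the spirit of the HF implementation (simplified, with names following the HF-class convention; see the PR diff for the exact code added):

```python
def _encode_pair(self, context: str, continuation: str):
    # Move any trailing whitespace from the context onto the continuation so
    # the split point falls on a clean token boundary.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]

    # Encode the concatenation once, then derive the continuation tokens by
    # slicing off the context tokens, mirroring the HuggingFace code path.
    whole_enc = self.tok_encode(context + continuation)
    context_enc = self.tok_encode(context)
    continuation_enc = whole_enc[len(context_enc):]
    return context_enc, continuation_enc
```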