
[FIX] Add _encode_pair() to vllm_causallms.py #1108

Closed

Conversation

HishamAlyahya
Contributor

Most models using the Llama tokenizer currently show drastically different evaluation results when tested with the new vLLM implementation. The culprit is identical to a previously solved Llama tokenization issue with the HF implementation in this PR.

For the HuggingFace model, the _encode_pair() method encodes whole_enc and context_enc first and then sets continuation_enc to whole_enc[len(context_enc):], whereas the vLLM implementation uses basic batch encoding.
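For reference, a minimal sketch of that approach (the standalone function name `encode_pair`, taking the tokenizer as an argument, and the exact whitespace handling are illustrative simplifications, not the harness code verbatim):

```python
# Sketch: encode the concatenation once, then slice off the context tokens.
# `tokenizer` is any Hugging Face tokenizer; names here are illustrative.
def encode_pair(tokenizer, context: str, continuation: str):
    # Move trailing whitespace from the context onto the continuation so the
    # split point matches how the concatenated string is tokenized.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]

    whole_enc = tokenizer(context + continuation, add_special_tokens=False).input_ids
    context_enc = tokenizer(context, add_special_tokens=False).input_ids

    # The continuation tokens are whatever the concatenation adds beyond the context.
    continuation_enc = whole_enc[len(context_enc):]
    return context_enc, continuation_enc
```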

The current vLLM implementation instead encodes the pair by simply running both inputs through the tokenizer:

```python
context_enc, continuation_enc = self.tokenizer(
    [context, continuation],
    truncation="do_not_truncate",
    add_special_tokens=False,
    return_attention_mask=False,
).input_ids
```

This inconsistency strongly affects the MMLU results: in the current implementation the continuation encoding contains an extra empty-string token, while the _encode_pair approach returns only the token for the answer letter.
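To make the extra token concrete, here is an illustrative check with the Llama-2 tokenizer (the token ids in the comments are what a SentencePiece/Llama-style tokenizer is expected to produce and are shown only as an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

context = "Answer:"
continuation = " A"

# Encoding the continuation on its own (current vLLM path).
separate = tok(continuation, add_special_tokens=False).input_ids

# Encoding the concatenation and slicing (the _encode_pair approach).
whole = tok(context + continuation, add_special_tokens=False).input_ids
ctx = tok(context, add_special_tokens=False).input_ids
sliced = whole[len(ctx):]

print(separate)  # expected: an extra leading "empty" piece, e.g. [29871, 319]
print(sliced)    # expected: just the letter token, e.g. [319]
```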

The following are normalized accuracies for 0-shot evaluations on meta-llama/Llama-2-7b-hf:

| Implementation | ARC Easy | ARC Challenge | MMLU |
|----------------|----------|---------------|--------|
| HF             | 74.579   | 46.331        | 41.775 |
| vLLM (current) | 53.577   | 40.529        | 38.769 |
| vLLM (fixed)   | 74.58    | 46.25         | 41.82  |

This modification also addresses this issue, since GeneZC/MiniMA-3B uses the LlamaTokenizer.

From our testing, it seems that this discrepancy appears in most models that use LlamaTokenizer, such as meta-llama/Llama-2-7b and openlm-research/open_llama_7b_v2. However, some models, like 01-ai/Yi-6B, do not show this issue. The only notable difference in its tokenizer config is that it sets add_bos_token=False, which is not the default.
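For what it's worth, that setting can be inspected directly on the loaded tokenizers; this is only a quick illustration using the standard transformers attribute, which may not be exposed on every tokenizer class:

```python
from transformers import AutoTokenizer

# Compare the add_bos_token setting of the two tokenizers mentioned above.
for name in ["meta-llama/Llama-2-7b-hf", "01-ai/Yi-6B"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    print(name, getattr(tok, "add_bos_token", None))
```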

@CLAassistant

CLAassistant commented Dec 12, 2023

CLA assistant check
All committers have signed the CLA.

@haileyschoelkopf
Collaborator

Thanks very much for the PR! This fix is also in #1035, but we'll merge this one first.

StellaAthena
StellaAthena previously approved these changes Dec 14, 2023
@StellaAthena StellaAthena self-requested a review December 18, 2023 16:33
@haileyschoelkopf
Collaborator

Closing as completed in #1035, sorry for the confusion, and thank you for the contribution!
