-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix HellaSwag #2805
Fix HellaSwag #2805
Conversation
Unconditionally doesn't sound right, but there's a reason to add it if it's not there. if (!text.empty() && text.front() != ' ') result = "\xe2\x96\x81"; If just leaving a |
PR #2806 restores the pre-gguf accuracy. |
I can confirm this to be true, so strictly speaking one could just close this PR. Nevertheless, I still think it is worth having this change in the HellaSwag calculation as it makes it more robust to potential future tokenization changes. |
The reason I did separate the context and endings was to be sure to know exactly where the tokenized context ends. The decision to have the separation space in the endings was that is more compatible with sentencepiece, which works much better with spaces prepended to all words. |
Have you tested that this works with the bpe tokenizer ie Falcon-7b? F16 model should give 76.75 on 400 tasks. |
We should still expect the scores to be a couple of points lower than what you'd get with lm-evaluation-harness HellaSwag (with an AutoGPTQ model), right? Any insights into why the difference? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the different eps
value explain the remaining small difference that you observe after the fixes in the tokenizer?
Pre-GGUF used eps 5e-6
and GGUF uses 1e-5
It could be. Pre-GGUF I ran with |
I'm not able to test. Downloaded the pytorch files from HF and ran |
You are using the official model? https://huggingface.co/tiiuae/falcon-7b/tree/main |
Try parameter |
CUDA currently doesn't work with falcon when offloading the KV. IIRC, the maximum layers that can be offloaded for 7B is 33 and 60 for 40B. |
Yes.
Running the |
Cant find any F16 but here is Q8_0 and Q4_0, I have not tested if they works: |
Thank you, @slaren. This fixes it. I had missed this part (or better, it had slipped my mind). |
Yes, I now get HellaSwag = 76.75 for Falcon-7B after 400 tasks. |
Great! |
* master: (773 commits) server : add `/detokenize` endpoint (ggerganov#2802) convert.py : advanced option (ggerganov#2753) llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814) flake.nix : add rocm support and cleanup (ggerganov#2808) llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817) main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528) llama : use std::abs in llama_sample_tail_free (ggerganov#2800) k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811) Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807) Fix HellaSwag (ggerganov#2805) flake : build llama.cpp on Intel with nix (ggerganov#2795) Handle null rope scaling value (ggerganov#2793) Fix spm whitespaces (ggerganov#2806) examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804) llama : fix struct decl (ggerganov#2790) Faster perplexity computation (ggerganov#2786) llama : add llama_beam_search() (ggerganov#2267) convert.py : Get rope scale from HuggingFace models (ggerganov#2772) llama-bench : add model sizes (ggerganov#2771) convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773) ...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
The HellaSwag scores are 4-5 percentage points lower compared to what is posted in #2321. The score change occurred after #2398 was merged. The difference is due to changes in the tokenizer related to how space is being handled. In a HellaSwag task one evaluates 4 possible endings after a given context. These endings begin with space, and this leads to a different tokenization on current master compared to what we had before the GGUF changes in #2398. I tried adding the space to the context rather than the endings, but this did not improve the score. So, what this PR does to avoid space handling issues is to tokenize context+ending together, and then evaluate the tokens after the context tokens. This improves the HellaSwag score significantly. It is not exactly the same as what we had before #2398, but comes close: For LLaMA-v2-7B it is 76.75 (PR) vs 77.25 (before GGUF) after 400 tasks, 74.9 vs 75.2 after 1000 tasks, and 75.35 vs 75.4 after 2000 tasks.
Update: Final HellaSwag scores after 10042 tasks for LLaMA-v2-7B:
Update 2: The culprit is this line:
llama.cpp/llama.cpp
Line 3030 in bae5c5f
If I comment it out, I recover pre-GGUF HellaSwag scores. If I replace it with
I get the score of this PR without the change in the PR.
So, the question is: why was unconditionally adding an escaped white space to the string to be tokenized required?
@ggerganov @goerch