Hellaswag scores #2389
Conversation
Haven't tested it, but it would be nice to see what numbers we get and whether this test is useful
I will update my post with results on different models. I am also working on another test, MMLU, which measures knowledge. Both tests make it easier to compare model capabilities than simply measuring how good the models are at predicting some random Wikipedia articles using perplexity.
I cannot reproduce the HellaSwag scores found on the HF leaderboard using this tool. With the scores fluctuating quite a bit at the default number of tasks of 400, I decided to run more tasks, and eventually ran the entire 10042-task dataset for vanilla LLaMA-1 and LLaMA-2. The graph shows the HellaSwag score for these two models as a function of the number of tasks processed.
Do you have an explanation for the difference? I'm using the latest master with CUDA on an RTX 4080. Is the HellaSwag implementation not quite the same as in the Python world, do we still have differences in tokenization that affect this particular test, or is it perhaps the small numerical difference from running on CUDA? Thanks!
Thanks for the feedback. Looking at the table in #2321 and comparing all 400-task scores with HF, there is a linear relationship (see figure below). With this relationship in mind, I thought that 400 tasks would be enough for comparing different models, even if the scores differed from the HF leaderboard.
I think the HF leaderboard might be filling the full ctx window with several sentences (with BOS/EOS in between), but I have not tried this approach yet. Doing it this way should also speed up computation considerably. The current implementation has only one sentence per ctx window. As you say, the tokenization could also be the problem here. I do not think the small CUDA difference should impact the scores this much, but I may be wrong. Also, I may have done something wrong with the implementation. I tried to follow the lm-evaluation-harness here: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hellaswag.py
the leaderboard says: "HellaSwag (10-shot)"
I am assuming 10-shot means 10 sentences in 1 context?
I guess that could work using n_ctx 2048, since the maximal sentence length in the datafile is 171 tokens (llama tokenizer), so ten sentences add up to at most 1710 tokens and fit in the context.
I think this is called 0-shot.
Thank you, both. I now understand the difference. Basically, we are using the validation dataset from https://rowanzellers.com/hellaswag and doing a 0-shot prediction. In the training data at that link, there are several training examples from each "activity". So, for the scores reported on the HF leaderboard, they pick 10 examples of the same "activity" (pick how? I'm unable to find information on that) and add them to the context, along with the validation sentence that needs to be continued. For example, if I look at the 1st task of the validation set, which is from the activity "Roof shingle removal", there are 64 "Roof shingle removal" examples in the training dataset, so one can somehow pick 10 of those and add them to the context (using BOS/EOS between them?). If this interpretation is correct, I'm moderately surprised that the improvement from 0-shot to 10-shot is so small.
Using EOS+BOS between the sentences should effectively reset the logprobs, so it should work by just filling the context with 10 sentences. The logprobs for the ending of each sentence should be computed individually. No need to group by activity: the activity is just a label put in front of each sentence, which is how lm-evaluation-harness uses the activity label.
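As a rough illustration of the packing idea (not the actual llama.cpp code), something like the sketch below could build one such context; `tokenize`, `BOS_ID` and `EOS_ID` are placeholders for the real tokenizer and the LLaMA special-token ids:

```python
# Sketch of packing several sentences into one context window, separated
# by EOS+BOS. `tokenize` is a placeholder; BOS_ID/EOS_ID are the assumed
# LLaMA special-token ids.
BOS_ID = 1
EOS_ID = 2

def pack_context(sentences, tokenize, n_ctx=2048):
    tokens = []
    spans = []  # (start, end) token positions of each packed sentence
    for text in sentences:
        piece = [BOS_ID] + tokenize(text) + [EOS_ID]
        if len(tokens) + len(piece) > n_ctx:
            break
        spans.append((len(tokens), len(tokens) + len(piece)))
        tokens.extend(piece)
    return tokens, spans
```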
This is the first task in my preprocessed dataset:
The first line is the beginning of the sentence. Now we combine these into 4 sentences and measure the probabilities of the endings only. If the sentence with the highest probability is the "correct" one, the model is awarded one point. The resulting score is the percentage of tasks with a "correct" prediction.
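To make the scoring rule concrete, here is a minimal sketch of how a single task could be scored; `tokenize` and `log_probs` stand in for the real model interface and are not part of llama.cpp:

```python
# Minimal sketch of the scoring rule described above. `tokenize(text)`
# returns token ids and `log_probs(tokens)` returns the log-probability
# of each token given the tokens before it - both are placeholders.
def score_task(beginning, endings, gold_index, tokenize, log_probs):
    begin_tokens = tokenize(beginning)
    ending_scores = []
    for ending in endings:
        end_tokens = tokenize(ending)
        logp = log_probs(begin_tokens + end_tokens)
        # Only the ending tokens contribute to the score.
        ending_scores.append(sum(logp[len(begin_tokens):]))
    best = max(range(len(endings)), key=lambda i: ending_scores[i])
    return 1 if best == gold_index else 0

# Final score: 100 * sum of points / number of tasks.
```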
If it is the tokenizer in llama.cpp that is the problem, maybe we could try tokenizing the dataset using the Python tokenizer and use that as input instead. The beginning and the endings should be tokenized separately so we know where the endings start.
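A small script along these lines could do that pre-tokenization with the original SentencePiece model; the file name and the leading-space handling are assumptions, not something verified against llama.cpp:

```python
# Sketch: pre-tokenize the data with the original SentencePiece model so
# llama.cpp could consume token ids directly. "tokenizer.model" is a
# placeholder path; the leading space before each ending is an assumption
# about how the endings should be joined to the beginning.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize_task(beginning, endings):
    begin_ids = sp.encode(beginning)
    ending_ids = [sp.encode(" " + ending) for ending in endings]
    return begin_ids, ending_ids
```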
This removes the simple HellaSwag perplexity lines added in PR #2312 and replaces them with a real HellaSwag score test. The simple test was found to be too inaccurate.
The HellaSwag test needs a datafile extracted from the official HellaSwag dataset, which can be found here: klosax/hellaswag_text_data.
Parameters added: --hellaswag and --hellaswag-tasks. See my post #2321 for more information. A hypothetical invocation is shown below.
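For reference, running the perplexity example with the new parameters could look like this; the model path and datafile name are placeholders, with the datafile being the one from klosax/hellaswag_text_data:

```sh
# Placeholder paths; adjust to your setup.
./perplexity -m models/7B/ggml-model-f16.bin -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 400
```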