Hellaswag scores #2389
Conversation
Haven't tested it, but it would be nice to see what numbers we get and whether this test is useful
I will update my post with results on different models. I am also working on another test, MMLU, which measures knowledge. Both tests make it easier to compare model capabilities than simply measuring how good the models are at predicting some random Wikipedia articles using perplexity.
I cannot reproduce the HellaSwag scores found on the HF leaderboard using this tool. With the scores fluctuating quite a bit at the default number of tasks of 400, I decided to run more tasks, and eventually ran the entire 10042-task dataset for vanilla LLaMA-1 and LLaMA-2. The graph shows the HellaSwag score for these two models as a function of the number of tasks processed.
Do you have an explanation for the difference? I'm using the latest master with CUDA on an RTX 4080. Is the HellaSwag implementation not quite the same as in the Python world, do we still have differences in tokenization that affect this particular test, or is it perhaps the small numerical difference from running on CUDA? Thanks!
Thanks for the feedback. Looking at the table in #2321 and comparing all 400-task scores with HF, there is a linear relationship (see figure below). With this relationship in mind, I thought that 400 tasks would be enough for comparing different models, even if the scores differed from the HF leaderboard.
I think the HF leaderboard might be filling the full ctx window with several sentences (with BOS/EOS in between), but I have not tried this approach yet. Doing it this way should also speed up computation considerably. The current implementation has only one sentence per ctx window. As you say, the tokenization could also be the problem here. I do not think the small CUDA difference should impact the scores this much, but I may be wrong. Also, I may have done something wrong with the implementation. I tried to follow the lm-evaluation-harness here: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hellaswag.py
the leaderboard says: "HellaSwag (10-shot)"
I am assuming 10-shot means 10 sentences in 1 context?
I guess that could work using n_ctx 2048, since the maximal sentence length in the datafile is 171 tokens (llama tokenizer), so ten sentences add up to at most 1710 tokens and fit in the context.
I think this is called 0-shot.
Thank you, both. I now understand the difference. Basically, we are using the validation dataset from https://rowanzellers.com/hellaswag and doing a 0-shot prediction. In the training data at that link, there are several training examples from each "activity". So, for the scores reported on the HF leaderboard, they pick 10 examples of the same "activity" (pick how? I'm unable to find information on that) and add them to the context, along with the validation sentence that needs to be continued. For example, if I look at the 1st task of the validation set, which is from the activity "Roof shingle removal", there are 64 "Roof shingle removal" examples in the training dataset, so one can somehow pick 10 of those and add them to the context (using BOS/EOS between them?). If this interpretation is correct, I'm moderately surprised that the improvement from 0-shot to 10-shot is so small.
Using EOS+BOS between the sentences should effectively reset the logprobs, so it should work by just filling the context with 10 sentences. The logprobs for the ending of each sentence should be computed individually. No need to group by activity: the activity is just a label put in front of each sentence, which is how lm-evaluation-harness uses the activity label.
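As a rough illustration of the packing idea (not the actual llama.cpp code), something like the sketch below could build one such context; `tokenize`, `BOS_ID` and `EOS_ID` are placeholders for the real tokenizer and the LLaMA special-token ids:

```python
# Sketch of packing several sentences into one context window, separated
# by EOS+BOS. `tokenize` is a placeholder; BOS_ID/EOS_ID are the assumed
# LLaMA special-token ids.
BOS_ID = 1
EOS_ID = 2

def pack_context(sentences, tokenize, n_ctx=2048):
    tokens = []
    spans = []  # (start, end) token positions of each packed sentence
    for text in sentences:
        piece = [BOS_ID] + tokenize(text) + [EOS_ID]
        if len(tokens) + len(piece) > n_ctx:
            break
        spans.append((len(tokens), len(tokens) + len(piece)))
        tokens.extend(piece)
    return tokens, spans
```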
This is the first task in my preprocessed dataset:
The first line is the beginning of the sentence. Now we combine these into 4 sentences and measure the probabilities of the endings only. If the sentence with the highest probability is the "correct" one, the model is awarded one point. The resulting score is the percentage of tasks with a "correct" prediction.
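To make the scoring rule concrete, here is a minimal sketch of how a single task could be scored; `tokenize` and `log_probs` stand in for the real model interface and are not part of llama.cpp:

```python
# Minimal sketch of the scoring rule described above. `tokenize(text)`
# returns token ids and `log_probs(tokens)` returns the log-probability
# of each token given the tokens before it - both are placeholders.
def score_task(beginning, endings, gold_index, tokenize, log_probs):
    begin_tokens = tokenize(beginning)
    ending_scores = []
    for ending in endings:
        end_tokens = tokenize(ending)
        logp = log_probs(begin_tokens + end_tokens)
        # Only the ending tokens contribute to the score.
        ending_scores.append(sum(logp[len(begin_tokens):]))
    best = max(range(len(endings)), key=lambda i: ending_scores[i])
    return 1 if best == gold_index else 0

# Final score: 100 * sum of points / number of tasks.
```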
If it is the tokenizer in llama.cpp that is the problem, maybe we could try tokenizing the dataset using the Python tokenizer and use that as input instead. The beginning and the endings should be tokenized separately so we know where the endings start.
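A small script along these lines could do that pre-tokenization with the original SentencePiece model; the file name and the leading-space handling are assumptions, not something verified against llama.cpp:

```python
# Sketch: pre-tokenize the data with the original SentencePiece model so
# llama.cpp could consume token ids directly. "tokenizer.model" is a
# placeholder path; the leading space before each ending is an assumption
# about how the endings should be joined to the beginning.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize_task(beginning, endings):
    begin_ids = sp.encode(beginning)
    ending_ids = [sp.encode(" " + ending) for ending in endings]
    return begin_ids, ending_ids
```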
This removes the simple HellaSwag perplexity lines added in PR #2312 and replaces them with a real HellaSwag score test. The simple test was found to be too inaccurate.
The HellaSwag test needs a datafile extracted from the official HellaSwag dataset, which can be found here: klosax/hellaswag_text_data.
Parameters added: --hellaswag and --hellaswag-tasks. See my post #2321 for more information. A hypothetical invocation is shown below.
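For reference, running the perplexity example with the new parameters could look like this; the model path and datafile name are placeholders, with the datafile being the one from klosax/hellaswag_text_data:

```sh
# Placeholder paths; adjust to your setup.
./perplexity -m models/7B/ggml-model-f16.bin -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 400
```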