
Hellaswag scores #2389

Merged 10 commits from klosax:hellaswag_scores into ggerganov:master on Jul 28, 2023

Conversation

@klosax (Contributor) commented Jul 25, 2023

This removes the simple HellaSwag perplexity-lines test added in PR #2312 and replaces it with a real HellaSwag score test. The simple test was found to be too inaccurate.

The HellaSwag test needs a datafile extracted from the official HellaSwag dataset, which can be found at klosax/hellaswag_text_data.

Parameters added: --hellaswag and --hellaswag-tasks.

See my post #2321 for more information.

@ggerganov (Owner) left a comment

Haven't tested it, but it would be nice to see what numbers we get and see if this test is useful

@ggerganov merged commit 8a88e58 into ggerganov:master on Jul 28, 2023
@klosax (Contributor, Author) commented Jul 28, 2023

Haven't tested it, but it would be nice to see what numbers we get and see if this test is useful

I will update my post with results for different models. I am also working on another test, MMLU, which measures knowledge. Both tests make it easier to compare model capabilities than simply measuring, via perplexity, how well the models predict some random Wikipedia articles.

@klosax deleted the hellaswag_scores branch on July 28, 2023 at 19:33
@ikawrakow (Contributor) commented Aug 18, 2023

@klosax

I cannot reproduce the HellaSwag scores found on the HF leaderboard using this tool. With the scores fluctuating quite a bit at the default of 400 tasks, I decided to run more tasks, and eventually ran the entire 10042-task dataset for vanilla LLaMA-1 and LLaMA-2. The graph shows the HellaSwag score for fp16 as a function of the number of tasks completed. We see that

  • As implemented here, the final scores of 75.8 and 75.4 are significantly lower than the scores of 78.6 and 77.8 found on HF.
  • The gap of ~0.4 between LLaMA-1 and LLaMA-2 computed with this PR is only about half of the ~0.8 gap on the HF leaderboard.

Do you have an explanation for the difference? I'm using the latest master with CUDA on an RTX 4080. Is the HellaSwag implementation not quite the same as in the Python world? Do we still have differences in tokenization that affect this particular test? Is the llama.cpp CUDA implementation perhaps not producing quite the same results as PyTorch? All of the above, or something else?

Thanks!

[Figure hella_7B_fp16: HellaSwag score for the 7B fp16 models as a function of the number of tasks completed]

@klosax (Contributor, Author) commented Aug 18, 2023

Thanks for the feedback.

Looking at the table in #2321 and comparing all 400-task scores with HF, there is a linear relationship (see figure below). With this relationship in mind, I thought that 400 tasks would be enough for comparing different models, even if the scores differ from the HF leaderboard.

[Figure hellaswag_chart: linear relationship between the 400-task scores and the HF leaderboard scores]

Do you have an explanation for the difference?

I think the HF leaderboard might be filling the full ctx window with several sentences (with BOS/EOS in between), but I have not tried this approach yet. Doing it this way should also speed up the computation considerably. The current implementation has only one sentence per ctx window.

As you say, the tokenization could also be the problem here. I do not think the small CUDA difference should impact the scores this much, but I may be wrong.

Also, I may have done something wrong with the implementation. I tried to follow the lm-evaluation-harness here: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hellaswag.py

@Green-Sky (Collaborator) commented Aug 18, 2023

the leaderboard says:

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.

I am assuming 10-shot means 10 sentences in 1 context?
edit: or 11 (10 previous)

@klosax (Contributor, Author) commented Aug 18, 2023

I am assuming 10-shot means 10 sentences in 1 context?

I guess that could work with n_ctx 2048, since the maximal sentence length in the datafile is 171 tokens (llama tokenizer), so even 11 such sentences would fit (11 × 171 ≈ 1900 tokens).

@Green-Sky (Collaborator)

The current implementation have only one sentence per ctx window.

I think this is called 0-shot.

@ikawrakow (Contributor)

Thank you, both. I now understand the difference. Basically, we are using the validation dataset from https://rowanzellers.com/hellaswag and doing a 0-shot prediction. In the training data at that link, there are several training examples from each "activity". So, for the scores reported on the HF leaderboard, they pick 10 examples of the same "activity" (picked how? I'm unable to find information on that) and add them to the context, along with the validation sentence that needs to be continued. For example, the 1st task of the validation set is from the activity "Roof shingle removal", and there are 64 "Roof shingle removal" examples in the training dataset, so one can somehow pick 10 of those and add them to the context (using BOS/EOS between them?). If this interpretation is correct, I'm moderately surprised that the improvement from 0-shot to 10-shot is so small.

@klosax (Contributor, Author) commented Aug 18, 2023

Using EOS+BOS between the sentences should effectively reset the logprobs, so it should work to just fill the context with 10 sentences. The logprobs for the ending of each sentence should be computed individually. There is no need to group by activity: the activity is a label put in front of each sentence, and that is how lm-evaluation-harness uses the activity label.
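
A minimal sketch of what such context packing could look like (an illustration of the idea, not the PR's implementation), assuming a generic tokenize function and the usual LLaMA special-token ids BOS = 1 and EOS = 2:

```python
# Hypothetical sketch: pack several HellaSwag sentences into one context,
# separated by EOS+BOS, and remember which token positions belong to each
# ending so that only those logprobs are scored later.
BOS_ID = 1  # assumed LLaMA special-token ids
EOS_ID = 2

def pack_context(sentences, tokenize, n_ctx=2048):
    """sentences: list of (beginning, ending) strings.
    Returns (token ids, per-sentence (start, end) slices of the ending tokens)."""
    tokens = [BOS_ID]
    ending_slices = []
    for beginning, ending in sentences:
        begin_toks = tokenize(beginning)
        end_toks = tokenize(ending)
        # +2 for the EOS+BOS separator appended after this sentence
        if len(tokens) + len(begin_toks) + len(end_toks) + 2 > n_ctx:
            break  # context window is full
        tokens += begin_toks
        start = len(tokens)
        tokens += end_toks
        ending_slices.append((start, len(tokens)))  # score only these positions
        tokens += [EOS_ID, BOS_ID]
    return tokens, ending_slices
```

A single batched evaluation of the packed context would then yield the logits for all packed endings at once, which is where the expected speed-up would come from.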

@klosax (Contributor, Author) commented Aug 18, 2023

This is the first task in my preprocessed dataset:

Roof shingle removal: A man is sitting on a roof. he
3
is using wrap to wrap a pair of skis.
is ripping level tiles off.
is holding a rubik's cube.
starts pulling up roofing on a roof.

The first line is the beginning of the sentence (activity_label + ctx from the original dataset).
The second line is the id of the "correct" ending, in this case 3, i.e. the last ending.
The remaining lines are the four different endings.

Now we combine these into 4 full sentences and measure the probabilities of the endings only. If the sentence with the highest probability is the "correct" one, the model is awarded one point. The resulting score is the percentage of tasks with a "correct" prediction.
We randomize the order of all tasks to get a better measurement when evaluating only 400 tasks.
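
As an illustration only (not the code in this PR), a rough Python sketch of that scoring loop, assuming the datafile stores each task as six consecutive lines as shown above, and assuming a hypothetical ending_logprob(beginning, ending) model call that returns the summed log-probabilities of the ending tokens given the beginning:

```python
import random

def load_tasks(path):
    """Parse the preprocessed datafile: six lines per task
    (beginning, index of the correct ending, four endings)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    tasks = []
    for i in range(0, len(lines) - 5, 6):
        beginning = lines[i]
        gold = int(lines[i + 1])
        endings = lines[i + 2:i + 6]
        tasks.append((beginning, gold, endings))
    return tasks

def hellaswag_score(tasks, ending_logprob, n_tasks=400, seed=42):
    """Accuracy in percent over a randomized subset of tasks."""
    tasks = list(tasks)
    random.Random(seed).shuffle(tasks)  # randomize so a 400-task subset is a fairer sample
    tasks = tasks[:n_tasks]
    correct = 0
    for beginning, gold, endings in tasks:
        # sum of log-probs of the ending tokens only, for each candidate ending
        scores = [ending_logprob(beginning, ending) for ending in endings]
        if scores.index(max(scores)) == gold:
            correct += 1
    return 100.0 * correct / len(tasks)
```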

@klosax (Contributor, Author) commented Aug 18, 2023

If it is the tokenizer in llama.cpp that is the problem, maybe we could try tokenizing the dataset with the Python tokenizer and using that as input instead. The beginning and the endings should be tokenized separately, so we know where the endings start.
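
For instance, a small sketch of such a preprocessing step using the sentencepiece package and the original LLaMA tokenizer model; the file path and the leading-space handling are assumptions, not something this PR prescribes:

```python
import sentencepiece as spm

# Original LLaMA SentencePiece tokenizer (path is an assumption)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize_task(beginning, endings):
    """Tokenize the beginning and each ending separately, so the evaluator
    knows exactly at which token position each ending starts."""
    return {
        "beginning": sp.encode(beginning),
        # Leading space so each ending tokenizes as a continuation of the
        # beginning; this is an assumption -- it would need to match how the
        # concatenated sentence would have been tokenized.
        "endings": [sp.encode(" " + ending) for ending in endings],
    }

# Usage with the first task shown earlier in this thread:
task = tokenize_task(
    "Roof shingle removal: A man is sitting on a roof. he",
    [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
)
```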
