
Add Winogrande evaluation #5015

Merged
ikawrakow merged 4 commits into master from ik/winogrande on Jan 18, 2024
Conversation

ikawrakow (Contributor) commented Jan 18, 2024

It is not the most efficient implementation, but a) I wanted to have something that appears to be working first, and b) evaluation time is not that long (70 seconds for the 1267 tasks of the Winogrande evaluation dataset with Mistral-7B using CUDA on an RTX 4080), so performance improvements are not as important as they would be for HellaSwag.

I'm not quite getting the scores reported on the HF leaderboard (HFLB). For the Winogrande evaluation dataset (see https://huggingface.co/datasets/ikawrakow/winogrande-eval-for-llama.cpp), which contains 1267 tasks, I get 73.56 vs the 78.37 reported on the HFLB for Mistral-7B. The statistical uncertainty (1-sigma) is 1.24, so there is only a tiny chance that the difference is purely statistical. On the other hand, we also get lower HellaSwag scores compared to the HFLB, so this is kind of expected.
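For reference, this 1-sigma value matches the standard binomial error for an accuracy $p$ estimated over $N$ tasks:

$$\sigma = 100\,\sqrt{\frac{p(1-p)}{N}} = 100\,\sqrt{\frac{0.7356\,(1-0.7356)}{1267}} \approx 1.24$$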

Interestingly enough, the Winogrande score varies quite a bit, depending on what parts of the context are included when computing the average log-likelihood, so perhaps I haven't found the right magic subset of tokens that maximizes the score.

Usage:

./perplexity -m model -f winogrande-debiased-eval.csv --winogrande [--winogrande-tasks N] [other params]

If --winogrande-tasks is omitted, all tasks in the dataset will be evaluated.
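For example, to evaluate 100 tasks with a Mistral-7B model (the model path here is a placeholder):

```
./perplexity -m models/mistral-7b.Q8_0.gguf -f winogrande-debiased-eval.csv --winogrande --winogrande-tasks 100
```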

Update:
I ran Winogrande on the extra-large Winogrande training dataset (40397 tasks) with Mistral-7B. I get 83.79 +/- 0.18, which is significantly higher than the HFLB value. A higher value is expected, since this is training data and models have most likely been trained on it, but it still gives confidence that the implementation is correct.

Commits

* winogrande: simple implementation

It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leaderboard. The 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so there is no way the difference is due to statistics.

* winogrande: somewhat better

The Mistral-7B score is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before.

* winogrande: improving

The Mistral-7B score is now 73.56. Still not quite 78.4, but getting there. We are also getting a lower score on HellaSwag compared to the HF leaderboard, so I'm not expecting we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood (see the sketch after this commit list). This makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing the follow-up context, which skews the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions
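A minimal sketch of the token-selection rule described in the winogrande: improving commit, assuming per-token log-probabilities for each candidate sentence are already available; the struct and function names are illustrative, not the PR's actual code:

```cpp
#include <cstdio>
#include <vector>

// One scored token of a candidate sentence (illustrative, not the PR's code).
struct ScoredToken {
    float logprob;      // log p(token | preceding tokens), from the model
    bool  is_choice;    // token belongs to the substituted choice word(s)
    bool  is_end_punct; // trailing punctuation at the end of the sentence
};

// Average log-likelihood over the kept tokens: skip the choice tokens and
// the trailing punctuation, but only when the choice is not sentence-final.
static double avg_logprob(const std::vector<ScoredToken> & toks, bool choice_is_last) {
    double sum = 0.0;
    int    n   = 0;
    for (const ScoredToken & t : toks) {
        if (!choice_is_last && (t.is_choice || t.is_end_punct)) {
            continue; // a common choice word would otherwise inflate the score
        }
        sum += t.logprob;
        ++n;
    }
    return n > 0 ? sum/n : -1e30; // guard against an empty selection
}

int main() {
    // Toy numbers for the two filled-in sentences of one Winogrande task.
    std::vector<ScoredToken> option1 = {
        {-2.1f, false, false}, {-0.8f, true, false}, {-1.0f, false, false}, {-3.0f, false, true},
    };
    std::vector<ScoredToken> option2 = {
        {-2.1f, false, false}, {-2.5f, true, false}, {-0.4f, false, false}, {-2.9f, false, true},
    };

    const double s1 = avg_logprob(option1, /*choice_is_last=*/false);
    const double s2 = avg_logprob(option2, /*choice_is_last=*/false);
    std::printf("option %d wins (%.3f vs %.3f)\n", s1 >= s2 ? 1 : 2, s1, s2);
}
```

The sentence-final caveat matters: if the choice words are last, every token before them is identical across the two candidates, so skipping them would leave nothing that distinguishes the options.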
ikawrakow merged commit 682986a into master on Jan 18, 2024
39 of 47 checks passed
ikawrakow deleted the ik/winogrande branch on January 18, 2024 at 11:46