
Add Winogrande evaluation #5015

Merged
ikawrakow merged 4 commits into master from ik/winogrande on Jan 18, 2024
Conversation

ikawrakow (Contributor) commented Jan 18, 2024

It is not the most efficient implementation, but a) I wanted to have something that appears to be working first, and b) evaluation time is not that long (70 seconds for the 1267 tasks of the Winogrande evaluation dataset with Mistral-7B using CUDA on an RTX 4080), so performance improvements are not as important as they would be for HellaSwag.

I'm not quite getting the scores reported on the HF leaderboard (HFLB). For the Winogrande evaluation dataset (see https://huggingface.co/datasets/ikawrakow/winogrande-eval-for-llama.cpp), which contains 1267 tasks, I get 73.56 vs the 78.37 reported on the HFLB for Mistral-7B. The statistical uncertainty (1-sigma) is 1.24, so there is only a tiny chance that the difference is purely statistical. On the other hand, we also get lower HellaSwag scores compared to the HFLB, so this is kind of expected.
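For reference, this 1-sigma value matches the standard binomial error for an accuracy $p$ estimated over $N$ tasks:

$$\sigma = 100\,\sqrt{\frac{p(1-p)}{N}} = 100\,\sqrt{\frac{0.7356\,(1-0.7356)}{1267}} \approx 1.24$$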

Interestingly enough, the Winogrande score varies quite a bit, depending on what parts of the context are included when computing the average log-likelihood, so perhaps I haven't found the right magic subset of tokens that maximizes the score.

Usage:

./perplexity -m model -f winogrande-debiased-eval.csv --winogrande [--winogrande-tasks N] [other params]

If --winogrande-tasks is omitted, all tasks in the dataset will be evaluated.
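For example, to evaluate 100 tasks with a Mistral-7B model (the model path here is a placeholder):

```
./perplexity -m models/mistral-7b.Q8_0.gguf -f winogrande-debiased-eval.csv --winogrande --winogrande-tasks 100
```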

Update:
I ran Winogrande on the extra-large Winogrande training dataset (40397 tasks) with Mistral-7B. I get 83.79 +/- 0.18, which is significantly higher than the HFLB value. A higher value is expected, since this is training data and models have most likely been trained on it, but it still gives confidence that the implementation is correct.

Commits

* winogrande: simple implementation

It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leaderboard. The 1-sigma statistical uncertainty for 1267 tasks is ~1.4, so there is no way the difference is due to statistics.

* winogrande: somewhat better

The Mistral-7B score is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before.

* winogrande: improving

The Mistral-7B score is now 73.56. Still not quite 78.4, but getting there. We are also getting a lower score on HellaSwag compared to the HF leaderboard, so I'm not expecting we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood (see the sketch after this commit list). This makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing the follow-up context, which skews the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions
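A minimal sketch of the token-selection rule described in the winogrande: improving commit, assuming per-token log-probabilities for each candidate sentence are already available; the struct and function names are illustrative, not the PR's actual code:

```cpp
#include <cstdio>
#include <vector>

// One scored token of a candidate sentence (illustrative, not the PR's code).
struct ScoredToken {
    float logprob;      // log p(token | preceding tokens), from the model
    bool  is_choice;    // token belongs to the substituted choice word(s)
    bool  is_end_punct; // trailing punctuation at the end of the sentence
};

// Average log-likelihood over the kept tokens: skip the choice tokens and
// the trailing punctuation, but only when the choice is not sentence-final.
static double avg_logprob(const std::vector<ScoredToken> & toks, bool choice_is_last) {
    double sum = 0.0;
    int    n   = 0;
    for (const ScoredToken & t : toks) {
        if (!choice_is_last && (t.is_choice || t.is_end_punct)) {
            continue; // a common choice word would otherwise inflate the score
        }
        sum += t.logprob;
        ++n;
    }
    return n > 0 ? sum/n : -1e30; // guard against an empty selection
}

int main() {
    // Toy numbers for the two filled-in sentences of one Winogrande task.
    std::vector<ScoredToken> option1 = {
        {-2.1f, false, false}, {-0.8f, true, false}, {-1.0f, false, false}, {-3.0f, false, true},
    };
    std::vector<ScoredToken> option2 = {
        {-2.1f, false, false}, {-2.5f, true, false}, {-0.4f, false, false}, {-2.9f, false, true},
    };

    const double s1 = avg_logprob(option1, /*choice_is_last=*/false);
    const double s2 = avg_logprob(option2, /*choice_is_last=*/false);
    std::printf("option %d wins (%.3f vs %.3f)\n", s1 >= s2 ? 1 : 2, s1, s2);
}
```

The sentence-final caveat matters: if the choice words are last, every token before them is identical across the two candidates, so skipping them would leave nothing that distinguishes the options.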
ikawrakow merged commit 682986a into master on Jan 18, 2024
39 of 47 checks passed
ikawrakow deleted the ik/winogrande branch on January 18, 2024 at 11:46