Allowing a single prompt to use several formats for one eval. #398
My random 2 cents: if we're doing few-shots too, it would be useful to see the metrics separately for each prompt style. At that point we could just define them as separate tasks, but maybe I'm misunderstanding the use case 🤔
The idea is that, for some evaluations, some models overfit a given prompt format, so we want to run a single evaluation across a range of prompt formats to mitigate this bias.
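To make the idea concrete, here is a minimal sketch of spreading several prompt formats over the samples of one eval, and of reporting the score per format as well as overall. Everything here (`fmt_plain`, `fmt_prefixed`, `fmt_short`, `build_prompts`, `accuracy_by_format`, the `question` key) is a hypothetical name chosen for illustration, not the library's API or what this PR implements.

```python
from collections import defaultdict

# Hypothetical prompt formats for the same underlying question.
def fmt_plain(doc):
    return f"{doc['question']}\nAnswer:"

def fmt_prefixed(doc):
    return f"Question: {doc['question']}\nAnswer:"

def fmt_short(doc):
    return f"Q: {doc['question']}\nA:"

PROMPT_FORMATS = [fmt_plain, fmt_prefixed, fmt_short]

def build_prompts(docs):
    """Assign formats round-robin so each one covers a share of the eval."""
    prompts = []
    for i, doc in enumerate(docs):
        fmt = PROMPT_FORMATS[i % len(PROMPT_FORMATS)]
        prompts.append({"format": fmt.__name__, "prompt": fmt(doc)})
    return prompts

def accuracy_by_format(prompts, correct):
    """Aggregate per-sample 0/1 scores per format and over the whole eval."""
    per_format = defaultdict(list)
    for prompt, score in zip(prompts, correct):
        per_format[prompt["format"]].append(score)
    report = {name: sum(scores) / len(scores) for name, scores in per_format.items()}
    report["all"] = sum(correct) / len(correct)
    return report
```

Keeping the format name next to each prompt is what makes the per-style breakdown possible without splitting the eval into separate tasks.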
Ok we're on the same page then! Currently I'm checking that by copying/generating the tasks with different prompt functions and looking at the
Hm, if you wanted to do this, I think the best would be to define one task per prompt function instead.
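For comparison, a rough sketch of the "one task per prompt function" alternative. `TaskSpec` is a hypothetical stand-in for a task config, and `expand_task`, `plain`, and `prefixed` are illustrative names only; the real task configuration class may look different.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical, stripped-down task config: a name plus a prompt function."""
    name: str
    prompt_function: Callable[[dict], str]

def plain(doc):
    return f"{doc['question']}\nAnswer:"

def prefixed(doc):
    return f"Question: {doc['question']}\nAnswer:"

def expand_task(base_name, prompt_functions):
    """Create one task per prompt function, named '<base>:<function name>'."""
    return [
        TaskSpec(name=f"{base_name}:{fn.__name__}", prompt_function=fn)
        for fn in prompt_functions
    ]

tasks = expand_task("my_eval", [plain, prefixed])
```

This gives per-format metrics for free, at the cost of running the full dataset once per format instead of once in total.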
Not sure how to implement prompt variation for few-shot, but I think we will need it.
We would need to refactor the few-shot management entirely. At the moment, all few-shot docs are loaded once, but here we would need to apply the formatting dynamically. It's doable but will be a bit messy imo.
This can be tested with the following custom task.
The only open question is the management of few-shot: do we want to force few-shot samples to have the same format as the current question, or do we just assume that these tests will be run 0-shot?
If we force the format of the few-shot samples, we'll need to reload them or change the loading system, since the few-shot sample shape is fixed at creation.
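One possible shape for the "change the loading system" option, as a sketch only: keep the few-shot docs raw and apply the format when the prompt is built, so the shots always match the current question. `LazyFewShot`, `fmt_plain`, `fmt_prefixed`, and the `question`/`answer` keys are assumptions made for this example and do not reflect the existing few-shot code.

```python
def fmt_plain(doc, with_answer=False):
    text = f"{doc['question']}\nAnswer:"
    return f"{text} {doc['answer']}" if with_answer else text

def fmt_prefixed(doc, with_answer=False):
    text = f"Question: {doc['question']}\nAnswer:"
    return f"{text} {doc['answer']}" if with_answer else text

class LazyFewShot:
    """Hypothetical few-shot pool that stores raw docs, not formatted strings."""

    def __init__(self, raw_docs):
        self.raw_docs = list(raw_docs)

    def build_prompt(self, doc, fmt, num_fewshot):
        # Format the shots at call time with the same function as the question.
        shots = [fmt(d, with_answer=True) for d in self.raw_docs[:num_fewshot]]
        return "\n\n".join(shots + [fmt(doc)])

pool = LazyFewShot([{"question": "1+1?", "answer": "2"}])
one_shot = pool.build_prompt({"question": "2+2?", "answer": "4"}, fmt_prefixed, 1)
zero_shot = pool.build_prompt({"question": "2+2?", "answer": "4"}, fmt_plain, 0)
```

With `num_fewshot=0` this reduces to the 0-shot case, so both options in the question above fit the same interface.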