Allowing a single prompt to use several formats for one eval. #398

Merged (3 commits) on Nov 22, 2024

Conversation

@clefourrier (Member) commented on Nov 19, 2024

This can be tested with the following custom task.

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.default_prompts import LETTER_INDICES
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
    about what this function should do in the README.
    """
    
    ix = line["__index"]

    if ix % 3 == 0:
        # question, must predict correct choice
        return Doc(
            task_name=task_name,
            query=f"Question: {line['question']}\nAnswer:",
            choices=[f" {c}" for c in line["choices"]["text"]],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )

    if (ix + 1) % 3 == 0:
        # question + label + choice, must predict correct label
        query = f"Question: {line['question']}\n"
        query += "".join([f"\n{key}. {choice}" for key, choice in zip(LETTER_INDICES, line["choices"]["text"])])
        query += "\nAnswer:"
        return Doc(
            task_name=task_name,
            query=query,
            choices=LETTER_INDICES[: len(line["choices"]["text"])],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )
    
    if (ix + 2) % 3 == 0:
        # question + choice, must predict correct choice
        query = f"Question: {line['question']}\n"
        query += "".join([f"\n- {choice}" for key, choice in zip(LETTER_INDICES, line["choices"]["text"])])
        query += "\nAnswer:"
        return Doc(
            task_name=task_name,
            query=query,
            choices=line["choices"]["text"],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )


task = LightevalTaskConfig(
    name="arc_multi_prompts",
    prompt_function=prompt_fn,
    hf_repo="ai2_arc",
    hf_subset="ARC-Challenge",
    evaluation_splits=["test"],
    generation_size=1,
    metric=[Metrics.loglikelihood_acc, Metrics.loglikelihood_acc_norm_nospace],
    trust_dataset=True,
    stop_sequence=["\n"],
)
TASKS_TABLE = [task]

if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
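
To eyeball the three formats without running the full harness, prompt_fn can be called directly on a fabricated line shaped like the ai2_arc schema (the example line below is made up for illustration, not taken from the dataset):

example = {
    "question": "What is the boiling point of water at sea level?",
    "choices": {"text": ["90°C", "100°C", "110°C", "120°C"], "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}
for ix in range(3):
    # __index selects the prompt variant, mirroring how the loader indexes dataset lines
    doc = prompt_fn({**example, "__index": ix}, task_name="arc_multi_prompts")
    print(f"--- variant {ix} ---")
    print(doc.query)
    print("choices:", doc.choices, "| gold:", doc.gold_index)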

The only open question is the management of few-shot examples: do we want to force the few-shot samples to use the same format as the current question, or do we just assume that these tasks will be run 0-shot?

If we force the few-shot samples to match, we'll need to reload them or change the loading system, since the few-shot sample shape is currently fixed at creation time.

@clefourrier requested a review from @NathanHB on November 19, 2024, 15:57
@anton-l (Member) commented on Nov 20, 2024

My random 2 cents: if we're doing few-shot as well, it would be useful to see the metrics separately for each prompt style. At that point we could just define them as separate tasks, but maybe I'm misunderstanding the use case 🤔

@clefourrier (Member, Author)

The idea is that, for some evaluations, some models overfit to a given prompt format, so we want to run a single evaluation across a range of prompt formats to mitigate this bias.

@anton-l (Member) commented on Nov 20, 2024

Ok we're on the same page then! Currently I'm checking that by copying/generating the tasks with different prompt functions and looking at the average results as well as the individual prompts' results to see which ones skew them.

@clefourrier (Member, Author)

Hm, if you wanted to do this, the best approach would be to define one task per prompt function instead, I think.
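
For reference, a minimal sketch of what "one task per prompt function" could look like, reusing prompt_fn, LightevalTaskConfig and Metrics from the snippet above (the task names and the variant-pinning helper are hypothetical, not part of this PR):

def make_variant_prompt_fn(variant: int):
    def fn(line, task_name: str = None):
        # Pin every line to one prompt variant by overriding __index
        return prompt_fn({**line, "__index": variant}, task_name=task_name)
    return fn

TASKS_TABLE = [
    LightevalTaskConfig(
        name=f"arc_prompt_variant_{variant}",
        prompt_function=make_variant_prompt_fn(variant),
        hf_repo="ai2_arc",
        hf_subset="ARC-Challenge",
        evaluation_splits=["test"],
        generation_size=1,
        metric=[Metrics.loglikelihood_acc, Metrics.loglikelihood_acc_norm_nospace],
        trust_dataset=True,
        stop_sequence=["\n"],
    )
    for variant in range(3)
]

Averaging the three per-task scores then recovers the mixed-prompt number while keeping each prompt style's accuracy visible on its own.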

@clefourrier merged commit 24d5feb into main on Nov 22, 2024
2 checks passed
@NathanHB (Member)

Not sure how to implement prompt variation for few-shot, but I think we will need it.

@clefourrier (Member, Author)

We would need to entirely refactor the few-shot management. At the moment, all few-shot docs are loaded once, but here we would need to dynamically apply the formatting. It's doable but will be a bit messy imo.
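
Roughly, a dynamic approach might format the few-shot lines lazily so they follow the variant chosen for the current doc. A hypothetical sketch, assuming the raw dataset lines are kept around (build_fewshot_context is not an existing lighteval function):

def build_fewshot_context(raw_fewshot_lines, current_line, prompt_fn, task_name, k):
    # Reuse the current doc's variant so the few-shot examples share its format
    variant = current_line["__index"] % 3
    examples = []
    for line in raw_fewshot_lines[:k]:
        doc = prompt_fn({**line, "__index": variant}, task_name=task_name)
        # Append the gold choice so each example is fully worked
        examples.append(doc.query + doc.choices[doc.gold_index])
    return "\n\n".join(examples) + "\n\n"

The catch is exactly the one above: the few-shot pool would have to be stored as raw lines rather than pre-built Doc objects.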
