Allowing a single prompt to use several formats for one eval. #398

Merged (3 commits) on Nov 22, 2024

Conversation

@clefourrier (Member) commented on Nov 19, 2024

This can be tested with the following custom task.

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.default_prompts import LETTER_INDICES
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
    about what this function should do in the README.
    """
    
    ix = line["__index"]

    if ix % 3 == 0:
        # question, must predict correct choice
        return Doc(
            task_name=task_name,
            query=f"Question: {line['question']}\nAnswer:",
            choices=[f" {c}" for c in line["choices"]["text"]],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )

    if (ix + 1) % 3 == 0:
        # question + label + choice, must predict correct label
        query = f"Question: {line['question']}\n"
        query += "".join([f"\n{key}. {choice}" for key, choice in zip(LETTER_INDICES, line["choices"]["text"])])
        query += "\nAnswer:"
        return Doc(
            task_name=task_name,
            query=query,
            choices=LETTER_INDICES[: len(line["choices"]["text"])],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )
    
    if (ix + 2) % 3 == 0:
        # question + choice, must predict correct choice
        query = f"Question: {line['question']}\n"
        query += "".join([f"\n- {choice}" for key, choice in zip(LETTER_INDICES, line["choices"]["text"])])
        query += "\nAnswer:"
        return Doc(
            task_name=task_name,
            query=query,
            choices=line["choices"]["text"],
            gold_index=line["choices"]["label"].index(line["answerKey"]),
        )


task = LightevalTaskConfig(
    name="arc_multi_prompts",
    prompt_function=prompt_fn,
    hf_repo="ai2_arc",
    hf_subset="ARC-Challenge",
    evaluation_splits=["test"],
    generation_size=1,
    metric=[Metrics.loglikelihood_acc, Metrics.loglikelihood_acc_norm_nospace],
    trust_dataset=True,
    stop_sequence=["\n"],
)
TASKS_TABLE = [task]

if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
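
To eyeball the three formats without running the full harness, prompt_fn can be called directly on a fabricated line shaped like the ai2_arc schema (the example line below is made up for illustration, not taken from the dataset):

example = {
    "question": "What is the boiling point of water at sea level?",
    "choices": {"text": ["90°C", "100°C", "110°C", "120°C"], "label": ["A", "B", "C", "D"]},
    "answerKey": "B",
}
for ix in range(3):
    # __index selects the prompt variant, mirroring how the loader indexes dataset lines
    doc = prompt_fn({**example, "__index": ix}, task_name="arc_multi_prompts")
    print(f"--- variant {ix} ---")
    print(doc.query)
    print("choices:", doc.choices, "| gold:", doc.gold_index)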

The only open question is the management of few-shot examples: do we want to force the few-shot samples to use the same format as the current question, or do we just assume that these tasks will be run 0-shot?

If we force the few-shot samples to match, we'll need to reload them or change the loading system, since the few-shot sample shape is currently fixed at creation time.

@clefourrier requested a review from @NathanHB on November 19, 2024, 15:57
@anton-l (Member) commented on Nov 20, 2024

My random 2 cents: if we're doing few-shot as well, it would be useful to see the metrics separately for each prompt style. At that point we could just define them as separate tasks, but maybe I'm misunderstanding the use case 🤔

@clefourrier (Member, Author)

The idea is that, for some evaluations, some models overfit to a given prompt format, so we want to run a single evaluation across a range of prompt formats to mitigate this bias.

@anton-l (Member) commented on Nov 20, 2024

Ok we're on the same page then! Currently I'm checking that by copying/generating the tasks with different prompt functions and looking at the average results as well as the individual prompts' results to see which ones skew them.

@clefourrier (Member, Author)

Hm, if you wanted to do this, the best approach would be to define one task per prompt function instead, I think.
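
For reference, a minimal sketch of what "one task per prompt function" could look like, reusing prompt_fn, LightevalTaskConfig and Metrics from the snippet above (the task names and the variant-pinning helper are hypothetical, not part of this PR):

def make_variant_prompt_fn(variant: int):
    def fn(line, task_name: str = None):
        # Pin every line to one prompt variant by overriding __index
        return prompt_fn({**line, "__index": variant}, task_name=task_name)
    return fn

TASKS_TABLE = [
    LightevalTaskConfig(
        name=f"arc_prompt_variant_{variant}",
        prompt_function=make_variant_prompt_fn(variant),
        hf_repo="ai2_arc",
        hf_subset="ARC-Challenge",
        evaluation_splits=["test"],
        generation_size=1,
        metric=[Metrics.loglikelihood_acc, Metrics.loglikelihood_acc_norm_nospace],
        trust_dataset=True,
        stop_sequence=["\n"],
    )
    for variant in range(3)
]

Averaging the three per-task scores then recovers the mixed-prompt number while keeping each prompt style's accuracy visible on its own.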

@clefourrier merged commit 24d5feb into main on Nov 22, 2024
2 checks passed
@NathanHB (Member)

Not sure how to implement prompt variation for few-shot, but I think we will need it.

@clefourrier (Member, Author)

We would need to entirely refactor the few-shot management. At the moment, all few-shot docs are loaded once, but here we would need to dynamically apply the formatting. It's doable but will be a bit messy imo.
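
Roughly, a dynamic approach might format the few-shot lines lazily so they follow the variant chosen for the current doc. A hypothetical sketch, assuming the raw dataset lines are kept around (build_fewshot_context is not an existing lighteval function):

def build_fewshot_context(raw_fewshot_lines, current_line, prompt_fn, task_name, k):
    # Reuse the current doc's variant so the few-shot examples share its format
    variant = current_line["__index"] % 3
    examples = []
    for line in raw_fewshot_lines[:k]:
        doc = prompt_fn({**line, "__index": variant}, task_name=task_name)
        # Append the gold choice so each example is fully worked
        examples.append(doc.query + doc.choices[doc.gold_index])
    return "\n\n".join(examples) + "\n\n"

The catch is exactly the one above: the few-shot pool would have to be stored as raw lines rather than pre-built Doc objects.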
