[BUG] Wrong instance passed in the inputs argument of .format_output() #780

Closed
bergr7 opened this issue Jul 12, 2024 · 4 comments · Fixed by #781
bergr7 commented Jul 12, 2024

Describe the bug
The first instance in the batch is always passed to the inputs argument of the .format_output() method of the Task class. This can create input-output mismatches, e.g. when the data in inputs is used to create metadata for the generated output.

Note that the correct instance is used for formatting the input and generating the output; only .format_output() receives the wrong one.

To work around this, I had to set input_batch_size=1.
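
For reference, a minimal sketch of that workaround (using the same task and llm defined in the reproduction below): with input_batch_size=1 every batch holds a single instance, so the instance passed to .format_output() is necessarily the right one, at the cost of throughput.

task = MyCustomTask(
    name="run_my_custom_task",
    llm=llm,
    input_batch_size=1,  # workaround: a batch of one cannot mismatch
)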

To Reproduce

I've created a simplified version of my code to reproduce the bug.

Code to reproduce

from distilabel.steps.tasks import Task
from distilabel.steps.tasks.typing import ChatType
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, KeepColumns

from typing import List, Dict, Any

class MyCustomTask(Task):
    @property
    def inputs(self) -> List[str]:
        return ["input", "metadata"]

    @property
    def outputs(self) -> List[str]:
        return ["output", "metadata"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        return [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": input["input"],
            },
        ]

    def format_output(self, output: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # ! inputs always store the first instance in the batch
        # ! If inputs is used for creating metadata or similar
        # ! it creates an input - output mismatch
        metadata = {
            "parent_record_id": inputs["metadata"]["record_id"],
            "parent_record_type": inputs["metadata"]["record_type"],
        }

        return {"output": output, "metadata": metadata}


llm = OpenAILLM(model="gpt-4o")

# some dummy data
data = [
    {"input": "Hello, how are you?", "metadata": {"record_id": 1, "record_type": "user"}},
    {"input": "I'm doing well, thanks!", "metadata": {"record_id": 2, "record_type": "assistant"}},
    {"input": "How can I help you today?", "metadata": {"record_id": 3, "record_type": "user"}},
    {"input": "I'd like to book a flight.", "metadata": {"record_id": 4, "record_type": "assistant"}},
    {"input": "Can you please provide me with the flight details?", "metadata": {"record_id": 5, "record_type": "user"}},
    {"input": "Sure, I'll book the flight for you.", "metadata": {"record_id": 6, "record_type": "assistant"}},
    {"input": "Thank you for your booking!", "metadata": {"record_id": 7, "record_type": "user"}},
]


with Pipeline("my_pipeline") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=data,
    )

    task = MyCustomTask(
        name="run_my_custom_task",
        llm=llm,
        input_batch_size=2,
    )

    output_cols = KeepColumns(
        name="output_cols",
        columns=["output", "metadata"],
    )

    load_dataset >> task >> output_cols


distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {"generation_kwargs": {"max_new_tokens": 10}}
        }
    },
    use_cache=False,
)
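
After the run, the mismatch is easy to spot by printing each row's metadata next to its output (a quick sanity check, assuming the usual Distiset layout where the single leaf step's data ends up under distiset["default"]["train"]):

# Each row's parent_record_id should identify the input that produced
# its output; with the bug and input_batch_size=2, both rows of each
# batch report the parent_record_id of the batch's first instance.
for row in distiset["default"]["train"]:
    print(row["metadata"]["parent_record_id"], "->", row["output"])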

Expected behaviour
inputs should contain the instance that was used to generate the output, instead of always the first instance in the batch.
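
In other words, the output-formatting loop should pair each generation with its own input row. A minimal sketch of the expected pairing (illustrative only; batch_inputs and batch_outputs are hypothetical names, not distilabel's actual internals):

# Pair each generation with the input row that produced it,
# rather than reusing batch_inputs[0] for the whole batch.
formatted = [
    task.format_output(output, input_row)
    for input_row, output in zip(batch_inputs, batch_outputs)
]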

Desktop (please complete the following information):

  • Package version: 1.2.1
  • Python version: 3.10.13
gabrielmbmb self-assigned this Jul 12, 2024

gabrielmbmb (Member) commented:
Hi @bergr7, thanks for reporting! I'll work on fixing this.

bergr7 (Author) commented Jul 12, 2024

> Hi @bergr7, thanks for reporting! I'll work on fixing this.

Many thanks @gabrielmbmb !! :)

gabrielmbmb linked a pull request (#781) on Jul 12, 2024 that will close this issue
gabrielmbmb (Member) commented:

Hi again @bergr7! We just released a new version, 1.2.2, with the bug fixed. Thanks again for reporting!

bergr7 (Author) commented Jul 15, 2024

> Hi again @bergr7! We just released a new version, 1.2.2, with the bug fixed. Thanks again for reporting!

Lightning fast! Thanks. I can confirm it's fixed and I've benefited from the fix already!!
