Describe the bug
The first instance in the batch is always passed to the `inputs` argument of the `.format_output()` method of the `Task` class. This can create input-output mismatches, e.g. when the data in `inputs` is used to create metadata for the generated output.
Note that the right instance is used for formatting the input and generating the output.
To work around this, I had to set `input_batch_size=1`.
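The workaround is effective because with single-item batches, the first (and only) instance in each batch is necessarily the one being processed. A plain-Python sketch of this reasoning (not distilabel's actual internals):

```python
# Sketch of why input_batch_size=1 avoids the mismatch: chunking the data
# into single-item batches makes batch[0] always the matching instance.
data = [{"record_id": i} for i in range(1, 5)]

# batch size 1: every batch holds exactly one instance
batches = [data[i:i + 1] for i in range(0, len(data), 1)]

# Even a buggy "always use batch[0]" pairing is correct here,
# because batch[0] is the only instance in the batch.
paired = [batch[0]["record_id"] for batch in batches]
print(paired)  # [1, 2, 3, 4]
```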
To Reproduce
I've created a simplified version of my code to reproduce the bug.
Code to reproduce
```python
from typing import Any, Dict, List

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, KeepColumns
from distilabel.steps.tasks import Task
from distilabel.steps.tasks.typing import ChatType


class MyCustomTask(Task):
    @property
    def inputs(self) -> List[str]:
        return ["input", "metadata"]

    @property
    def outputs(self) -> List[str]:
        return ["output", "metadata"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        return [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": input["input"],
            },
        ]

    def format_output(self, output: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # ! inputs always stores the first instance in the batch
        # ! If inputs is used for creating metadata or similar,
        # ! it creates an input-output mismatch
        metadata = {
            "parent_record_id": inputs["metadata"]["record_id"],
            "parent_record_type": inputs["metadata"]["record_type"],
        }
        return {"output": output, "metadata": metadata}


llm = OpenAILLM(model="gpt-4o")

# some dummy data
data = [
    {"input": "Hello, how are you?", "metadata": {"record_id": 1, "record_type": "user"}},
    {"input": "I'm doing well, thanks!", "metadata": {"record_id": 2, "record_type": "assistant"}},
    {"input": "How can I help you today?", "metadata": {"record_id": 3, "record_type": "user"}},
    {"input": "I'd like to book a flight.", "metadata": {"record_id": 4, "record_type": "assistant"}},
    {"input": "Can you please provide me with the flight details?", "metadata": {"record_id": 5, "record_type": "user"}},
    {"input": "Sure, I'll book the flight for you.", "metadata": {"record_id": 6, "record_type": "assistant"}},
    {"input": "Thank you for your booking!", "metadata": {"record_id": 7, "record_type": "user"}},
]

with Pipeline("my_pipeline") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=data,
    )
    task = MyCustomTask(
        name="run_my_custom_task",
        llm=llm,
        input_batch_size=2,
    )
    output_cols = KeepColumns(
        name="output_cols",
        columns=["output", "metadata"],
    )
    load_dataset >> task >> output_cols

distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {"generation_kwargs": {"max_new_tokens": 10}}
        }
    },
    use_cache=False,
)
```
Expected behaviour
`inputs` contains the instance used for generating the output, instead of always the first instance in the batch.
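To make the difference concrete, here is a standalone illustration of the reported mismatch in plain Python (a sketch of the pairing logic, not distilabel's actual internals; the `out-*` strings are placeholder generations):

```python
# Two instances in one batch, plus one generated output per instance.
batch = [
    {"input": "Hello, how are you?", "metadata": {"record_id": 1}},
    {"input": "I'm doing well, thanks!", "metadata": {"record_id": 2}},
]
outputs = ["out-1", "out-2"]  # one generation per input, in order

# Buggy pairing: format_output always receives batch[0] as `inputs`.
buggy = [
    {"output": out, "parent_record_id": batch[0]["metadata"]["record_id"]}
    for out in outputs
]

# Expected pairing: each output is matched with the input that produced it.
expected = [
    {"output": out, "parent_record_id": row["metadata"]["record_id"]}
    for row, out in zip(batch, outputs)
]

print(buggy[1]["parent_record_id"])     # 1 — wrong parent for the second output
print(expected[1]["parent_record_id"])  # 2 — correct parent
```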
Desktop (please complete the following information):
Package version: 1.2.1
Python version: 3.10.13