Enable valid OpenAI response_format specification #1069

Open
wants to merge 1 commit into develop
Conversation

liamcripwell

The current version of the OpenAI API expects response_format to be specified as an object containing a "type" attribute, e.g. {"type": "<type>"}. However, distilabel enforces a string representation, which leads to either an error or a silent failure.

For example, when using a TextGeneration task under the existing codebase:

text_gen = TextGeneration(
    llm=OpenAILLM(
        model="gpt-4o",
        generation_kwargs={
            "response_format": "json"
        },
    )
)
text_gen.load()

output = next(
    text_gen.process(
        [{"instruction": "Convert this info to a JSON: John Smith is 30 years old."}]
    )
)

The OpenAI API will fail and yield BadRequestError: Error code: 400 - {'error': {'message': "Invalid type for 'response_format': expected an object, but got a string instead.", 'type': 'invalid_request_error', 'param': 'response_format', 'code': 'invalid_type'}}.
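For clarity, the rejected and accepted request shapes differ only in the type of the response_format field. The payloads below are illustrative sketches, not actual distilabel internals:

```python
# What the current codebase sends: response_format as a bare string.
# The API rejects this with the 400 invalid_type error shown above.
rejected_payload = {"model": "gpt-4o", "response_format": "json"}

# What the OpenAI API expects: an object containing a "type" attribute.
accepted_payload = {"model": "gpt-4o", "response_format": {"type": "json_object"}}

print(type(rejected_payload["response_format"]).__name__)  # str
print(type(accepted_payload["response_format"]).__name__)  # dict
```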

The same happens when directly calling generation from the LLM:

llm = OpenAILLM(
    model="gpt-4o",
)

llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

Presumably the same happens for requests to the batch API, which ultimately leads to AssertionError: No output file ID was found in the batch.

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=2,  # poll for results every 2 seconds
)

llm.load()
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

This PR simply wraps the string representation of the specified response_format inside the object expected by OpenAI.
I have also added the same value checking that is done in agenerate() to offline_batch_generate().

@plaguss
Contributor

plaguss commented Nov 25, 2024

Hi @liamcripwell, thanks for the PR! This bug was found and has already been fixed in develop (the agenerate method was updated). The next release will include this fix.

@liamcripwell
Author

Hi @plaguss, great to hear it's already been fixed. Sorry, I didn't notice this change in develop.

However, I still think the docstring for agenerate should be updated further, because it still says that response_format must be either "text" or "json". This is no longer true: the method now only accepts a dictionary and will fail pydantic validation if a string is provided.
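The docstring/behaviour mismatch can be illustrated with a minimal validation sketch (plain Python with hypothetical names, not distilabel's actual pydantic model):

```python
def validate_response_format(response_format):
    """Mimic the stricter behaviour: only a dict such as
    {"type": "json_object"} is accepted, so the bare strings "text"
    and "json" that the docstring still mentions now fail.

    Hypothetical sketch, not distilabel's actual validator.
    """
    if not isinstance(response_format, dict) or "type" not in response_format:
        raise TypeError(
            "response_format must be a dict with a 'type' key, "
            f"got {response_format!r}"
        )
    return response_format

validate_response_format({"type": "json_object"})  # accepted
try:
    validate_response_format("json")  # what the docstring still suggests
except TypeError as exc:
    print(exc)
```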
