Fix structured data generation for newer gpt-4o models. #61

Closed
jemc opened this issue Dec 6, 2024 · 0 comments
Labels
bug (Something isn't working) · enhancement (New feature or request)

Comments

@jemc (Collaborator) commented Dec 6, 2024

It's been observed that while gpt-4o-2024-05-13 is quite reliable at forced tool calling under Kurt, newer gpt-4o snapshots are not at all reliable at it: they often respond with natural language even when tool calling is supposed to be forced.
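
For reference, here's a rough sketch of what "forced" tool calling looks like at the OpenAI API level, which is roughly the mechanism Kurt leans on for structured data generation today. The tool name and schema are illustrative, not Kurt's actual internals:

```ts
import OpenAI from "openai"

const openai = new OpenAI()

const response = await openai.chat.completions.create({
  model: "gpt-4o-2024-05-13", // reliable here; newer snapshots often are not
  messages: [{ role: "user", content: "Give me a city and its population." }],
  tools: [
    {
      type: "function",
      function: {
        name: "structured_data",
        description: "Return the answer as structured data.",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
            population: { type: "number" },
          },
          required: ["city", "population"],
        },
      },
    },
  ],
  // "Force" the model to call this particular tool rather than answer in prose.
  tool_choice: { type: "function", function: { name: "structured_data" } },
})

// On newer gpt-4o snapshots, `tool_calls` sometimes comes back empty and the
// answer shows up as natural language in `content` instead.
const call = response.choices[0].message.tool_calls?.[0]
console.log(call ? JSON.parse(call.function.arguments) : response.choices[0].message.content)
```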

My hypothesis is that OpenAI somehow weakened the existing forced tool calling when they added constrained token sampling (a stronger feature), which is available only in the newer snapshots.

Unfortunately, the set of JSON Schemas allowed for constrained token sampling is smaller than the set allowed for tools on gpt-4o-2024-05-13, so this is in some sense a regression and in another sense a leap forward (constrained token sampling is a stronger guarantee).
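
For comparison, here's what the constrained token sampling path looks like at the API level: the same kind of tool definition, but with `strict: true` set on the function. The extra schema requirements noted in the comments (every property listed in `required`, `additionalProperties: false`, and a reduced set of supported JSON Schema keywords) are the limitations mentioned above. Again, the tool name and schema are illustrative:

```ts
const strictTool = {
  type: "function" as const,
  function: {
    name: "structured_data",
    strict: true, // opt into constrained token sampling ("structured outputs")
    parameters: {
      type: "object",
      properties: {
        city: { type: "string" },
        population: { type: "number" },
      },
      // Strict mode requires every property to be listed as required...
      required: ["city", "population"],
      // ...and requires additionalProperties to be explicitly false.
      additionalProperties: false,
    },
  },
}
```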

We need to find a suitable approach for dealing with this. The solution probably has three parts:

  • make the new constrained token sampling mode available as a new option in KurtSamplingOptions (sketched after this list), letting applications opt into this stronger guarantee while accepting the resulting JSON Schema limitations
  • fiddle with other new API options to try to make the current forced tool calling mode more reliable on newer snapshots (ideally, at least as reliable as it is/was on the older gpt-4o-2024-05-13 snapshot)
  • update the set of known models to include the newer snapshots, making it easier to distinguish these in testing
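
For the first bullet, here's a hypothetical sketch of what the opt-in sampling option could look like. The option name, doc comment, and surrounding fields are placeholders for discussion, not a final API:

```ts
interface KurtSamplingOptions {
  // Existing sampling knobs (placeholders for whatever Kurt already exposes).
  maxOutputTokens?: number
  temperature?: number
  topP?: number

  /**
   * When true (and the underlying model supports it), request
   * schema-constrained token sampling (e.g. OpenAI's `strict: true`)
   * instead of relying on forced tool calling alone. This gives a stronger
   * guarantee of well-formed output, but only a subset of JSON Schema is
   * accepted by the API in this mode.
   */
  forceSchemaConstrainedTokens?: boolean
}
```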

This also highlights the need for a capability eval suite, as described in #28, which would have caught this problem sooner and could be used to make conclusive empirical statements about this kind of regression.

@jemc jemc added the bug Something isn't working label Dec 6, 2024
@jemc jemc changed the title Fix structed data generation for newer gpt-4o models. Fix structureed data generation for newer gpt-4o models. Dec 6, 2024
@jemc jemc changed the title Fix structureed data generation for newer gpt-4o models. Fix structured data generation for newer gpt-4o models. Dec 6, 2024
@jemc jemc added the enhancement New feature or request label Dec 6, 2024
jemc added a commit that referenced this issue Dec 6, 2024
This maps to the new-ish `strict: true` feature of OpenAI which
enables constrained token sampling, but has certain caveats that
make it undesirable to turn on by default.

See issue #61 for more info.
@jemc jemc closed this as completed Dec 9, 2024