It's been observed that while `gpt-4o-2024-05-13` is quite reliable at forced tool calling under Kurt, newer snapshots of `gpt-4o` are not reliable at all: they often respond with natural language even when tool calling is supposed to be forced.

My hypothesis is that OpenAI somehow weakened the existing forced tool calling when they added constrained token sampling (a stronger feature), which is available only in the newer snapshots.

Unfortunately, the set of JSON Schemas allowed for constrained token sampling is smaller than the set allowed for tools on `gpt-4o-2024-05-13`, so this is in one sense a regression, while in another sense a leap forward (constrained token sampling is a stronger guarantee).
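To make the schema gap concrete: OpenAI's constrained-sampling mode requires (among other restrictions) that every property be listed in `required` and that `additionalProperties` be explicitly `false`, with optionality emulated via a `null` union type. The helper below is a hypothetical sketch (not part of Kurt's API) checking just those two constraints:

```typescript
// Hypothetical helper illustrating two documented constraints of OpenAI's
// constrained token sampling mode: every property must appear in `required`,
// and `additionalProperties` must be `false`. Sketch only, not Kurt's API.
type SchemaObject = {
  type: "object"
  properties: Record<string, unknown>
  required?: string[]
  additionalProperties?: boolean
}

function meetsStrictBasics(schema: SchemaObject): boolean {
  const keys = Object.keys(schema.properties)
  const required = new Set(schema.required ?? [])
  return (
    schema.additionalProperties === false &&
    keys.every((key) => required.has(key))
  )
}

// Accepted by ordinary tool calling on gpt-4o-2024-05-13, but rejected by
// constrained sampling (optional property, open additionalProperties):
const looseSchema: SchemaObject = {
  type: "object",
  properties: { name: { type: "string" }, nickname: { type: "string" } },
  required: ["name"],
}

// The constrained-sampling-compatible equivalent, with optionality
// expressed as a nullable required property:
const strictSchema: SchemaObject = {
  type: "object",
  properties: { name: { type: "string" }, nickname: { type: ["string", "null"] } },
  required: ["name", "nickname"],
  additionalProperties: false,
}

console.log(meetsStrictBasics(looseSchema)) // false
console.log(meetsStrictBasics(strictSchema)) // true
```

Any schema Kurt accepts today that relies on truly optional properties would need a rewrite like the one above before it could opt into the stronger guarantee.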
We need to find a suitable approach for dealing with this. There are three parts to this solution, probably:

- make the new constrained token sampling mode available as a new option in `KurtSamplingOptions`, to let applications opt into this stronger guarantee, while accepting the resulting JSON Schema limitations
- fiddle with other new API options to try to make the current forced tool calling mode more reliable on newer snapshots (ideally, at least as reliable as it is/was on the older `gpt-4o-2024-05-13` snapshot)
- update the set of known models to include the newer snapshots, making it easier to distinguish these in testing
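For the first part, a sketch of what the opt-in might look like. The option name and adapter shape here are assumptions for discussion, not Kurt's actual API; the idea is simply a `KurtSamplingOptions` flag that maps onto OpenAI's `strict: true` on the function definition:

```typescript
// Sketch only: the flag name and adapter shape are hypothetical.
interface KurtSamplingOptionsSketch {
  // When true, request constrained token sampling (OpenAI `strict: true`),
  // accepting the narrower set of supported JSON Schemas.
  forceSchemaConstrainedTokens?: boolean
}

// Build the function-tool entry an OpenAI adapter would send, setting
// `strict` only when the application opted in.
function toOpenAITool(
  name: string,
  parameters: object,
  options: KurtSamplingOptionsSketch
): { type: "function"; function: { name: string; parameters: object; strict?: boolean } } {
  return {
    type: "function",
    function: {
      name,
      parameters,
      ...(options.forceSchemaConstrainedTokens ? { strict: true } : {}),
    },
  }
}

const tool = toOpenAITool(
  "structured_data",
  { type: "object", properties: {}, required: [], additionalProperties: false },
  { forceSchemaConstrainedTokens: true }
)
console.log(tool.function.strict) // true
```

Keeping the flag off by default preserves current behavior for applications whose schemas fall outside the constrained-sampling subset.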
This also highlights the need for a capability eval suite, as described in #28, which would have caught this problem faster and could be used to make conclusive empirical statements about this kind of regression.
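The kind of measurement such a suite would make can be sketched in a few lines. The model call below is stubbed; a real harness would drive each `gpt-4o` snapshot through Kurt and compare the resulting reliability numbers:

```typescript
// Minimal sketch of a forced-tool-calling reliability check, in the spirit
// of the capability eval suite proposed in #28. The model is stubbed here.
type ModelResponse = { kind: "tool_call" } | { kind: "text" }
type ModelFn = () => ModelResponse

// Run `trials` forced-tool-call requests and report the fraction that
// actually came back as tool calls.
function measureForcedToolCallReliability(model: ModelFn, trials: number): number {
  let toolCalls = 0
  for (let i = 0; i < trials; i++) {
    if (model().kind === "tool_call") toolCalls++
  }
  return toolCalls / trials
}

// Stub standing in for a perfectly compliant snapshot:
const alwaysCompliant: ModelFn = () => ({ kind: "tool_call" })

// Stub standing in for a snapshot that sometimes answers in prose:
let calls = 0
const flaky: ModelFn = () =>
  calls++ % 4 === 0 ? { kind: "text" } : { kind: "tool_call" }

console.log(measureForcedToolCallReliability(alwaysCompliant, 20)) // 1
console.log(measureForcedToolCallReliability(flaky, 20)) // 0.75
```

With enough trials per snapshot, a harness like this would turn "not at all reliable" into a concrete per-snapshot number.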
This maps to the new-ish `strict: true` feature of OpenAI which
enables constrained token sampling, but has certain caveats that
make it undesirable to turn on by default.
See issue #61 for more info.