Capability eval suite #28

Open
jemc opened this issue May 10, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

jemc commented May 10, 2024

In our ad hoc testing with KurtOpenAI and KurtVertexAI, we have seen problems like:

  • Vertex AI sometimes failing to generate valid data, even when supposedly being forced to by the relevant API parameters
  • Vertex AI sometimes returning 500 errors

We want to formalize this kind of testing for any LLM provider, so that we can share empirically validated findings about the relative capabilities of different LLM providers on the features that matter to Kurt users.

I envision:

  • a script that can be used to generate an output report for a given LLM provider / model (see the sketch after this list)
  • a way of storing those results in a repository (either this one, or maybe a separate one dedicated to this work)
  • a nicely readable way (markdown? HTML?) of seeing the current summary of capabilities across LLM providers / models
  • a blog post showing our findings across the "big three" LLMs (GPT, Gemini, Claude)
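
To make the first bullet concrete, here's a minimal sketch of what the report script's core loop might look like. The `EvalCase` / `EvalResult` shapes and the `runEvalSuite` helper are hypothetical names invented for illustration, and the `generateStructuredData` call assumes Kurt's structured data generation API:

```ts
import { z } from "zod"
import { Kurt } from "@formula-monks/kurt"

// Hypothetical shape of a single eval case: a prompt and schema to send,
// plus a predicate deciding whether the generated data counts as a pass.
interface EvalCase {
  name: string
  prompt: string
  schema: z.ZodObject<z.ZodRawShape>
  check: (data: unknown) => boolean
}

// Hypothetical shape of one line in the output report.
interface EvalResult {
  name: string
  passed: boolean
  error?: string
}

// Run every case against a Kurt instance (wrapping any adapter, e.g.
// KurtVertexAI or KurtOpenAI) and collect pass/fail results for the report.
async function runEvalSuite(
  kurt: Kurt,
  cases: EvalCase[]
): Promise<EvalResult[]> {
  const results: EvalResult[] = []
  for (const evalCase of cases) {
    try {
      const stream = kurt.generateStructuredData({
        prompt: evalCase.prompt,
        schema: evalCase.schema,
      })
      const { data } = await stream.result
      results.push({ name: evalCase.name, passed: evalCase.check(data) })
    } catch (error) {
      // Provider failures (invalid data, 500 errors, etc.) count as failed
      // cases, but are recorded rather than aborting the whole suite.
      results.push({ name: evalCase.name, passed: false, error: String(error) })
    }
  }
  return results
}
```

Serializing the resulting `EvalResult[]` to JSON (or rendering it to markdown) would then cover the storage and summary bullets.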
jemc added the enhancement label on May 10, 2024
jemc commented May 29, 2024

I've noticed some changes in Vertex AI behavior today - I haven't tested extensively, but it seems more reliable than before.

This is exactly the kind of situation where a capability eval suite would be helpful: I could run it to comprehensively re-test all the situations where we've found limitations before.

jemc commented Aug 1, 2024

Another Vertex AI issue to add to the eval suite: it seems incapable of generating an apostrophe character inside a structured data string field - likely because single-quoted strings are used under the hood, and the model hasn't been trained to generate an escaped apostrophe character.

Currently, as soon as it encounters an apostrophe in the text it's generating for such a field, Gemini ends the string instead of continuing with the rest of the text.
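
A capability check for this could be as simple as forcing the model to reproduce a string containing an apostrophe and verifying it wasn't truncated. A minimal sketch, reusing the hypothetical `EvalCase` shape from the first comment:

```ts
import { z } from "zod"

// Hypothetical eval case targeting the apostrophe bug described above:
// force the model to reproduce a fixed phrase containing an apostrophe,
// then check that the apostrophe survived instead of ending the string.
const apostropheCase = {
  name: "apostrophe-in-structured-string-field",
  prompt: 'Repeat back exactly this sentence: "It\'s a test."',
  schema: z
    .object({
      sentence: z.string().describe("The repeated sentence, verbatim"),
    })
    .describe("Repeat the sentence"),
  check: (data: unknown) => {
    const sentence = (data as { sentence?: string })?.sentence
    return typeof sentence === "string" && sentence.includes("'")
  },
}
```

On a provider with this bug, `sentence` would come back cut off at the apostrophe (something like `It`), failing the check.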
