Evaluation and Testing
The aim of InferGPT is to provide long-term memory and personalised interactions with users. How will we know when we have achieved this? We will need evaluation experiments and clear testing benchmarks showing that InferGPT does this more effectively than a default LLM chatbot.
We can evaluate whether the system effectively retains long-term knowledge of the user's profile through processes like those below (ideas, WIP - 10/4/2024):
- Define a series of dummy user personas.
- Generate synthetic history for those users and interactions.
- Initiate a new chat as dummy user 1 and evaluate fact retrieval out of context (RAG-based metrics can be repurposed for this), i.e. ask questions and evaluate the accuracy/precision of the answers.
- Repeat across all users, and augment the datasets by perturbing histories.
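As a rough illustration of the fact-retrieval step above, the sketch below scores a chat function against ground-truth facts taken from a synthetic history. Everything named here (`FactCheck`, `retrieval_accuracy`, `fake_bot`, the sample persona facts) is a hypothetical stand-in, not part of InferGPT, and substring matching is only one simple way to judge an answer; a RAG-style metric could replace it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FactCheck:
    """One question plus the fact the synthetic history says is true."""
    question: str
    expected: str


def normalise(text: str) -> str:
    # Lower-case and collapse whitespace so matching is less brittle.
    return " ".join(text.lower().split())


def retrieval_accuracy(ask: Callable[[str], str], checks: list[FactCheck]) -> float:
    """Fraction of questions whose answer contains the expected fact."""
    if not checks:
        return 0.0
    hits = sum(normalise(c.expected) in normalise(ask(c.question)) for c in checks)
    return hits / len(checks)


# Hypothetical stand-in for a fresh chat session opened as dummy user 1.
def fake_bot(question: str) -> str:
    canned = {
        "Where do I live?": "You live in Leeds.",
        "What is my job?": "I'm not sure.",
    }
    return canned.get(question, "I don't know.")


# Assumed facts for an invented persona (lives in Leeds, works as a nurse).
checks = [
    FactCheck("Where do I live?", "Leeds"),
    FactCheck("What is my job?", "nurse"),
]
score = retrieval_accuracy(fake_bot, checks)
print(f"retrieval accuracy: {score:.2f}")
```

Swapping `fake_bot` for a function that calls the real system, and running the same checks against a default LLM chatbot, would give a like-for-like comparison per persona.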
A second process can test how the system handles input that contradicts what it has stored:
- Define a series of dummy user personas.
- Generate synthetic history for those users and interactions.
- Initiate a new chat as dummy user 1 and submit adversarial examples that run counter to the stored history/wider context, i.e. ask questions or make statements that are not correct for this user and evaluate whether the bot responds appropriately (what counts as "appropriately"?).
- Repeat across all users, and augment the datasets by perturbing histories.
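The adversarial steps above could be prototyped along the following lines. Because "appropriate" is still an open question, this sketch assumes one possible working definition: a response counts as "corrected" if it mentions the stored fact, "accepted" if it repeats the false claim without the stored fact, and "unclear" otherwise. The names `AdversarialCase`, `classify`, `run_adversarial` and `fake_bot`, and the sample cases, are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class AdversarialCase:
    prompt: str        # statement or question that contradicts the history
    stored_value: str  # what the synthetic history actually says
    false_value: str   # the incorrect value embedded in the prompt


def classify(response: str, case: AdversarialCase) -> str:
    """Label a response under the assumed working definition."""
    text = response.lower()
    if case.stored_value.lower() in text:
        return "corrected"
    if case.false_value.lower() in text:
        return "accepted"
    return "unclear"


def run_adversarial(ask: Callable[[str], str], cases: list[AdversarialCase]) -> Counter:
    """Tally how the bot handled each contradictory prompt."""
    return Counter(classify(ask(c.prompt), c) for c in cases)


# Hypothetical stand-in for a chat session opened as dummy user 1.
def fake_bot(prompt: str) -> str:
    canned = {
        "As you know, I live in Paris.": "My notes say you live in Leeds. Has that changed?",
        "When does my pilot shift start?": "Your pilot shift starts at 9am.",
    }
    return canned.get(prompt, "Could you tell me more?")


cases = [
    AdversarialCase("As you know, I live in Paris.", "Leeds", "Paris"),
    AdversarialCase("When does my pilot shift start?", "nurse", "pilot"),
]
results = run_adversarial(fake_bot, cases)
print(dict(results))
```

Keyword matching is crude; once a definition of appropriate behaviour is agreed, `classify` is the single place to swap in a stricter judge.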
Evaluating Inference Capability (EIC)
We can evaluate how effective InferGPT is at guessing the next best question and how many questions it takes to come to a confident conclusion (ideas, WIP - 11/4/2024):
- Define a dummy user persona
- Define a "correct" journey and quantity of questions
- Generate synthetic history for that user and their interactions, with key pieces of information about that person
- Initiate a new chat as dummy user 1 and ask for assistance with a task or goal. Count how many questions it takes for the AI to correctly complete the task or suggest next steps.
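The question-counting step above might be measured with a small dialogue loop like the one below. It assumes a hypothetical agent interface that returns either a clarifying question or a final answer on each turn; `count_questions`, `scripted_agent`, `scripted_user` and the travel scenario are invented for illustration and do not reflect InferGPT's actual API.

```python
from typing import Callable, Optional


def count_questions(
    agent_step: Callable[[list[str]], tuple[str, str]],
    user_reply: Callable[[str], str],
    goal: str,
    max_turns: int = 10,
) -> tuple[int, Optional[str]]:
    """Run a dialogue and count clarifying questions before the final answer.

    agent_step returns ("question", text) or ("answer", text).
    Returns (questions_asked, final_answer), or None as the answer on timeout.
    """
    history = [goal]
    questions = 0
    for _ in range(max_turns):
        kind, text = agent_step(history)
        if kind == "answer":
            return questions, text
        questions += 1
        history.append(text)
        history.append(user_reply(text))
    return questions, None


# Hypothetical agent that asks two questions, then answers.
SCRIPT = [
    ("question", "What is your budget?"),
    ("question", "When do you want to travel?"),
    ("answer", "Book the July train to Edinburgh."),
]


def scripted_agent(history: list[str]) -> tuple[str, str]:
    asked = (len(history) - 1) // 2  # each question adds two history entries
    return SCRIPT[min(asked, len(SCRIPT) - 1)]


# Hypothetical dummy user replying from the synthetic persona.
def scripted_user(question: str) -> str:
    return {"What is your budget?": "About 100 pounds."}.get(question, "In July.")


expected_questions = 2  # the "correct" journey length defined for this persona
questions, answer = count_questions(scripted_agent, scripted_user, "Help me plan a trip.")
excess = questions - expected_questions
print(questions, answer, excess)
```

Comparing `questions` with the predefined "correct" quantity gives a simple efficiency score; whether the final answer is itself correct would still need a separate check.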
We may also want to evaluate the general agentic behaviour of the solution ... leverage AutoGen and GAIA benchmarks ... TBC.