
Evaluation and Testing

The aim of InferGPT is to provide long-term memory and personalised interactions with users. How will we know when we have achieved this? We will need evaluation experiments and clear testing benchmarks showing that this solution does this more effectively than a default LLM chatbot.

Evaluating Long-term Memory (LTM)

We can evaluate whether or not the system is effectively retaining long-term knowledge of the user's profile through processes like those below (ideas, WIP - 10/4/2024).

Evaluation LTM1 - GRAG precision

  1. Define a series of dummy user personas.
  2. Generate synthetic histories and interactions for those users.
  3. Initiate a new chat as dummy user 1 and evaluate fact retrieval out of context (RAG-based metrics can be repurposed for this), i.e. ask questions and evaluate the accuracy/precision of the answers (see the sketch after this list).
  4. Repeat across all users, augmenting the datasets by perturbing the histories.
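As a rough illustration of LTM1, a minimal harness might look like the sketch below. Everything here is hypothetical: `ask` stands in for whatever interface sends a chat message to InferGPT as a given dummy user, and the naive substring check is a placeholder for repurposed RAG metrics or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Persona:
    user_id: str
    facts: dict[str, str]  # e.g. {"home_city": "Leeds", "goal": "buy a house"}

def fact_recall_precision(
    persona: Persona,
    questions: dict[str, str],
    ask: Callable[[str, str], str],
) -> float:
    """Ask one question per stored fact and score the answers.

    `questions` maps a fact key to the question that should surface it,
    e.g. {"home_city": "Where do I live?"}. The substring check is the
    crudest possible scorer; a real harness would use RAG-style metrics.
    """
    if not questions:
        return 0.0
    hits = 0
    for fact_key, question in questions.items():
        answer = ask(persona.user_id, question)
        if persona.facts[fact_key].lower() in answer.lower():
            hits += 1
    return hits / len(questions)

# Toy usage with a canned responder standing in for InferGPT.
persona = Persona("user-1", {"home_city": "Leeds"})
score = fact_recall_precision(
    persona,
    {"home_city": "Where do I live?"},
    ask=lambda user_id, question: "You live in Leeds.",
)
print(score)  # -> 1.0
```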

Evaluation LTM2 - Failure resistance?

  1. Define a series of dummy user personas.
  2. Generate synthetic histories and interactions for those users.
  3. Initiate a new chat as dummy user 1 and submit adversarial examples that run counter to the stored history/wider context, i.e. ask questions or make statements that are not correct for this user and evaluate whether the bot responded appropriately (defining "appropriately" is itself an open question; see the sketch after this list).
  4. Repeat across all users, augmenting the datasets by perturbing the histories.
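A minimal sketch of LTM2 scoring, again with hypothetical interfaces: `ask` sends one chat turn as the dummy user, and `judge` is a stand-in for whatever we decide "responded appropriately" means (most likely an LLM grader that checks the reply does not accept the contradiction).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AdversarialCase:
    statement: str  # claim that contradicts the stored history
    true_fact: str  # what the synthetic history actually says

def failure_resistance_rate(
    user_id: str,
    cases: list[AdversarialCase],
    ask: Callable[[str, str], str],
    judge: Callable[[str, AdversarialCase], bool],
) -> float:
    """Fraction of adversarial statements the bot handled appropriately.

    `judge(reply, case)` should return True when the reply pushes back on
    (or at least does not accept) the false statement. Pinning down that
    judgement is the open question flagged in step 3 above.
    """
    if not cases:
        return 0.0
    handled = sum(judge(ask(user_id, case.statement), case) for case in cases)
    return handled / len(cases)
```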

Evaluating Inference Capability (EIC)

We can evaluate how effective InferGPT is at guessing the next best question and how many questions it takes to reach a confident conclusion (ideas, WIP - 11/4/2024).

Evaluation EIC - inference intelligence?

  1. Define a dummy user persona.
  2. Define a "correct" journey and question count.
  3. Generate synthetic history and interactions for that user, seeded with key pieces of information about the person.
  4. Initiate a new chat as dummy user 1 and ask for assistance with a task or goal. Count how many questions it takes for the AI to correctly complete the task or suggest next steps (see the sketch after this list).
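A sketch of how the question count might be measured, with hypothetical interfaces: `agent_turn` drives InferGPT one turn at a time, and `user_reply` simulates the dummy persona answering each clarifying question.

```python
from typing import Callable, Optional

def questions_to_completion(
    goal: str,
    agent_turn: Callable[[list[tuple[str, str]]], tuple[bool, str]],
    user_reply: Callable[[str], str],
    max_turns: int = 20,
) -> Optional[int]:
    """Count the clarifying questions asked before the task is completed.

    `agent_turn(transcript)` returns (done, text): text is a follow-up
    question while done is False, and the final answer/next steps once
    done is True. Returns None if the agent never finishes within
    `max_turns`; otherwise the count can be compared against the question
    budget of the "correct" journey defined in step 2.
    """
    transcript: list[tuple[str, str]] = [("user", goal)]
    questions = 0
    for _ in range(max_turns):
        done, text = agent_turn(transcript)
        transcript.append(("agent", text))
        if done:
            return questions
        questions += 1
        transcript.append(("user", user_reply(text)))
    return None
```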

General Agentic Evaluation

We may also want to evaluate the general agentic behaviour of the solution, for example by leveraging AutoGen and the GAIA benchmark. TBC.
