Evaluation and Testing

The aim of InferGPT is to provide long-term memory and personalised interactions with users. How will we know when we have achieved this? We will need evaluation experiments and clear testing benchmarks that show this solution does so more effectively than a default LLM chatbot.

Evaluating Long-term Memory (LTM)

We can evaluate whether the system is effectively retaining long-term knowledge of the user's profile through processes like those below (ideas, WIP - 10/4/2024).

Evaluation LTM1 - GRAG precision

  1. Define a series of dummy user personas.
  2. Generate synthetic histories and interactions for those users.
  3. Initiate a new chat as dummy user 1 and evaluate fact retrieval out of context (RAG-based metrics can be repurposed for this) - i.e. ask questions and evaluate the accuracy/precision of the answers.
  4. Repeat across all users, augmenting the datasets by perturbing the histories (a sketch of a possible harness follows below).
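To make this concrete, below is a minimal sketch of what an LTM1 harness could look like. It is not project code: the `chat(user_id, message)` client, the `answers_match` check and the persona/fact data are all illustrative placeholders, and the naive string match would likely be replaced with proper RAG-style metrics or an LLM-as-judge.

```python
"""Sketch of Evaluation LTM1: fact-retrieval precision against stored user history.

Assumptions (not real InferGPT APIs): `chat(user_id, message)` returns the bot's
reply as a string, and `answers_match(expected, actual)` decides whether the
reply contains the expected fact (naive string match here; an LLM judge or
RAG-style metric could be swapped in).
"""

from dataclasses import dataclass


@dataclass
class FactProbe:
    question: str          # question asked in a fresh chat session
    expected_answer: str   # ground-truth fact from the synthetic history


# Dummy personas with synthetic facts that should already be in long-term memory.
PERSONAS = {
    "dummy_user_1": [
        FactProbe("What city do I live in?", "Bristol"),
        FactProbe("What is my job?", "nurse"),
    ],
    "dummy_user_2": [
        FactProbe("What is my main savings goal?", "house deposit"),
    ],
}


def chat(user_id: str, message: str) -> str:
    """Placeholder for the InferGPT chat endpoint; replace with a real client call."""
    return "(stub reply - wire this up to a running InferGPT instance)"


def answers_match(expected: str, actual: str) -> bool:
    """Naive check: the expected fact appears verbatim in the reply."""
    return expected.lower() in actual.lower()


def run_ltm1() -> float:
    """Ask each probe question in a new chat and report overall precision."""
    correct, total = 0, 0
    for user_id, probes in PERSONAS.items():
        for probe in probes:
            reply = chat(user_id, probe.question)
            correct += int(answers_match(probe.expected_answer, reply))
            total += 1
    precision = correct / total
    print(f"LTM1 fact-retrieval precision: {precision:.2%} ({correct}/{total})")
    return precision


if __name__ == "__main__":
    run_ltm1()
```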

Evaluation LTM2 - Failure resistance?

  1. Define a series of dummy user personas.
  2. Generate synthetic histories and interactions for those users.
  3. Initiate a new chat as dummy user 1 and submit adversarial examples that run counter to the stored history/wider context - i.e. ask questions or make statements that are not correct for this user and evaluate whether the bot responds appropriately (what counts as "appropriate" still needs defining).
  4. Repeat across all users, augmenting the datasets by perturbing the histories (a sketch of a possible harness follows below).
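A possible shape for an LTM2 harness is sketched below, under the same assumptions as LTM1 (a hypothetical `chat(user_id, message)` client and illustrative data). "Appropriate" is crudely operationalised here as the bot challenging or flagging the contradiction; human review or an LLM-as-judge step would be more robust.

```python
"""Sketch of Evaluation LTM2: resistance to statements that contradict stored history.

Assumptions (not real InferGPT APIs): `chat(user_id, message)` returns the bot's
reply. "Appropriate" is crudely operationalised as the bot challenging or
flagging the contradiction, detected via a keyword heuristic.
"""

# Adversarial statements paired with the stored fact they contradict.
ADVERSARIAL_CASES = {
    "dummy_user_1": [
        ("I live in Paris, remember?", "stored history says the user lives in Bristol"),
        ("As a software engineer, I work long hours.", "stored history says the user is a nurse"),
    ],
}

# Phrases that suggest the bot noticed the inconsistency (very rough heuristic).
CHALLENGE_MARKERS = ("previously", "you mentioned", "my records", "are you sure", "i thought")


def chat(user_id: str, message: str) -> str:
    """Placeholder for the InferGPT chat endpoint; replace with a real client call."""
    return "I thought you previously mentioned living in Bristol - has that changed?"


def responded_appropriately(reply: str) -> bool:
    """Heuristic: did the reply reference the stored history or query the contradiction?"""
    lowered = reply.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def run_ltm2() -> float:
    """Submit each adversarial statement in a new chat and score the responses."""
    passed, total = 0, 0
    for user_id, cases in ADVERSARIAL_CASES.items():
        for statement, ground_truth in cases:
            reply = chat(user_id, statement)
            ok = responded_appropriately(reply)
            print(f"[{user_id}] {statement!r} -> {'PASS' if ok else 'FAIL'} ({ground_truth})")
            passed += int(ok)
            total += 1
    score = passed / total
    print(f"LTM2 failure-resistance score: {score:.2%}")
    return score


if __name__ == "__main__":
    run_ltm2()
```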

Evaluating Inference Capability (EIC)

We can evaluate how effective InferGPT is at guessing the next best question and how many questions it takes to come to a confident conclusion (ideas, WIP - 11/4/2024).

EIC1 - inference intelligence?

  1. Define a set of dummy user personas.
  2. Define a "correct" journey and an expected number of questions for each user.
  3. Generate synthetic histories and interactions for those users - with key pieces of information about that person missing.
  4. Initiate a new chat as dummy user 1 and ask for assistance with a task or goal. Count how many questions it takes for the AI to gather the key information required to correctly complete the task or suggest next steps (a sketch of a possible harness follows below).
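A rough sketch of an EIC1 harness is below. The `chat(user_id, message)` client, the scripted dummy-user replies and the task/fact data are illustrative assumptions, not project APIs; the core of the evaluation is simply counting bot turns until all of the missing key facts have been asked for, compared against the expected question budget from step 2.

```python
"""Sketch of Evaluation EIC1: count the questions needed to gather key information.

Assumptions (not real InferGPT APIs): a `chat(user_id, message)` client, a
scripted "dummy user" that only reveals a key fact when asked about it, and a
per-persona budget of expected questions.
"""

# For each persona: the task they ask for help with, the key facts the bot must
# elicit (missing from its synthetic history), and the expected question budget.
EIC_CASES = {
    "dummy_user_1": {
        "task": "Help me plan how to save for a house deposit.",
        "key_facts": {"income": "3200 per month", "target_deposit": "30000"},
        "expected_max_questions": 4,
    },
}

MAX_TURNS = 10  # give up after this many bot turns


def chat(user_id: str, message: str) -> str:
    """Placeholder for the InferGPT chat endpoint; replace with a real client call."""
    return "Could you tell me your monthly income and your target deposit?"


def dummy_user_reply(key_facts: dict[str, str], bot_message: str) -> str:
    """Scripted user: reveal a key fact only if the bot's question mentions its name."""
    revealed = [
        f"My {name.replace('_', ' ')} is {value}."
        for name, value in key_facts.items()
        if name.replace("_", " ") in bot_message.lower()
    ]
    return " ".join(revealed) or "I'm not sure, what do you need to know?"


def count_questions(user_id: str, case: dict) -> int:
    """Count bot turns until the dummy user has been asked for every key fact."""
    asked: set[str] = set()
    message = case["task"]
    for turn in range(1, MAX_TURNS + 1):
        bot_question = chat(user_id, message)
        asked |= {
            name for name in case["key_facts"]
            if name.replace("_", " ") in bot_question.lower()
        }
        if asked == set(case["key_facts"]):
            return turn
        message = dummy_user_reply(case["key_facts"], bot_question)
    return MAX_TURNS


def run_eic1() -> None:
    for user_id, case in EIC_CASES.items():
        n = count_questions(user_id, case)
        verdict = "within budget" if n <= case["expected_max_questions"] else "over budget"
        print(f"[{user_id}] gathered key info in {n} question(s) - {verdict}")


if __name__ == "__main__":
    run_eic1()
```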

Evaluating General Agentic Behaviour (GAB)

Evaluation GAB1

We may also want to evaluate the general agentic behaviour of the solution, potentially leveraging AutoGen and the GAIA benchmark. TBC.

In the Stanford Generative Agents paper, the agents deployed in the game-style simulation are evaluated in two ways:

  1. "In the technical evaluation, we leverage a methodological opportunity to evaluate an agent’s knowledge and behavior by “interviewing” it in natural language to probe the agents’ ability to stay in character, remember, plan, react, and reflect accurately."

    a. It is mentioned in the paper that the dependent variable in this type of evaluation is the believability of the agent's behaviour.

    b. This evaluation is performed over 5 different categories: "maintaining self-knowledge, retrieving memory, generating plans, reacting and reflecting". Each category has five questions designed to provide insight as to the agent's believability in each of these categories. The full list is given in Appendix B of the paper.

    c. The questions used in this evaluation are focused on the agent's retention of long-term context in general, but for InferGPT we are more interested in the agent's retained context about the user. This should be factored into any experiment design.

  2. "...an end-to-end evaluation where the agents interacted with each other in open-ended ways over two days of game time to understand their stability and emergent social behaviors."

For our purposes, the first of these methods is the more applicable; we will call this Evaluation GAB1. A sketch of how it could be adapted is below.
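This is a minimal sketch only. The question bank is illustrative (reoriented towards retained user context rather than the paper's Appendix B questions, which would be the starting point in practice), the `chat(user_id, message)` client is a placeholder, and believability is collected as simple 1-5 ratings from a human reviewer rather than the paper's ranking procedure.

```python
"""Sketch of Evaluation GAB1: interview-style believability probes per category.

Adapted from the Generative Agents paper's five categories (self-knowledge,
memory retrieval, plan generation, reacting, reflecting), with questions
reoriented towards the agent's retained context about the user. The question
bank, the `chat(user_id, message)` client and the 1-5 rating collection are
all illustrative assumptions, not project APIs.
"""

# The paper uses five questions per category; one illustrative question each here.
INTERVIEW_QUESTIONS = {
    "self_knowledge": ["Who are you and what can you help me with?"],
    "memory_retrieval": ["What do you already know about my financial goals?"],
    "plan_generation": ["What would you suggest I do next, given what you know about me?"],
    "reacting": ["I've just lost my job - how does that change your advice?"],
    "reflecting": ["What have you learned about my priorities from our past conversations?"],
}


def chat(user_id: str, message: str) -> str:
    """Placeholder for the InferGPT chat endpoint; replace with a real client call."""
    return "(stub reply)"


def collect_rating(question: str, reply: str) -> int:
    """Believability rating on a 1-5 scale, supplied by a human reviewer (or panel)."""
    print(f"\nQ: {question}\nA: {reply}")
    return int(input("Believability 1-5: "))


def run_gab1(user_id: str = "dummy_user_1") -> dict[str, float]:
    """Interview the agent and report a mean believability score per category."""
    scores: dict[str, float] = {}
    for category, questions in INTERVIEW_QUESTIONS.items():
        ratings = [collect_rating(q, chat(user_id, q)) for q in questions]
        scores[category] = sum(ratings) / len(ratings)
    print("\nMean believability per category:", scores)
    return scores


if __name__ == "__main__":
    run_gab1()
```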

Evaluation GAB2

For the next general evaluation set, we will look at what the AutoGPT project has done.

AutoGPT has a benchmarking section in its wiki, where it states:

"🎯 Benchmark Measure your agent's performance! The agbenchmark can be used with any agent that supports the agent protocol, and the integration with the project's CLI makes it even easier to use with AutoGPT and forge-based agents. The benchmark offers a stringent testing environment. Our framework allows for autonomous, objective performance evaluations, ensuring your agents are primed for real-world action.

📦 agbenchmark on Pypi  |  📘 Learn More about the Benchmark"