fix: eval plugin docs #1814

Merged · 4 commits · Feb 4, 2025
30 changes: 14 additions & 16 deletions docs/evaluation.md
@@ -42,8 +42,8 @@ This section explains how to perform inference-based evaluation using Genkit.

### Setup
<ol>
<li>Use an existing Genkit app or create a new one by following our [Getting
started](get-started) guide.</li>
<li>Use an existing Genkit app or create a new one by following our [Get
started](get-started.md) guide.</li>
<li>Add the following code to define a simple RAG application to evaluate. For
this guide, we use a dummy retriever that always returns the same documents.

Expand All @@ -52,7 +52,6 @@ import { genkit, z, Document } from "genkit";
import {
googleAI,
gemini15Flash,
gemini15Pro,
} from "@genkit-ai/googleai";

// Initialize Genkit
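The rest of this sample app is collapsed in the diff above. For orientation, a dummy retriever matching the description ("always returns the same documents") could look roughly like the following sketch; it is not part of this PR, and the retriever name and document text are illustrative only:

```ts
import { genkit, Document } from "genkit";
import { googleAI } from "@genkit-ai/googleai";

// Initialize Genkit with the Google AI plugin.
const ai = genkit({ plugins: [googleAI()] });

// A dummy retriever that ignores the query and always returns the same documents.
export const dummyRetriever = ai.defineRetriever(
  { name: "dummyRetriever" },
  async () => ({
    documents: [
      "Dogs are descended from wolves.",
      "Adult cats are often lactose intolerant, so milk is not recommended.",
    ].map((text) => Document.fromText(text)),
  })
);
```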
@@ -163,7 +162,7 @@ to open the Datasets page.
c. Repeat steps (a) and (b) a couple more times to add more examples. This
guide adds the following example inputs to the dataset:

```
```none {:.devsite-disable-click-to-copy}
"Can I give milk to my cats?"
"From which animals did dogs evolve?"
```
@@ -173,8 +172,8 @@

### Run evaluation and view results

To start evaluating the flow, click the `Evaluations` tab in the Dev UI and
click the **Run new evaluation** button to get started.
To start evaluating the flow, click the **Run new evaluation** button on your
dataset page. You can also start a new evaluation from the `Evaluations` tab.

1. Select the `Flow` radio button to evaluate a flow.

@@ -233,7 +232,7 @@ and is only enforced if a schema is specified on the target flow.
control for advanced use cases (e.g. providing model parameters, message
history, tools, etc). You can find the full schema for `GenerateRequest` in
our [API reference
docs](https://js.api.genkit.dev/interfaces/genkit._.GenerateRequest.html).
docs](https://js.api.genkit.dev/interfaces/genkit._.GenerateRequest.html){: .external}.
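Concretely, a structured dataset entry following the `GenerateRequest` shape might look roughly like this; the values are invented for illustration, and the linked reference is the authoritative source for the schema:

```ts
// An illustrative dataset input in GenerateRequest form rather than a plain string.
// Only `messages` and `config` are shown here; see the API reference for all fields.
const structuredExampleInput = {
  messages: [
    { role: "user", content: [{ text: "Can I give milk to my cats?" }] },
  ],
  config: { temperature: 0.2 },
};
```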

Note: Schema validation is a helper tool for editing examples, but it is
possible to save an example with invalid schema. These examples may fail when
@@ -244,7 +243,7 @@ the running an evaluation.
### Genkit evaluators

Genkit includes a small number of native evaluators, inspired by
[RAGAS](https://docs.ragas.io/en/stable/), to help you get started:
[RAGAS](https://docs.ragas.io/en/stable/){: .external}, to help you get started:

* Faithfulness -- Measures the factual consistency of the generated answer
against the given context
@@ -256,7 +255,7 @@ harm, or exploit
### Evaluator plugins

Genkit supports additional evaluators through plugins, like the Vertex Rapid
Evaluators, which you access via the [VertexAI
Evaluators, which you can access via the [VertexAI
Plugin](./plugins/vertex-ai#evaluators).
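Evaluators, whether native or plugin-provided, are enabled through the Genkit plugin configuration. As a rough sketch (assuming the `@genkit-ai/evaluator` package and the metric names used by the CLI example later in this file; check the plugin docs for the exact options):

```ts
import { genkit } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";

// Register the native Genkit evaluators alongside the model plugin.
// MALICIOUSNESS and FAITHFULNESS are judged by the configured LLM;
// ANSWER_RELEVANCY can also be added but needs an embedder configured.
const ai = genkit({
  plugins: [
    googleAI(),
    genkitEval({
      judge: gemini15Flash,
      metrics: [GenkitMetric.MALICIOUSNESS, GenkitMetric.FAITHFULNESS],
    }),
  ],
});
```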

## Advanced use
@@ -316,7 +315,7 @@ for evaluation. To run on a subset of the configured evaluators, use the
`--evaluators` flag and provide a comma-separated list of evaluators by name:

```posix-terminal
genkit eval:flow qaFlow --input testInputs.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
genkit eval:flow qaFlow --input testInputs.json --evaluators=genkitEval/maliciousness,genkitEval/answer_relevancy
```
You can view the results of your evaluation run in the Dev UI at
`localhost:4000/evaluate`.
@@ -393,9 +392,8 @@ export const qaFlow = ai.defineFlow({
const factDocs = await ai.retrieve({
retriever: dummyRetriever,
query,
options: { k: 2 },
});
const factDocsModified = await run('factModified', async () => {
const factDocsModified = await ai.run('factModified', async () => {
// Let us use only facts that are considered silly. This is a
// hypothetical step for demo purposes, you may perform any
// arbitrary task inside a step and reference it in custom
@@ -408,7 +406,7 @@ export const qaFlow = ai.defineFlow({
const llmResponse = await ai.generate({
model: gemini15Flash,
prompt: `Answer this question with the given context ${query}`,
docs: factDocs,
docs: factDocsModified,
});
return llmResponse.text;
}
@@ -482,7 +480,7 @@ Here is an example flow that uses a PDF file to generate potential user
questions.

```ts
import { genkit, run, z } from "genkit";
import { genkit, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
import { chunk } from "llm-chunk"; // npm i llm-chunk
import path from "path";
@@ -515,9 +513,9 @@ export const synthesizeQuestions = ai.defineFlow(
async (filePath) => {
filePath = path.resolve(filePath);
// `extractText` loads the PDF and extracts its contents as text.
const pdfTxt = await run("extract-text", () => extractText(filePath));
const pdfTxt = await ai.run("extract-text", () => extractText(filePath));

const chunks = await run("chunk-it", async () =>
const chunks = await ai.run("chunk-it", async () =>
chunk(pdfTxt, chunkingConfig)
);

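The `extractText` helper referenced in this flow is elided from the diff. It is typically a thin wrapper over a PDF parsing library; a minimal sketch assuming the `pdf-parse` package (not part of this PR) might be:

```ts
import { readFile } from "node:fs/promises";
import pdf from "pdf-parse"; // npm i pdf-parse

// Load a PDF from disk and return its contents as plain text.
async function extractText(filePath: string): Promise<string> {
  const pdfBuffer = await readFile(filePath);
  const parsed = await pdf(pdfBuffer);
  return parsed.text;
}
```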
34 changes: 16 additions & 18 deletions docs/plugin-authoring-evaluator.md
@@ -61,23 +61,22 @@ function getDeliciousnessPrompt(ai: Genkit) {
output: {
schema: DeliciousnessDetectionResponseSchema,
}
},
`You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.
prompt: `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

Examples:
Output: Chicken parm sandwich
Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }
Examples:
Output: Chicken parm sandwich
Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

Output: Boston Logan Airport tarmac
Response: { "reason": "Not edible.", "verdict": "no" }
Output: Boston Logan Airport tarmac
Response: { "reason": "Not edible.", "verdict": "no" }

Output: A juicy piece of gossip
Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }
Output: A juicy piece of gossip
Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

New Output: {% verbatim %}{{ responseToTest }} {% endverbatim %}
Response:
`
);
New Output: {% verbatim %}{{ responseToTest }} {% endverbatim %}
Response:
`
});
}
```

@@ -91,7 +90,7 @@ responsibility of the evaluator to validate that all fields required for
evaluation are present.

```ts
import { ModelArgument, z } from 'genkit';
import { ModelArgument } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
@@ -100,6 +99,7 @@ import { BaseEvalDataPoint, Score } from 'genkit/evaluator';
export async function deliciousnessScore<
CustomModelOptions extends z.ZodTypeAny,
>(
ai: Genkit,
judgeLlm: ModelArgument<CustomModelOptions>,
dataPoint: BaseEvalDataPoint,
judgeConfig?: CustomModelOptions
@@ -141,8 +141,7 @@ export async function deliciousnessScore<
The final step is to write a function that defines the `EvaluatorAction`.

```ts
import { Genkit, z } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';
import { EvaluatorAction } from 'genkit/evaluator';

/**
* Create the Deliciousness evaluator action.
@@ -162,7 +161,7 @@ export function createDeliciousnessEvaluator<
isBilled: true,
},
async (datapoint: BaseEvalDataPoint) => {
const score = await deliciousnessScore(judge, datapoint, judgeConfig);
const score = await deliciousnessScore(ai, judge, datapoint, judgeConfig);
return {
testCaseId: datapoint.testCaseId,
evaluation: score,
@@ -245,7 +244,6 @@ As with the LLM-based evaluator, define the scoring function. In this case,
the scoring function does not need a judge LLM.

```ts
import { EvalResponses } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
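The diff is truncated inside this heuristic example. For orientation, a regex-based scoring function in this style generally looks like the following sketch; the regex and the exact `Score` fields are assumptions, not content from this PR:

```ts
import { BaseEvalDataPoint, Score } from "genkit/evaluator";

// Matches common US phone number formats, e.g. "(555) 123-4567" or "555-123-4567".
const US_PHONE_REGEX = /^\+?\(?[0-9]{3}\)?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4}$/;

// Scores whether the flow output looks like a US phone number.
// No judge LLM is involved; this is a pure heuristic check.
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const output = dataPoint.output;
  if (typeof output !== "string") {
    throw new Error("String output is required for regex matching.");
  }
  const matches = US_PHONE_REGEX.test(output);
  return {
    score: matches,
    details: {
      reasoning: matches
        ? "Output matched US_PHONE_REGEX."
        : "Output did not match US_PHONE_REGEX.",
    },
  };
}
```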