
Guardrails Evaluation

NeMo Guardrails includes a set of tools that you can use to evaluate the different types of rails. In the current version, these tools test the performance of each type of rail individually. You can use the evaluation tools through the nemoguardrails CLI; examples are provided below for each type of rail.

We also provide preliminary results on the performance of the rails on public datasets relevant to each task.

Dialog Rails

Aim and Usage

Dialog rails evaluation focuses on the core NeMo Guardrails mechanism for guiding conversations using canonical forms and dialogue flows. More details about this core functionality are explained here.

Thus, the dialog rails evaluation assesses the performance of:

  1. User canonical form (intent) generation.
  2. Next step generation. In the current approach, we only assess the generation of the bot canonical form as the next step in a flow.
  3. Bot message generation.

The CLI command for evaluating the dialog rails is:

$ nemoguardrails evaluate topical --config=<rails_app_path> --verbose

A dialog rails evaluation has the following CLI parameters:

  • config: The Guardrails app to be evaluated.
  • verbose: Whether the Guardrails app should be run in verbose mode.
  • test-percentage: Percentage of the samples for each intent to be used as the test set.
  • max-tests-intent: Maximum number of test samples per intent to be used when testing (useful for building a balanced test set from unbalanced datasets). If the value is 0, this parameter is not used.
  • max-samples-intent: Maximum number of samples per intent to be used in the vector database. If the value is 0, all samples not in the test set are used.
  • results-frequency: If set, intermediate evaluation results are printed every this many test samples.
  • sim-threshold: If larger than 0, for generated user intents that do not exactly match a defined canonical form, pick the most similar canonical form if its similarity is above this threshold.
  • random-seed: Random seed used by the evaluation.
  • output-dir: Output directory for predictions.
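
For example, an illustrative evaluation run (the values and paths below are placeholders, and the flags are assumed to mirror the parameter names listed above) that holds out 30% of the samples per intent, caps the test set at 3 samples per intent, enables similarity matching with a 0.6 threshold, and writes the predictions to a local folder:

$ nemoguardrails evaluate topical --config=<rails_app_path> --test-percentage 30 --max-tests-intent 3 --sim-threshold 0.6 --random-seed 42 --output-dir ./eval_outputs/topical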

Evaluation Results

For the initial dialog rails evaluation experiments, we have used two conversational NLU datasets: a chit-chat dataset and a banking dataset.

Each dataset was transformed into a NeMo Guardrails app by defining canonical forms for each intent, specific dialogue flows, and, for the chit-chat dataset only, bot messages. The two datasets have a large number of user intents and thus of dialog rails. The chit-chat dataset is very generic with coarser-grained intents, while the banking dataset is domain-specific and more fine-grained. More details about running the dialog rails evaluation experiments and the evaluation datasets are available here.

Preliminary evaluation results follow. In all experiments, we have used a balanced test set with at most 3 samples per intent. For both datasets, we have assessed the performance of various LLMs, and also the effect of the number of samples per intent indexed in the vector database (k = all, 3, 1).

Take into account that the performance of an LLM is heavily dependent on the prompt, especially given the more complex prompts used by Guardrails. Therefore, results are currently released only for a limited set of models, with more to follow in the next releases. All results are preliminary, as better prompting can improve them.

Important lessons to be learned from the evaluation results:

  • Each step in the three-step approach (user intent, next step/bot intent, bot message) used by Guardrails offers an improvement in performance.
  • It is important to have at least k=3 samples in the vector database for each user intent (canonical form) to achieve good performance.
  • Some models (e.g., gpt-3.5-turbo) produce a wider variety of canonical forms, even with the few-shot prompting used by Guardrails. In these cases, it is useful to add a similarity match instead of exact match for user intents. In this case, the similarity threshold becomes an important inference parameter.
  • Initial results show that even small models, e.g., dolly-v2-3b, vicuna-7b-v1.3, mpt-7b-instruct, and falcon-7b-instruct, have good performance for topical rails.
  • Using a single call for topical rails shows results similar to the default method (which uses up to 3 LLM calls to generate the final bot message) in most cases for the text-davinci-003 model.
  • Initial experiments show that using compact prompts has similar or even better performance on these two datasets compared to using the longer prompts.

Evaluation Date - June 21, 2023. Updated July 24, 2023 for Dolly, Vicuna and Mosaic MPT models.

| Dataset | # intents | # test samples |
|---|---|---|
| chit-chat | 76 | 226 |
| banking | 77 | 231 |

Results on the chit-chat dataset. The metric used is accuracy.

| Model | User intent, w/o sim | User intent, sim=0.6 | Bot intent, w/o sim | Bot intent, sim=0.6 | Bot message, w/o sim | Bot message, sim=0.6 |
|---|---|---|---|---|---|---|
| text-davinci-003, k=all | 0.89 | 0.89 | 0.90 | 0.90 | 0.91 | 0.91 |
| text-davinci-003, k=all, single call | 0.89 | N/A | 0.91 | N/A | 0.91 | N/A |
| text-davinci-003, k=all, compact | 0.90 | N/A | 0.91 | N/A | 0.91 | N/A |
| text-davinci-003, k=3 | 0.82 | N/A | 0.85 | N/A | N/A | N/A |
| text-davinci-003, k=1 | 0.65 | N/A | 0.73 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, k=all | 0.88 | N/A | 0.88 | N/A | 0.88 | N/A |
| gpt-3.5-turbo-instruct, single call | 0.90 | N/A | 0.91 | N/A | 0.91 | N/A |
| gpt-3.5-turbo-instruct, compact | 0.89 | N/A | 0.89 | N/A | 0.90 | N/A |
| gpt-3.5-turbo, k=all | 0.44 | 0.56 | 0.50 | 0.61 | 0.54 | 0.65 |
| llama2-13b-chat, k=all | 0.87 | N/A | 0.88 | N/A | 0.89 | N/A |
| dolly-v2-3b, k=all | 0.80 | 0.82 | 0.81 | 0.83 | 0.81 | 0.83 |
| vicuna-7b-v1.3, k=all | 0.62 | 0.75 | 0.69 | 0.77 | 0.71 | 0.79 |
| mpt-7b-instruct, k=all | 0.73 | 0.81 | 0.78 | 0.82 | 0.80 | 0.82 |
| falcon-7b-instruct, k=all | 0.81 | 0.81 | 0.81 | 0.82 | 0.82 | 0.82 |

Results on the banking dataset. The metric used is accuracy.

| Model | User intent, w/o sim | User intent, sim=0.6 | Bot intent, w/o sim | Bot intent, sim=0.6 | Bot message, w/o sim | Bot message, sim=0.6 |
|---|---|---|---|---|---|---|
| text-davinci-003, k=all | 0.77 | 0.82 | 0.83 | 0.84 | N/A | N/A |
| text-davinci-003, k=all, single call | 0.75 | N/A | 0.81 | N/A | N/A | N/A |
| text-davinci-003, k=all, compact | 0.86 | N/A | 0.86 | N/A | N/A | N/A |
| text-davinci-003, k=3 | 0.65 | N/A | 0.73 | N/A | N/A | N/A |
| text-davinci-003, k=1 | 0.50 | N/A | 0.63 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, k=all | 0.73 | N/A | 0.74 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, single call | 0.81 | N/A | 0.83 | N/A | N/A | N/A |
| gpt-3.5-turbo-instruct, compact | 0.86 | N/A | 0.87 | N/A | N/A | N/A |
| gpt-3.5-turbo, k=all | 0.38 | 0.73 | 0.45 | 0.73 | N/A | N/A |
| llama2-13b-chat, k=all | 0.76 | N/A | 0.77 | N/A | N/A | N/A |
| dolly-v2-3b, k=all | 0.32 | 0.62 | 0.40 | 0.64 | N/A | N/A |
| vicuna-7b-v1.3, k=all | 0.39 | 0.62 | 0.54 | 0.65 | N/A | N/A |
| mpt-7b-instruct, k=all | 0.45 | 0.58 | 0.50 | 0.60 | N/A | N/A |
| falcon-7b-instruct, k=all | 0.70 | 0.75 | 0.76 | 0.78 | N/A | N/A |

Input and Output Rails

Fact-checking Rails

In the Guardrails library, we provide two out-of-the-box approaches for the fact-checking rail: Self-Check fact-checking and AlignScore. For more details, read the library guide.

Self-Check

In this approach, the fact-checking rail is implemented as an entailment prediction problem. Given an evidence passage and the predicted answer, we prompt an LLM to predict yes/no on whether the answer is grounded in the evidence. This is the default approach.

AlignScore

This approach is based on the AlignScore model (Zha et al., 2023). Given an evidence passage and the predicted answer, the model is fine-tuned to predict that they are aligned when:

  1. All the information in the predicted answer is present in the evidence passage, and
  2. None of the information in the predicted answer contradicts the evidence passage.

The response is a value between 0.0 and 1.0. In our testing, the best average accuracies were observed with a threshold of 0.7.

Please see the user guide documentation for detailed steps on how to configure your deployment to use AlignScore.

Evaluation

To run the fact-checking rail evaluation, you can use the following CLI command:

$ nemoguardrails evaluate fact-checking --config=path/to/guardrails/config

Here is a list of arguments that you can use to configure the fact-checking rail evaluation:

  • config: The path to the guardrails configuration (this includes the LLM, the prompts and any other information).

  • dataset-path: Path to the dataset. It should be a JSON file with the following format:

    [
        {
            "question": "question text",
            "answer": "answer text",
            "evidence": "evidence text"
        }
    ]
    
  • num-samples: Number of samples to run the eval on. The default is 50.

  • create-negatives: Whether to generate synthetic negative examples or not. The default is True.

  • output-dir: The directory to save the output to. The default is eval_outputs/factchecking.

  • write-outputs: Whether to write the outputs to a file or not. The default is True.
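
For example, an illustrative run (the dataset path and values below are placeholders, and the flags are assumed to mirror the argument names listed above) that evaluates 50 samples from a custom JSON file and generates synthetic negative examples:

$ nemoguardrails evaluate fact-checking --config=path/to/guardrails/config --dataset-path path/to/fact_checking_data.json --num-samples 50 --create-negatives True --output-dir eval_outputs/factchecking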

More details on how to set up the data in the right format and run the evaluation on your own dataset can be found here.

Evaluation Results

Evaluation Date - Nov 23, 2023.

We evaluate the performance of the fact-checking rail on the MSMARCO dataset using the Self-Check and the AlignScore approaches. To build the dataset, we randomly sample 100 (question, correct answer, evidence) triples, and then, for each triple, build a non-factual or incorrect answer to yield 100 (question, incorrect answer, evidence) triples.

We break down the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model in correctly identifying answers that are not supported by the evidence. Details on how to create synthetic negative examples can be found here.

| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy | Average Time Per Checked Fact (ms) |
|---|---|---|---|---|
| text-davinci-003 | 70.0% | 93.0% | 81.5% | 272.2ms |
| gpt-3.5-turbo | 76.0% | 89.0% | 82.5% | 435.1ms |
| gpt-3.5-turbo-instruct | 92.0% | 69.0% | 80.5% | 188.8ms |
| align_score-base* | 81.0% | 88.0% | 84.5% | 23.0ms ^ |
| align_score-large* | 87.0% | 90.0% | 88.5% | 46.0ms ^ |

*The threshold used for align_score is 0.7, i.e., an align_score >= 0.7 is considered a factual statement, and an align_score < 0.7 signifies an incorrect statement.

^When the AlignScore model is loaded in memory and inference is carried out without network overheads, i.e., not as a RESTful service.

Moderation Rails

The moderation rails involve two components: jailbreak detection and output moderation.

  • The jailbreak detection attempts to flag user inputs that could potentially cause the model to output unsafe content.
  • The output moderation attempts to filter the language model output to avoid unsafe content from being displayed to the user.

Self-Check

This rail prompts the LLM with a custom prompt for input (jailbreak) and output moderation. Common reasons for rejecting user input include jailbreak attempts, harmful or abusive content, or other inappropriate instructions. For more details, consult the Guardrails library guide.

Evaluation

The jailbreak and output moderation can be evaluated using the following CLI command:

$ nemoguardrails evaluate moderation --config=path/to/guardrails/config

The arguments that can be passed to evaluate the moderation rails are:

  • config: The path to the guardrails configuration (this includes the LLM, the prompts and any other information).
  • dataset-path: Path to the dataset to evaluate the rails on. The dataset should contain one prompt per line.
  • split: The split of the dataset to evaluate on. Choices are 'helpful' or 'harmful'. This selection is used to determine the appropriate label for the predictions.
  • num-samples: Number of samples to evaluate. Default is 50.
  • check-input: Whether to evaluate the jailbreak rail. Default is True.
  • check-output: Whether to evaluate the output moderation rail. Default is True.
  • output-path: Folder to write the results to. Default is 'eval_outputs/moderation'.
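
For example, an illustrative run (the dataset path and values below are placeholders, and the flags are assumed to mirror the argument names listed above) that evaluates both rails on 100 prompts from a harmful split:

$ nemoguardrails evaluate moderation --config=path/to/guardrails/config --dataset-path path/to/harmful_prompts.txt --split harmful --num-samples 100 --output-path eval_outputs/moderation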

It is also possible to evaluate each of the rails individually. To evaluate the jailbreak rail only, use the following command:

$ nemoguardrails evaluate moderation --check-output False --config=path/to/guardrails/config

To evaluate the output moderation rail only, use the following command:

$ nemoguardrails evaluate moderation --check-input False --config=path/to/guardrails/config

More details on how to set up the data in the right format and run the evaluation on your own dataset can be found here.

Evaluation Results

We evaluate the moderation rails on the Anthropic Red Team Attempts dataset and the Anthropic Helpful Base dataset. These datasets contain prompts that are labeled by humans as either harmful or helpful. We randomly sample 100 prompts from each of the splits and run the evaluation for the models listed in the table below.

Evaluation Date - June 02, 2023.

We break down the performance of the models on the two rails into the following metrics:

  • % of the prompts that are blocked on the Red Team Attempts dataset
  • % of the prompts that are allowed on the Helpful Base dataset

We want the models to block as many harmful prompts as possible and allow as many helpful prompts as possible.

Moderation Rails Performance

These results were obtained using the Simple prompt defined in the LLM Self-Checking method. For more details, see the Guardrails library.

| Model | % of harmful prompts blocked | % of helpful prompts allowed |
|---|---|---|
| text-davinci-003 | 80 | 97 |
| gpt-3.5-turbo | 70 | 100 |
| gpt-3.5-turbo-instruct | 78 | 97 |
| nemollm-43b | 88 | 84 |

Hallucination Rails

For general questions that the model uses parametric knowledge to answer, we can define a hallucination rail to detect when the model is potentially making up facts. The default implementation of the hallucination rails is based on SelfCheckGPT.

  • Given a question, we sample multiple answers from the model, often at a high temperature (temp=1.0).
  • We then check if the answers are consistent with each other. This agreement check is implemented using an LLM call similar to the fact checking rail.
  • If the answers are inconsistent, it indicates that the model might be hallucinating.

Self-Check

This rail uses the LLM itself to check, with a custom prompt, whether the sampled answers are consistent. The custom prompt can be similar to an NLI task. For more details, consult the Guardrails library guide.

Evaluation

To run the hallucination rail evaluation, use the following CLI command:

$ nemoguardrails evaluate hallucination --config=path/to/guardrails/config

Here is a list of arguments that you can use to configure the hallucination rail evaluation:

  • config: The path to the guardrails configuration (this includes the LLM, the prompts and any other information).
  • dataset-path: Path to the dataset. It should be a text file with one question per line.
  • num-samples: Number of samples to run the eval on. Default is 50.
  • output-dir: The directory to save the output to. Default is eval_outputs/hallucination.
  • write-outputs: Whether to write the outputs to a file or not. Default is True.
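
For example, an illustrative run (the dataset path and values below are placeholders, and the flags are assumed to mirror the argument names listed above) that evaluates 25 questions from a custom file:

$ nemoguardrails evaluate hallucination --config=path/to/guardrails/config --dataset-path path/to/questions.txt --num-samples 25 --output-dir eval_outputs/hallucination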

To evaluate the hallucination rail on your own dataset, create a text file with one question per line and run the evaluation using the following command:

$ nemoguardrails evaluate hallucination --dataset-path <path-to-your-text-file>

Evaluation Results

To evaluate the hallucination rail, we manually curate a set of questions which mainly consists of questions with a false premise, i.e., questions that cannot have a correct answer.

For example, the question "What is the capital of the moon?" has a false premise since the moon does not have a capital. Since the question is stated in a way that implies that the moon has a capital, the model might be tempted to make up a fact and answer the question.

We then run the hallucination rail on these questions and check if the model is able to detect the hallucination. We run the evaluation using OpenAI text-davinci-003 and gpt-3.5-turbo models.

Evaluation Date - June 12, 2023.

We break down the performance into the following metrics:

  • % of questions that are intercepted by the model alone, i.e., questions for which the model itself detects that they are not answerable
  • % of questions that are intercepted by the model + hallucination rail, i.e., questions for which either the model detects that they are not answerable or the hallucination rail detects that the model is making up facts

| Model | % intercepted - model | % intercepted - model + hallucination rail |
|---|---|---|
| text-davinci-003 | 0 | 70 |
| gpt-3.5-turbo | 65 | 90 |

We find that gpt-3.5-turbo is able to intercept 65% of the questions and identify them as not answerable on its own. Adding the hallucination rail helps intercept 25% more questions and prevents the model from making up facts.