From 05b3999441dd7a0d819e7e27b700cf8e7042625d Mon Sep 17 00:00:00 2001
From: Joseph Spisak
Date: Fri, 8 Dec 2023 07:25:36 -0800
Subject: [PATCH] Update MODEL_CARD.md

Updated the paper link and fixed the image link.
---
 Llama-Guard/MODEL_CARD.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Llama-Guard/MODEL_CARD.md b/Llama-Guard/MODEL_CARD.md
index 944a593e1..0e83754f4 100644
--- a/Llama-Guard/MODEL_CARD.md
+++ b/Llama-Guard/MODEL_CARD.md
@@ -8,7 +8,9 @@ It acts as an LLM: it generates text in its output that indicates whether a
 given prompt or response is safe/unsafe, and if unsafe based on a policy, it
 also lists the violating subcategories. Here is an example:
 
-![](Llama Guard_example.png)
+<p align="center">
+  <img src="…" />
+</p>
 
 In order to produce classifier scores, we look at the probability for the
 first token, and turn that into an “unsafe” class probability. Model users can then
@@ -96,7 +98,7 @@ include [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) and
 
 Note: comparisons are not exactly apples-to-apples due to mismatches in each
 taxonomy. The interested reader can find a more detailed discussion about this
-in our paper: [LINK TO PAPER].
+in our [paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/).
 
 | | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) |
 | --------------- | --------------------- | ---------- | --------- | ----------------------- |
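The context lines in the first hunk describe turning the probability of the first generated token into an “unsafe” class score that users can threshold. Below is a minimal sketch of that scoring step, not taken from the model card itself; it assumes the Hugging Face `meta-llama/LlamaGuard-7b` checkpoint, a prompt already formatted with the Llama Guard taxonomy template, and that "safe"/"unsafe" each begin with a single vocabulary token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/LlamaGuard-7b"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def unsafe_probability(formatted_prompt: str) -> float:
    """Score the first output token and renormalize over the two labels."""
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)

    # Assumption: "safe" / "unsafe" each start with a single leading token.
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]

    p_safe, p_unsafe = next_token_probs[safe_id], next_token_probs[unsafe_id]
    return (p_unsafe / (p_safe + p_unsafe)).item()

# Binary decision by thresholding the score, as the model card describes:
# flagged = unsafe_probability(prompt) > 0.5
```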