Attack to induce hallucinations in LLMs
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Restore safety in fine-tuned language models through task arithmetic (see the sketch after this list)
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
NeurIPS'24 - LLM Safety Landscape
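The task-arithmetic entry above operates on model weights directly. Below is a minimal, hypothetical sketch of the general idea (not that repository's exact method): compute a "safety vector" as the parameter-wise difference between a safety-aligned checkpoint and an unaligned one, then add a scaled copy of that vector to the fine-tuned model's weights. The checkpoint names and the `scale` coefficient are illustrative assumptions.

```python
# Minimal sketch of task arithmetic for safety restoration, assuming three
# checkpoints are available: a safety-aligned base, an unaligned base, and a
# fine-tuned model whose safety we want to restore. All names are illustrative.
import torch


def safety_vector(aligned: dict, unaligned: dict) -> dict:
    """Parameter-wise difference capturing the 'safety direction'."""
    return {k: aligned[k] - unaligned[k] for k in aligned}


def restore_safety(finetuned: dict, vector: dict, scale: float = 1.0) -> dict:
    """Add a scaled safety vector back into the fine-tuned weights."""
    return {k: finetuned[k] + scale * vector[k] for k in finetuned}


if __name__ == "__main__":
    # Toy stand-ins for real state_dicts (e.g. from model.state_dict()).
    shape = (4, 4)
    aligned = {"w": torch.randn(shape)}
    unaligned = {"w": torch.randn(shape)}
    finetuned = {"w": torch.randn(shape)}

    v = safety_vector(aligned, unaligned)
    restored = restore_safety(finetuned, v, scale=0.5)
    print(restored["w"].shape)
```

In practice the same arithmetic would be applied to full `model.state_dict()` tensors, with `scale` tuned to trade off safety restoration against downstream task performance.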