Update README.md files (#9)
* Update CybersecurityBenchmarks README file.

* Update the README file in the CybersecurityBenchmarks directory.

* Update the README file.
zbraiterman authored Apr 5, 2024
1 parent 29afbd8 commit 60cd4d5
Showing 2 changed files with 45 additions and 27 deletions.
68 changes: 43 additions & 25 deletions CybersecurityBenchmarks/README.md
# Introduction

This repository contains the implementation of [CyberSecEval](https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/), a comprehensive benchmark suite for evaluating the cybersecurity risks of Large Language Models (LLMs). Developed in response to the growing need for secure AI systems, CyberSecEval provides tools to assess two key security domains: the propensity of LLMs to generate insecure code and their compliance when asked to assist in cyberattacks. The benchmarks are based on industry standards such as the Common Weakness Enumeration (CWE) and the MITRE ATT&CK framework, and have been applied to popular LLMs including Llama 2, Code Llama, and OpenAI GPT models. The results highlight significant cybersecurity risks, emphasizing the need for ongoing research and development in AI safety.

The repository includes three types of benchmarks:

1. **MITRE Tests**: These tests use the MITRE ATT&CK framework to evaluate an LLM's compliance when asked to assist in cyberattacks.

2. **Instruct Tests**: These tests assess an LLM's propensity to generate insecure code when given a specific instruction.

3. **Autocomplete Tests**: These tests measure how often an LLM suggests insecure coding practices in autocomplete contexts, where the LLM predicts subsequent code based on preceding code.

# Getting Started

## Prerequisites

Ensure you have [weggli](https://github.com/weggli-rs/weggli) installed. The published release is outdated, so we currently use revision `9d97d462854a9b682874b259f70cc5a97a70f2cc`:

```
cargo install weggli --rev=9d97d462854a9b682874b259f70cc5a97a70f2cc --git=https://github.com/weggli-rs/weggli
```

Set the environment variable `WEGGLI_PATH` to point to the `weggli` executable. For example, if `weggli` is already on your `PATH`, use the command below.

```
export WEGGLI_PATH=weggli
```
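
As a quick sanity check (not part of the original instructions), you can confirm that the variable points at a working binary; this assumes your `weggli` build accepts `--help`, which clap-based tools typically do:

```
"$WEGGLI_PATH" --help
```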

## Set up Virtual Environment

Set up your virtual environment and install dependencies:

```
python3 -m venv ~/.venvs/CybersecurityBenchmarks
```

```
source ~/.venvs/CybersecurityBenchmarks/bin/activate
```

```
pip3 install -r CybersecurityBenchmarks/requirements.txt
```

## Running the Benchmarks

The main benchmark module is `CybersecurityBenchmarks.benchmark.run`. Run

```
python3 -m CybersecurityBenchmarks.benchmark.run --help
```

to get a list of available options.

We simplify the following commands by setting a `DATASETS` environment variable:

```
export DATASETS=$PWD/CybersecurityBenchmarks/datasets
```

We currently support three kinds of benchmarks: MITRE, instruct, and autocomplete. Please refer to the whitepaper for a detailed discussion of each.

Each benchmark can run tests for multiple LLMs. Our command line interface uses the format `<PROVIDER>::<MODEL>::<API KEY>` to specify an LLM to test. We currently support the following LLMs:
- `ANYSCALE::meta-llama/Llama-2-13b-chat-hf::<API KEY>`
- `TOGETHER::togethercomputer/llama-2-7b::<API KEY>`
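
For illustration only (whether `--llm-under-test` may be repeated is inferred from the note above that each benchmark can test multiple LLMs), a run against two of the providers listed above might look like the following, with the remaining options elided:

```
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=instruct \
   ... \
   --llm-under-test="ANYSCALE::meta-llama/Llama-2-13b-chat-hf::<API KEY>" \
   --llm-under-test="TOGETHER::togethercomputer/llama-2-7b::<API KEY>"
```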

### How to use a different model supported by Anyscale or Together

1. Look up the models supported by [Anyscale](https://d3qavlo5goipcw.cloudfront.net/guides/models#select-a-model) or [Together](https://docs.together.ai/docs/fine-tuning-models).
2. Update the `valid_models` method in [llm.py](https://github.com/facebookresearch/PurpleLlama/blob/main/CybersecurityBenchmarks/benchmark/llm.py) with the name of the model you want to test, as sketched below.
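
A hypothetical sketch of that edit (the actual method signature and existing entries in `llm.py` may differ; adjust to match the code you find there):

```
    def valid_models(self) -> list[str]:
        # Model identifiers accepted for this provider; append the name you
        # looked up in step 1. The last entry below is an example addition.
        return [
            "togethercomputer/llama-2-7b",
            "togethercomputer/llama-2-13b",
        ]
```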

### How to run benchmarks for self-hosted models

1. Extend `llm.py` with a class for your self-hosted model (see the sketch after this list).
2. Implement your inference logic in `def query(self, prompt: str) -> str:`.
3. Make sure the `query` function always returns the LLM's response as a string.
4. Add the name of the LLM class you created in step 1 to the supported providers in the `llm.create` method.
5. Pass your LLM to the `--llm-under-test` option in the format `LLM Name::model-name::random-string`.
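
A minimal sketch of steps 1-3, assuming the self-hosted model is reachable over HTTP; the class name, endpoint, and response shape are hypothetical, and the `LLM` base class is whatever `llm.py` already defines:

```
import requests  # assumes an HTTP inference server; any transport works

class SelfHostedLLM(LLM):  # hypothetical name; LLM is the base class in llm.py
    def query(self, prompt: str) -> str:
        # Step 2: send the prompt to your own inference server.
        resp = requests.post(
            "http://localhost:8080/generate",  # hypothetical endpoint
            json={"prompt": prompt},
            timeout=600,
        )
        resp.raise_for_status()
        # Step 3: the benchmark expects a plain string, never None or a dict.
        return str(resp.json().get("text", ""))
```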

### Other notes about running benchmarks

- Please run these commands from the root directory of the PurpleLlama project.
- `run-llm-in-parallel` makes async parallel calls to the specified LLM provider. This speeds up prompt evaluation.

## Running the MITRE Benchmark

```
python3 -m CybersecurityBenchmarks.benchmark.run \
   ...
```

The MITRE benchmarks are run in three steps:

- _Processing prompts_ - The LLM is called (in parallel, if specified) to render the LLM output for each prompt.
- _Response expansion_ - The expansion LLM expands the initial responses from the prompt.
- _Judging expanded responses_ - The judge LLM decides whether the expanded responses would be helpful in implementing a cyberattack.

The expansion LLM and the judge LLM are independent of the LLM under test that processes the prompts. Although we used Llama and Code Llama models for the original paper, we recommend using GPT-3.5, which requires an OpenAI API key.
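
The remaining options in the command above are collapsed in this diff. As a rough, unofficial sketch of how the expansion and judge LLMs might be wired in (the option names and dataset path here are assumptions; treat the `--help` output as authoritative):

```
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=mitre \
   --prompt-path="$DATASETS/mitre/mitre_prompts.json" \
   --response-path="$DATASETS/mitre_responses.json" \
   --judge-response-path="$DATASETS/mitre_judge_responses.json" \
   --stat-path="$DATASETS/mitre_stat.json" \
   --expansion-llm="OPENAI::gpt-3.5-turbo::<API KEY>" \
   --judge-llm="OPENAI::gpt-3.5-turbo::<API KEY>" \
   --llm-under-test="TOGETHER::togethercomputer/llama-2-7b::<API KEY>" \
   --run-llm-in-parallel
```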

## Running Instruct and Autocomplete Benchmarks

For instruct benchmarks, run:

```
python3 -m CybersecurityBenchmarks.benchmark.run \
--benchmark=instruct \
   ...
```

For autocomplete benchmarks, run:

```
python3 -m CybersecurityBenchmarks.benchmark.run \
--benchmark=autocomplete \
   ...
```

Both the instruct and autocomplete benchmarks are run in two steps:

- _Processing prompts_ - The LLM is called (in parallel, if specified) to render the LLM output for each prompt.
- _Processing responses_ - We use the insecure code detector (ICD) to evaluate each response and determine whether vulnerable code was produced.

## Results

Once the benchmarks have run, the evaluations of each model across each language will be available under the `stat_path`:

### MITRE Results

```
{
    ...
}
```

Responses for each prompt are categorized as _refusal_, _malicious_, or _benign_.

### Instruct and Autocomplete Results

```
{
    ...
}
```

_`bleu`_ - The BLEU score measures how closely the LLM output matches the original code that the test case was based on. We use [`sacrebleu`](https://github.com/mjpost/sacrebleu) to compute it.

_`vulnerable_percentage`_ - The percentage of test cases evaluated to be vulnerable for that language (equal to `vulnerable_suggestion_count / total_count`).

_`pass_rate`_ - The topline rate of "safe" LLM suggestions across the test set, equal to `1 - vulnerable_percentage`.
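
As a small illustration of how these fields relate, the sketch below reads a stats file and recomputes `pass_rate`; the file name and the model-then-language nesting are assumptions rather than documented output:

```
import json

# Hypothetical reader for the stats file written to the stat path; adjust the
# path and nesting to match the actual output on your machine.
with open("instruct_stat.json") as f:
    stats = json.load(f)

for model, languages in stats.items():
    for language, metrics in languages.items():
        vulnerable_percentage = metrics["vulnerable_suggestion_count"] / metrics["total_count"]
        pass_rate = 1 - vulnerable_percentage  # topline rate of "safe" suggestions
        print(f"{model} / {language}: pass_rate={pass_rate:.1%}, bleu={metrics['bleu']:.2f}")
```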

## Exit the Virtual Environment

Once you're done, you can leave the Python virtual environment with:

```
deactivate
```

# Developer's Guide

To run the unit tests, follow the setup instructions above. You can then run:

```
(CybersecurityBenchmarks) $ cd CybersecurityBenchmarks
```

```
(CybersecurityBenchmarks) $ python3 -m unittest
```
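
To see each test name as it runs, unittest's standard verbose flag can be added:

```
(CybersecurityBenchmarks) $ python3 -m unittest -v
```
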
4 changes: 2 additions & 2 deletions README.md

To get started and learn how to use Purple Llama components with Llama models,
see the getting started guide [here](https://ai.meta.com/llama/get-started/).
The guide provides information and resources to help you set up Llama, including
how to access the model, hosting how-to information, and integration guides. Additionally,
you will find supplemental materials to further assist you while responsibly
building with Llama. The guide will be updated as more Purple Llama components
get released.
