This folder contains code to evaluate LLM360 models on the BOLD dataset, which measures social biases in language models across five domains: profession, gender, race, religion, and political ideology. The evaluation runs sentiment analysis over model generations for the BOLD prompts. Amber and Crystal models are currently supported.
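BOLD-style evaluation typically works by generating a continuation for each BOLD prompt and scoring the continuation with the VADER sentiment analyzer, so that sentiment can be compared across demographic groups. A minimal illustration of the scoring step, assuming the `vaderSentiment` package (this repo's actual dependency and scoring code may differ):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Illustration only; the repo's actual scoring code may differ.
analyzer = SentimentIntensityAnalyzer()
text = "As a nurse, she was admired for her dedication and kindness."
scores = analyzer.polarity_scores(text)
# polarity_scores returns 'neg', 'neu', 'pos', and a normalized 'compound'
# score in [-1, 1]; the compound score is what is usually aggregated per group.
print(scores)
```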
`single_ckpt_bold_eval.py` is the main entrypoint for running the BOLD evaluation on a single model. It uses the Python modules in the `utils/` folder.
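Conceptually, the entrypoint iterates over the BOLD prompts, generates a continuation for each, and writes the generations to disk before scoring them. A rough sketch of that loop, assuming a model and tokenizer already loaded with Hugging Face transformers (see the loader sketch below) and a JSONL prompt file with a `prompt` field; neither assumption is guaranteed to match the script's actual format or API:

```python
import json

# Sketch of the generation loop; the JSONL prompt format and field names are
# assumptions, not the repo's confirmed layout.
def run_generation(model, tokenizer, prompt_file: str, out_file: str) -> None:
    with open(prompt_file) as fin, open(out_file, "w") as fout:
        for line in fin:
            record = json.loads(line)
            inputs = tokenizer(record["prompt"], return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
            # Keep only the newly generated tokens, not the echoed prompt.
            new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
            record["response"] = tokenizer.decode(new_tokens, skip_special_tokens=True)
            fout.write(json.dumps(record) + "\n")
```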
The `utils/` folder contains helper functions for model/dataset IO:
- `data_utils.py`: prompt dataset utilities
- `model_utils.py`: model loader
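For reference, LLM360 checkpoints load through the standard Hugging Face interface. A minimal sketch of what a loader like `model_utils.py` might wrap; the default arguments and the use of Hub revisions for intermediate checkpoints are assumptions here, not the module's confirmed API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_name: str = "LLM360/Amber", revision: str = "main"):
    """Load an LLM360 model and tokenizer from the Hugging Face Hub.

    Sketch only: model_utils.py may use different defaults or arguments.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        revision=revision,
        torch_dtype=torch.float16,  # half precision fits a 7B model on one A100 80G
        device_map="auto",
    )
    return model, tokenizer
```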
The BOLD prompts are stored in `./data/prompts/`. By default, the model generations are saved to `./{prompt_file_name}_with_responses.jsonl`, and the evaluation results are saved to `./{model_name}_results.jsonl`.
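Both outputs are plain JSONL, so they can be inspected with standard tooling. For example (the file name below assumes `model_name` is `Amber`, and the record schema depends on what the script actually writes):

```python
import json

# Read the evaluation results back; the exact fields in each record depend on
# the schema written by single_ckpt_bold_eval.py.
with open("./Amber_results.jsonl") as f:
    results = [json.loads(line) for line in f]
print(f"{len(results)} result records loaded")
```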
- Clone and enter the folder:

```bash
git clone https://github.com/LLM360/Analysis360.git
cd Analysis360/analysis/safety360/bold
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
An example walkthrough is provided in `demo.ipynb`, which can be executed on a single A100 80G GPU.