Tweak installation / usage sections of README (#55)
lewtun authored Feb 26, 2024
1 parent 090101f commit 480d85e
Showing 1 changed file (README.md) with 143 additions and 62 deletions.
# LightEval 🌤️
A lightweight framework for LLM evaluation

## Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove) and LLM training library [nanotron](https://github.com/huggingface/nanotron).
In case of problems or questions, feel free to open an issue!
## News
- **Feb 08, 2024**: Release of `lighteval`

## Installation

Clone the repo:

```bash
git clone https://github.com/huggingface/lighteval.git
cd lighteval
```

Create a virtual environment using virtualenv or conda depending on your preferences. We require Python 3.10 or above:

```bash
conda create -n lighteval python=3.10 && conda activate lighteval
```

Install the dependencies. For the default installation, you just need:

```bash
pip install .
```

If you want to evaluate models with frameworks like `accelerate` or `peft`, you will need to specify the optional dependencies group that fits your use case (`accelerate`, `tgi`, `optimum`, `quantization`, `adapters`, `nanotron`):

```bash
pip install '.[optional1,optional2]'
```

The setup we have tested most is:

```bash
pip install '.[accelerate,quantization,adapters]'
```

If you want to push your results to the Hugging Face Hub, don't forget to add your access token to the environment variable `HUGGING_FACE_HUB_TOKEN`. You can do this by running:

```shell
huggingface-cli login
```

and pasting your access token.
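
Alternatively, as a minimal sketch (assuming you prefer to set the variable directly rather than log in via the CLI), you can export the token in your shell; `<your token>` is a placeholder for a token with write access from your Hugging Face account settings:

```shell
# Hypothetical alternative to `huggingface-cli login`: export the token
# directly so that results can be pushed to the Hub.
export HUGGING_FACE_HUB_TOKEN=<your token>
```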

### Optional steps

- To load and push big models/datasets, your machine likely needs Git LFS. You can install it with `sudo apt-get install git-lfs`.
- If you want to run bigbench evaluations, install bigbench with `pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"`.
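
For convenience, both optional installs collected into a single shell snippet; the commands are the same as in the list above (the `apt-get` line assumes a Debian/Ubuntu system):

```shell
# Git LFS, needed to load and push large models/datasets
sudo apt-get install git-lfs
# BigBench, needed to run bigbench evaluations
pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"
```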

Lastly, if you intend to contribute to the code base, you'll need to install the pre-commit hook for style checks:

```bash
pip install pre-commit
pre-commit install
```

## Usage

We provide two main entry points to evaluate models:

* `run_evals_accelerate.py`: evaluate models on CPU or one or more GPUs using [🤗 Accelerate](https://github.com/huggingface/accelerate).
* `run_evals_nanotron.py`: evaluate models in distributed settings using [⚡️ Nanotron](https://github.com/huggingface/nanotron).

For most users, we recommend using the 🤗 Accelerate backend - see below for specific commands.

### Evaluate a model on one or more GPUs (recommended)

To evaluate a model on one or more GPUs, first create a `multi-gpu` config by running:

```shell
accelerate config
```

You can then evaluate a model using data parallelism as follows:

```shell
accelerate launch --multi_gpu --num_processes=<num_gpus> run_evals_accelerate.py \
--model_args="pretrained=<path to model on the hub>" \
--tasks <task parameters> \
--output_dir output_dir
```

Here, `--tasks` refers to either a _comma-separated_ list of supported tasks from the [metadata table](src/lighteval/tasks/tasks_table.jsonl) in the format:

```
suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
```

or a file path like [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt), which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark (suite `lighteval`, task `truthfulqa:mc`, zero few-shot examples, no automatic truncation), run:

```shell
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
--model_args "pretrained=gpt2" \
--tasks "lighteval|truthfulqa:mc|0|0" \
--override_batch_size 1 \
--output_dir="./evals/"
```

Here, `--override_batch_size` defines the _batch size per device_, so the effective batch size will be `override_batch_size x num_gpus`. To evaluate on multiple benchmarks, separate each task configuration with a comma, e.g.

```shell
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
--model_args "pretrained=gpt2" \
--tasks "lighteval|truthfulqa:mc|0|0,lighteval|gsm8k|0|0" \
--override_batch_size 1 \
--output_dir="./evals/"
```

See the [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt) file for a list of recommended task configurations.
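
For illustration, a hypothetical excerpt of such a task file, assuming it lists one `suite|task|num_few_shot|truncation` specification per line (the two entries shown are the task strings used in the examples above):

```
lighteval|truthfulqa:mc|0|0
lighteval|gsm8k|0|0
```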

### Evaluating a large model with pipeline parallelism

To evaluate models larger than ~40B parameters in 16-bit precision, you will need to shard the model across multiple GPUs to fit it in VRAM. You can do this by passing `model_parallel=True` and adapting `--num_processes` to be the number of processes to use for data parallelism. For example, on a single node of 8 GPUs you can run the commands below; in each case the number of pipeline-parallel stages (PP) times the number of data-parallel processes (DP) equals the 8 available GPUs:

```shell
# PP=2, DP=4 - good for models < 70B params
accelerate launch --multi_gpu --num_processes=4 run_evals_accelerate.py \
--model_args="pretrained=<path to model on the hub>" \
--model_parallel \
--tasks <task parameters> \
--output_dir output_dir

# PP=4, DP=2 - good for huge models >= 70B params
accelerate launch --multi_gpu --num_processes=2 run_evals_accelerate.py \
--model_args="pretrained=<path to model on the hub>" \
--model_parallel \
--tasks <task parameters> \
--output_dir output_dir
```

### Evaluate a model on the Open LLM Leaderboard benchmarks

To evaluate a model on all the benchmarks of the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) using a single node of 8 GPUs, run:

```shell
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
--model_args "pretrained=<model name>" \
--tasks tasks_examples/open_llm_leaderboard_tasks.txt \
--override_batch_size 1 \
--output_dir="./evals/"
```

### Evaluate a model on CPU

You can also use `lighteval` to evaluate models on CPU, although note this will typically be very slow for large models. To do so, run:

```shell
python run_evals_accelerate.py \
--model_args="pretrained=<path to model on the hub>"\
--tasks <task parameters> \
--output_dir output_dir
```

## Deep thanks
`lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.

As we added more and more logging functionality, made it compatible with increasingly diverse workflows and model codebases (including 3D parallelism), and allowed custom evaluation experiments, metrics and benchmarks, we ended up changing the code so deeply that `lighteval` became the small standalone library it is now.

However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.

## How to navigate this project
`lighteval` is meant to be used as a standalone evaluation library (an approximate repository layout, reconstructed from the paths referenced below, is sketched after this list).
- To run the evaluations, you can use `run_evals_accelerate.py` or `run_evals_nanotron.py`.
- [src/lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the library, divided into the following sections:
  - [main_accelerate.py](https://github.com/huggingface/lighteval/blob/main/src/lighteval/main_accelerate.py) and [main_nanotron.py](https://github.com/huggingface/lighteval/blob/main/src/lighteval/main_nanotron.py) are the entry points to run evaluations.
  - [logging](https://github.com/huggingface/lighteval/tree/main/src/lighteval/logging): our loggers, which display experiment information and push it to the hub after a run.
  - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): all the available metrics, divided between sample metrics (applied at the sample level, such as prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find the available normalisation functions there.
  - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): the models you can use. We cover transformers (base_model), with adapter or delta weights, as well as locally deployed TGI models (it's likely this code is out of date though) and brrr/nanotron models.
  - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): the available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`.
- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains lists of tasks you can launch. We advise using tasks in the `recommended_set`, as some of the other tasks may need double checking.
- [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, which we run on each PR to prevent regressions in metrics/prompts/tasks for a subset of important tasks.
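
A minimal sketch of that layout, reconstructed only from the paths referenced above (the exact file set in the repository may differ):

```
lighteval/
├── run_evals_accelerate.py
├── run_evals_nanotron.py
├── src/lighteval/
│   ├── main_accelerate.py
│   ├── main_nanotron.py
│   ├── logging/
│   ├── metrics/
│   ├── models/
│   └── tasks/
│       ├── tasks_table.jsonl
│       └── tasks_prompt_formatting.py
├── tasks_examples/
│   ├── recommended_set.txt
│   └── open_llm_leaderboard_tasks.txt
└── tests/
```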

## Customisation
### Adding a new metric
