feat: Add Weights and Biases support #1339

Merged: 57 commits from wandb-logging into main on Feb 22, 2024

Commits
81173a3
add wandb as extra dependency
ayulockin Jan 12, 2024
7ddc9b0
wandb metrics logging
ayulockin Jan 15, 2024
4e2b091
refactor
ayulockin Jan 25, 2024
0f3c921
log samples as tables
ayulockin Jan 29, 2024
c3919a4
fix linter
ayulockin Jan 29, 2024
9e9da94
refactor: put in a class
ayulockin Jan 29, 2024
a83c427
change dir
ayulockin Jan 30, 2024
d4e9f57
add panels
ayulockin Jan 30, 2024
f8ab85a
log eval as table
ayulockin Jan 30, 2024
5a0c227
improve tables logging
ayulockin Jan 31, 2024
5c4850a
improve reports logging
ayulockin Jan 31, 2024
6787691
Merge branch 'main' into wandb-logging
lintangsutawika Feb 2, 2024
98d49ca
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 3, 2024
bc14ada
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 6, 2024
21c3384
precommit run
ayulockin Feb 6, 2024
241e0f1
ruff check
ayulockin Feb 6, 2024
63c4009
handle importing reports api gracefully
ayulockin Feb 6, 2024
6caee7d
ruff
ayulockin Feb 6, 2024
229bc93
compare results
ayulockin Feb 7, 2024
e4c9444
minor pre-commit fixes
ayulockin Feb 7, 2024
fc4ceb0
build comparison report
ayulockin Feb 7, 2024
9426ded
ruff check
ayulockin Feb 7, 2024
25e430e
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 8, 2024
a3d71a9
log results as artifacts
ayulockin Feb 8, 2024
347d636
remove comparison script
ayulockin Feb 8, 2024
9519000
update dependency
ayulockin Feb 8, 2024
c94afb5
type annotate and docstring
ayulockin Feb 8, 2024
38e706c
add example
ayulockin Feb 8, 2024
21b48cb
update readme
ayulockin Feb 8, 2024
d87072e
fix typo
ayulockin Feb 8, 2024
f273176
teardown
ayulockin Feb 9, 2024
778751d
handle outside wandb run
ayulockin Feb 9, 2024
90793d0
gracefully fail reports creation
ayulockin Feb 9, 2024
5f7f49f
precommit checks
ayulockin Feb 9, 2024
55b238b
add report url to summary
ayulockin Feb 9, 2024
aa19bee
use wandb printer for better url stdout
ayulockin Feb 9, 2024
f12d050
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 12, 2024
57ee956
fix ruff
ayulockin Feb 12, 2024
9f72da7
handle N/A and groups
ayulockin Feb 15, 2024
9bcd8a8
fix eval table
ayulockin Feb 16, 2024
56177de
remove unused var
ayulockin Feb 16, 2024
71138f2
update wandb version req + disable reports stdout
ayulockin Feb 16, 2024
e1cf32b
remove reports feature to TODO
ayulockin Feb 16, 2024
fab8ba6
add label to multi-choice question data
ayulockin Feb 16, 2024
e288f98
log model predictions
ayulockin Feb 16, 2024
4f9f065
lints
ayulockin Feb 16, 2024
2542e47
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 20, 2024
53e2823
loglikelihood_rolling
ayulockin Feb 20, 2024
57c7bd8
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 21, 2024
a809450
log eval result for groups
ayulockin Feb 21, 2024
188be84
log tables by group for better handling
ayulockin Feb 21, 2024
06b22f1
precommit
ayulockin Feb 21, 2024
05994c3
choices column for multi-choice
ayulockin Feb 21, 2024
f24cc9d
graciously fail wandb
ayulockin Feb 21, 2024
89503de
Merge branch 'EleutherAI:main' into wandb-logging
ayulockin Feb 22, 2024
9279b05
remove reports feature
ayulockin Feb 22, 2024
40f2d19
track system metrics + total eval time + stdout
ayulockin Feb 22, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -16,3 +16,5 @@ temp
# IPython
profile_default/
ipython_config.py
wandb
examples/wandb
39 changes: 39 additions & 0 deletions README.md
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github

## Visualizing Results

You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.

### Zeno

You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.

First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).

### Weights and Biases

The [Weights and Biases](https://wandb.ai/site) (W&B) integration is designed to streamline the process of logging and visualizing experiment results, so you can spend more time extracting deeper insights from your evaluations.

The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` files if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task and CLI-specific configs,
- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.

First, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]` (quote the argument, e.g. `pip install "lm_eval[wandb]"`, if your shell expands square brackets).

Then, authenticate your machine with your unique W&B token: visit https://wandb.ai/authorize to get one, then run `wandb login` in your terminal.

Run the eval harness as usual, adding the `--wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string of `key=value` pairs; a parsing sketch follows the example below.

```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```
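
Under the hood, the harness turns this comma separated string into keyword arguments for `wandb.init`. The PR's actual parsing helper is not shown in this diff, so the snippet below is only a minimal sketch of the idea, and the `parse_wandb_args` name is hypothetical:

```python
import wandb


def parse_wandb_args(arg_string: str) -> dict:
    """Turn 'project=lm-eval,job_type=eval' into {'project': 'lm-eval', 'job_type': 'eval'}."""
    kwargs = {}
    for pair in arg_string.split(","):
        if pair:
            key, value = pair.split("=", 1)
            kwargs[key.strip()] = value.strip()
    return kwargs


# `--wandb_args project=lm-eval-harness-integration` is then roughly equivalent to:
run = wandb.init(**parse_wandb_args("project=lm-eval-harness-integration"))
```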

In the stdout, you will find a link to the W&B run page, as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).

## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
130 changes: 130 additions & 0 deletions examples/visualize-wandb.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
 Note that">
"> Note that if you are using the command line, you can simply authenticate your machine by running `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
23 changes: 23 additions & 0 deletions lm_eval/__main__.py
@@ -11,6 +11,7 @@
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table

@@ -167,6 +168,11 @@ def parse_eval_args() -> argparse.Namespace:
        metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
        help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
    )
    parser.add_argument(
        "--wandb_args",
        default="",
        help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval`",
    )
    parser.add_argument(
        "--predict_only",
        "-x",
@@ -195,6 +201,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        # we allow for args to be passed externally, else we parse them ourselves
        args = parse_eval_args()

    if args.wandb_args:
        wandb_logger = WandbLogger(args)

    eval_logger = utils.eval_logger
    eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
    eval_logger.info(f"Verbosity set to {args.verbosity}")
@@ -309,6 +318,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

        batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))

        # Add W&B logging
        if args.wandb_args:
            try:
                wandb_logger.post_init(results)
                wandb_logger.log_eval_result()
                if args.log_samples:
                    wandb_logger.log_eval_samples(samples)
            except Exception as e:
                eval_logger.info(f"Logging to Weights and Biases failed due to {e}")

        if args.output_path:
            output_path_file.open("w", encoding="utf-8").write(dumped)

@@ -334,6 +353,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if "groups" in results:
print(make_table(results, "groups"))

if args.wandb_args:
# Tear down wandb run once all the logging is done.
wandb_logger.run.finish()


if __name__ == "__main__":
    cli_evaluate()
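
For reference, the call sites in this diff imply the following `WandbLogger` lifecycle: construct it from the parsed CLI args, hand it the results dict via `post_init`, log aggregate metrics (and optionally samples), and finally call `run.finish()`. The stub below is only a sketch of that interface as implied by the diff; the method bodies are illustrative placeholders, not the PR's actual implementation in `lm_eval/logging_utils.py`:

```python
import wandb


class WandbLogger:
    """Sketch of the interface used by cli_evaluate (bodies are placeholders)."""

    def __init__(self, args):
        # Turn the comma separated --wandb_args string into wandb.init kwargs.
        init_kwargs = dict(pair.split("=", 1) for pair in args.wandb_args.split(","))
        self.run = wandb.init(**init_kwargs)

    def post_init(self, results: dict) -> None:
        # Receive the harness results dict once evaluation has finished.
        self.results = results

    def log_eval_result(self) -> None:
        # Log aggregate metrics to the run (wandb flattens nested dicts).
        self.run.log({"results": self.results.get("results", {})})

    def log_eval_samples(self, samples: dict) -> None:
        # Log per-task samples, e.g. as W&B Tables keyed by task name.
        for task_name, task_samples in samples.items():
            table = wandb.Table(columns=["sample"], data=[[str(s)] for s in task_samples])
            self.run.log({f"{task_name}_eval_samples": table})
```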