feat: Add Weights and Biases support #1339
Conversation
@ayulockin Thanks a ton for the overhaul! Would you be able to add a notebook to
Hey @StellaAthena, I will add a notebook today. Just finalising a few more features. Will fix failing CIs too.
Thanks very much! The support generally looks great to me; it would just be preferable to move the logging code out. We might consider putting this into a new module.
Hey @haileyschoelkopf, sounds good. I was about to ask for a decision on the namespace/scope. I will move the logging stuff there.
@ayulockin just let me know if this is all ready to review! I see there have been more changes so not sure if this is currently the final version yet?
Hey @haileyschoelkopf, this is still a work in progress. There are a few more improvements I wanna push in, likely by EOD. I will ping you once it's ready. Thanks 😊
No problem and no rush! Just wanted to make sure it wasn't blocked on me.
Looking forward to it!
Hey @haileyschoelkopf, after updating the branch (I also tested from the main branch) I am getting an assertion error. The command:
The error:
Thanks for reporting! I think one of the incoming PRs fixes this by rounding up the number of docs when a float is passed to limit; will see if we can't get that merged sooner.
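For illustration, the described rounding-up fix amounts to something like the following sketch (the helper name is hypothetical, not the actual patch in #1441):

import math

def resolve_limit(limit, num_docs):
    # A float limit is a fraction of the dataset: round up so that, e.g.,
    # limit=0.1 over 15 docs selects 2 docs rather than truncating to 1.
    if isinstance(limit, float) and limit < 1.0:
        return math.ceil(num_docs * limit)
    return int(limit)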
Hi @ayulockin, tested this out a bit and it looks really nice, thank you for the work on it!
Regarding this, #1441 should fix it!
These tables look very nice overall! One thing:
Approved, conditional on the minor updates to the tables!
Hey @haileyschoelkopf, thank you for the feedback. I am working on fixing the table formatting.
Hi @ayulockin, you should be able to get group membership via
I think this PR is good to go now? @ayulockin were there any other last-minute things you wanted to address?
Hey @haileyschoelkopf I am addressing one final bit. The PR will be ready in a couple of hrs.
Hey @haileyschoelkopf, one final nit that I tried and want to bring to your attention for transparency. Currently the wandb logic in the main.py file is confined to one block (L319-L328), as shown below:
I did this so that I don't spread the logic all over the file. This goes right after arguments are parsed and will initialise a W&B run.
And at the end of the script do:
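A rough sketch of the overall shape being described, assuming the WandbLogger class this PR adds in lm_eval/logging_utils.py (the exact signatures are illustrative, not copied from the diff):

import argparse
from lm_eval.logging_utils import WandbLogger

parser = argparse.ArgumentParser()
parser.add_argument("--wandb_args", default="")
args = parser.parse_args()

# Right after arguments are parsed: initialise a W&B run.
wandb_logger = WandbLogger(args) if args.wandb_args else None

# ... the rest of main() runs unchanged ...

# At the end of the script: finish the run so system metrics,
# total eval time, and stdout are captured for the full duration.
if wandb_logger is not None:
    wandb_logger.run.finish()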
The rest of the diff remains as before. The benefits of doing so:
Check out this run page: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/runs/jz3fuidc/system?workspace=user-ayush-thakur

If you are okay with it, I will make the commit now and we are good to go from my end. Otherwise, the PR is complete from my end.
This sounds good to me, thanks @ayulockin for all your work on this!
Thanks @haileyschoelkopf. The PR is complete from my end. :)
Unsure why the linter test job is failing; the checks all seem to pass when I run them locally!
filtered_resps = [
    np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
]
elif config["output_type"] == "loglikelihood_rolling":
Is it intended that the code in the two elifs (for generate_until and loglikelihood_rolling) is the same? If it is not a copy-paste mistake, then maybe this could be changed to a single check like config["output_type"] in {"generate_until", "loglikelihood_rolling"}?
Oh yeah, you are right; an oversight on my part. I am still working with this repo on a few projects, so I will incorporate the code trim in a separate PR. Thanks for catching this @LSinev :)
metrics[metric] = [x[metric] for x in data]

if config["output_type"] == "loglikelihood":
    instance = [x["arguments"][0][0] for x in data]
instance = [x["arguments"][0][0] for x in data]
this part seems to be the same in all if cases, is it really necessary to define the same inside conditional statement?
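In other words, the shared assignment could be hoisted above the branches, along these lines (a sketch; data and config come from the surrounding function):

# Compute the shared value once instead of repeating it in every branch.
instance = [x["arguments"][0][0] for x in data]
if config["output_type"] == "loglikelihood":
    # Only branch-specific handling remains inside the conditionals.
    ...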
* add wandb as extra dependency
* wandb metrics logging
* refactor
* log samples as tables
* fix linter
* refactor: put in a class
* change dir
* add panels
* log eval as table
* improve tables logging
* improve reports logging
* precommit run
* ruff check
* handle importing reports api gracefully
* ruff
* compare results
* minor pre-commit fixes
* build comparison report
* ruff check
* log results as artifacts
* remove comparison script
* update dependency
* type annotate and docstring
* add example
* update readme
* fix typo
* teardown
* handle outside wandb run
* gracefully fail reports creation
* precommit checks
* add report url to summary
* use wandb printer for better url stdout
* fix ruff
* handle N/A and groups
* fix eval table
* remove unused var
* update wandb version req + disable reports stdout
* remove reports feature to TODO
* add label to multi-choice question data
* log model predictions
* lints
* loglikelihood_rolling
* log eval result for groups
* log tables by group for better handling
* precommit
* choices column for multi-choice
* graciously fail wandb
* remove reports feature
* track system metrics + total eval time + stdout

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
In #359, @parambharat proposed adding support for W&B logging. However, it was done before the big refactor landed.
As a user of both lm-evaluation-harness and wandb, I have opened this PR to add support for W&B logging.
Functionalities
The integration provides the following functionalities:
* log the results.json file as an artifact for version control,
* log the <task_name>_eval_samples.json file if the samples are logged.

Installation:
pip install lm_eval[wandb]
Run Eval Harness:
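As an illustration (the project and task names are just examples), a run can be launched with the --wandb_args flag added by this PR, which forwards comma-separated key=value pairs to wandb.init:

lm_eval --model hf \
    --model_args pretrained=microsoft/phi-2 \
    --tasks hellaswag,mmlu_abstract_algebra \
    --device cuda:0 \
    --batch_size 8 \
    --output_path output/phi-2 \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples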
Example
Here's a W&B run page with lm-eval-harness run on hellaswag and mmlu_abstract_algebra tasks using the microsoft/phi-2 model.
Here's the automatically generated report: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/reports/-2024-02-09-12-16-01-wdp5ubxs-Evaluation-report--Vmlldzo2NjgzMDkz