feat: Add Weights and Biases support #1339
Conversation
@ayulockin Thanks a ton for the overhaul! Would you be able to add a notebook to
Hey @StellaAthena, I will add a notebook today. Just finalising a few more features. Will fix failing CIs too.
Thanks very much! The support generally looks great to me; it would just be preferable to move the logging code out. We might consider putting this into a new module.
Hey @haileyschoelkopf, sounds good. I was about to ask for a decision on the namespace/scope. I will move the logging stuff there.
@ayulockin just let me know if this is all ready to review! I see there have been more changes so not sure if this is currently the final version yet?
Hey @haileyschoelkopf, this is still a work in progress. There are a few more improvements I wanna push in, likely by EOD. I will ping you once it's ready. Thanks 😊
No problem and no rush! Just wanted to make sure it wasn't blocked on me.
Looking forward to it!
Hey @haileyschoelkopf, after updating the branch (I also tested from the main branch) I am getting an assertion error. The command:
The error:
Thanks for reporting! I think one of the incoming PRs fixes this by rounding up the number of docs when a float is passed to limit; will see if we can't get that merged sooner.
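For illustration, the described rounding-up fix amounts to something like the following sketch (the helper name is hypothetical, not the actual patch in #1441):

import math

def resolve_limit(limit, num_docs):
    # A float limit is a fraction of the dataset: round up so that, e.g.,
    # limit=0.1 over 15 docs selects 2 docs rather than truncating to 1.
    if isinstance(limit, float) and limit < 1.0:
        return math.ceil(num_docs * limit)
    return int(limit)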
Hi @ayulockin, tested this out a bit and it looks really nice, thank you for the work on it!
Regarding this, #1441 should fix it!
These tables look very nice overall! One thing:
Approved, conditional on the minor updates to the tables!
Hey @haileyschoelkopf, thank you for the feedback. I am working on fixing the table formatting.
Hi @ayulockin, you should be able to get group membership via
I think this PR is good to go now? @ayulockin were there any other last-minute things you wanted to address?
Hey @haileyschoelkopf I am addressing one final bit. The PR will be ready in a couple of hrs.
Hey @haileyschoelkopf, one final nit that I tried and want to bring to your attention for transparency. Currently the wandb logic in the main.py file is confined to one block (L319-L328), as shown below:
I did this so that I don't spread the logic all over the file. This goes right after arguments are parsed and will initialise a W&B run.
And at the end of the script do:
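A rough sketch of the overall shape being described, assuming the WandbLogger class this PR adds in lm_eval/logging_utils.py (the exact signatures are illustrative, not copied from the diff):

import argparse
from lm_eval.logging_utils import WandbLogger

parser = argparse.ArgumentParser()
parser.add_argument("--wandb_args", default="")
args = parser.parse_args()

# Right after arguments are parsed: initialise a W&B run.
wandb_logger = WandbLogger(args) if args.wandb_args else None

# ... the rest of main() runs unchanged ...

# At the end of the script: finish the run so system metrics,
# total eval time, and stdout are captured for the full duration.
if wandb_logger is not None:
    wandb_logger.run.finish()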
The rest of the diff remains as before. The benefits of doing so:
Check out this run page: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/runs/jz3fuidc/system?workspace=user-ayush-thakur

If you are okay with it, I will make the commit now and we are good to go from my end. Otherwise, the PR is complete from my end.
This sounds good to me, thanks @ayulockin for all your work on this!
Thanks @haileyschoelkopf. The PR is complete from my end. :)
Unsure why the linter test job is failing; the checks all seem to pass when I run them locally!
filtered_resps = [
    np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
]
elif config["output_type"] == "loglikelihood_rolling":
Is it intended that the code in the two elifs (for generate_until and loglikelihood_rolling) is the same? If it is not a copy-paste mistake, then maybe this could be changed to a single check like config["output_type"] in {"generate_until", "loglikelihood_rolling"}?
Oh yeah, you are right; an oversight on my part. I am still working with this repo on a few projects, so I will incorporate the code trim in a separate PR. Thanks for catching this @LSinev :)
metrics[metric] = [x[metric] for x in data]

if config["output_type"] == "loglikelihood":
    instance = [x["arguments"][0][0] for x in data]
instance = [x["arguments"][0][0] for x in data]
this part seems to be the same in all if cases, is it really necessary to define the same inside conditional statement?
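In other words, the shared assignment could be hoisted above the branches, along these lines (a sketch; data and config come from the surrounding function):

# Compute the shared value once instead of repeating it in every branch.
instance = [x["arguments"][0][0] for x in data]
if config["output_type"] == "loglikelihood":
    # Only branch-specific handling remains inside the conditionals.
    ...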
* add wandb as extra dependency
* wandb metrics logging
* refactor
* log samples as tables
* fix linter
* refactor: put in a class
* change dir
* add panels
* log eval as table
* improve tables logging
* improve reports logging
* precommit run
* ruff check
* handle importing reports api gracefully
* ruff
* compare results
* minor pre-commit fixes
* build comparison report
* ruff check
* log results as artifacts
* remove comparison script
* update dependency
* type annotate and docstring
* add example
* update readme
* fix typo
* teardown
* handle outside wandb run
* gracefully fail reports creation
* precommit checks
* add report url to summary
* use wandb printer for better url stdout
* fix ruff
* handle N/A and groups
* fix eval table
* remove unused var
* update wandb version req + disable reports stdout
* remove reports feature to TODO
* add label to multi-choice question data
* log model predictions
* lints
* loglikelihood_rolling
* log eval result for groups
* log tables by group for better handling
* precommit
* choices column for multi-choice
* graciously fail wandb
* remove reports feature
* track system metrics + total eval time + stdout

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
In #359, @parambharat proposed adding support for W&B logging. However, it was done before the big refactor landed.
As a user of both lm-evaluation-harness and wandb, I have opened this PR to add support for W&B logging.
Functionalities
The integration provides the following functionalities:
* log the results.json file as an artifact for version control,
* log the <task_name>_eval_samples.json file if the samples are logged.

Installation:
pip install lm_eval[wandb]
Run Eval Harness:
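As an illustration (the project and task names are just examples), a run can be launched with the --wandb_args flag added by this PR, which forwards comma-separated key=value pairs to wandb.init:

lm_eval --model hf \
    --model_args pretrained=microsoft/phi-2 \
    --tasks hellaswag,mmlu_abstract_algebra \
    --device cuda:0 \
    --batch_size 8 \
    --output_path output/phi-2 \
    --limit 10 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples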
Example
Here's a W&B run page with lm-eval-harness run on hellaswag and mmlu_abstract_algebra tasks using the microsoft/phi-2 model.
Here's the automatically generated report: https://wandb.ai/ayush-thakur/lm-eval-harness-integration/reports/-2024-02-09-12-16-01-wdp5ubxs-Evaluation-report--Vmlldzo2NjgzMDkz