Release BigCodeBench v0.2.0
Breaking Changes
- No more waiting! Evaluation now fully supports batch inference!
- No more environment configs! Code execution is handled by a remote API endpoint by default, and can be customized.
- No more multiple commands! `bigcodebench.evaluate` will be good enough to handle most cases.
What's Changed
- add multiprocessing support for sanitization step by @sk-g in #37
- Remove extra period in task BigCodeBench/16 by @hvaara in #38
- Await futures in progress checker by @hvaara in #48
- A few args have been added in this version, including `--direct_completion` and `--local_execute`. See Advanced Usage for the details.
Dataset maintenance
- The benchmark data has been bumped to `v0.1.2`. You can load the dataset with `from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")`
- `BigCodeBench/16`: removed period
- `BigCodeBench/37`: added pandas requirement
- `BigCodeBench/178`: removed `urlib` requirement
- `BigCodeBench/241`: added required plot title
- `BigCodeBench/267`: added required plot title
- `BigCodeBench/760`: changed the import of `datetime`
- `BigCodeBench/1006`: replaced test links due to potential connection blocking
New Contributors
Evaluated LLMs (139 models)
- o1-Preview-2024-09-12 (temperature=1)
- Gemini-1.5-Pro-002
- Llama-3.1 models
- DeepSeek-V2.5
- Qwen-2.5 models
- Qwen-2.5-Coder models
- and more
PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/
Full Changelog: v0.1.9...v0.2.0.post3