Release BigCodeBench v0.2.0
Breaking Changes
- No more waiting! Evaluation now fully supports batch inference!
- No more environment configs! Code execution is handled by a remote API endpoint by default, and can be customized.
- No more multiple commands! `bigcodebench.evaluate` will be good enough to handle most cases.
What's Changed
- add multiprocessing support for sanitization step by @sk-g in #37
- Remove extra period in task BigCodeBench/16 by @hvaara in #38
- Await futures in progress checker by @hvaara in #48
- A few args have been added in this version, including `--direct_completion` and `--local_execute`. See Advanced Usage for the details.
Dataset maintenance
- The benchmark data has been bumped to `v0.1.2`. You can load the dataset with `from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")`
- `BigCodeBench/16`: removed period
- `BigCodeBench/37`: added pandas requirement
- `BigCodeBench/178`: removed `urlib` requirement
- `BigCodeBench/241`: added required plot title
- `BigCodeBench/267`: added required plot title
- `BigCodeBench/760`: changed the import of `datetime`
- `BigCodeBench/1006`: replaced test links due to potential connection blocking
New Contributors
Evaluated LLMs (139 models)
- o1-Preview-2024-09-12 (temperature=1)
- Gemini-1.5-Pro-002
- Llama-3.1 models
- DeepSeek-V2.5
- Qwen-2.5 models
- Qwen-2.5-Coder models
- and more
PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/
Full Changelog: v0.1.9...v0.2.0.post3