Release BigCodeBench v0.2.0

@terryyz terryyz released this 06 Oct 08:28
· 62 commits to main since this release

Breaking Change

  • No more waiting! Evaluation now fully supports batch inference!
  • No more environment configs! Code execution runs on a remote API endpoint by default, and the endpoint can be customized.
  • No more multiple commands! bigcodebench.evaluate is enough to handle most cases.
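A minimal sketch of the new single-command flow. Only --local_execute (and --direct_completion) are named in these notes; the --model flag and model name below are illustrative assumptions, not taken from this release:

```shell
# Hypothetical invocation of the unified entry point; the --model flag and
# its value are illustrative assumptions. By default, code execution happens
# on the remote API endpoint; --local_execute (see Advanced Usage) would opt
# into running the tests locally instead.
bigcodebench.evaluate --model meta-llama/Llama-3.1-8B-Instruct --local_execute
```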

What's Changed

  • add multiprocessing support for sanitization step by @sk-g in #37
  • Remove extra period in task BigCodeBench/16 by @hvaara in #38
  • Await futures in progress checker by @hvaara in #48
  • A few new arguments have been added in this version, including --direct_completion and --local_execute. See Advanced Usage for the details.

Dataset maintenance

  • The benchmark data has been bumped to v0.1.2. You can load the dataset with from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")
  • BigCodeBench/16: removed period
  • BigCodeBench/37: added pandas requirement
  • BigCodeBench/178: removed urllib requirement
  • BigCodeBench/241: added required plot title
  • BigCodeBench/267: added required plot title
  • BigCodeBench/760: changed the import of datetime
  • BigCodeBench/1006: replaced test links due to the potential connection block
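The actual diff for BigCodeBench/760 is not shown in these notes; the sketch below only illustrates why a datetime import change matters, since the two common import styles expose different names even though they refer to the same class:

```python
# Illustrative only: the two datetime import styles a change like the one to
# BigCodeBench/760 typically swaps between. They are not interchangeable in
# task code, because each style binds a different name.
import datetime                       # module import: use datetime.datetime
from datetime import datetime as dt   # class import: use dt directly

# Both names resolve to the same class, so constructed values compare equal.
stamp_a = datetime.datetime(2024, 10, 6, 8, 28)
stamp_b = dt(2024, 10, 6, 8, 28)
assert stamp_a == stamp_b
```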

New Contributors

Evaluated LLMs (139 models)

  • o1-Preview-2024-09-12 (temperature=1)
  • Gemini-1.5-Pro-002
  • Llama-3.1 models
  • DeepSeek-V2.5
  • Qwen-2.5 models
  • Qwen-2.5-Coder models
  • and more

PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/

Full Changelog: v0.1.9...v0.2.0.post3