
SWE-bench #935

Open · 1 task

ShellLM opened this issue Nov 4, 2024 · 1 comment
Labels

dataset (public datasets and embeddings) · llm-benchmarks (testing and benchmarking large language models) · python (Python code, tools, info) · software-engineering (best practice for software engineering)

Comments

ShellLM (Collaborator) commented Nov 4, 2024

SWE-bench Lite

A Canonical Subset for Efficient Evaluation of Language Models as Software Engineers

Carlos E. Jimenez, John Yang, Jiayi Geng
March 19, 2024

SWE-bench

SWE-bench was designed to provide a diverse set of codebase problems that are verifiable with in-repo unit tests. The full SWE-bench test split comprises 2,294 issue-commit pairs across 12 Python repositories.

Since its release, we've found that for most systems evaluating on SWE-bench, running each instance takes substantial time and compute. We've also found that SWE-bench can be a particularly difficult benchmark: useful for evaluating LMs over the long term, but discouraging for systems trying to make progress in the short term.

To remedy these issues, we've released a canonical subset of SWE-bench called SWE-bench Lite. SWE-bench Lite comprises 300 instances from SWE-bench, sampled to be more self-contained, with a focus on evaluating functional bug fixes. It covers 11 of the original 12 repositories in SWE-bench, with a similar diversity and distribution of repositories as the original. We perform similar filtering on the SWE-bench dev set to provide 23 development instances that can be useful for active development on the SWE-bench task.

We recommend that future systems evaluating on SWE-bench report numbers on SWE-bench Lite in lieu of the full SWE-bench set if necessary. You can find the source code for how SWE-bench Lite was created in SWE-bench/swebench/collect/make_lite.

Here's a list of the general criteria we used to select SWE-bench Lite instances (a rough sketch of these filters appears after this list):

  • We remove instances with images, external hyperlinks, references to specific commit SHAs, and references to other pull requests or issues.
  • We remove instances that have fewer than 40 words in the problem statement.
  • We remove instances that edit more than one file.
  • We remove instances where the gold patch has more than 3 edit hunks (see patch).
  • We remove instances that create or remove files.
  • We remove instances that contain tests with error message checks.
  • Finally, we sample 300 test instances and 23 development instances from the remaining instances.
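
As a rough illustration only (not the actual make_lite implementation), the criteria above could be approximated by a filter like the one below. It assumes the SWE-bench instance schema, where `problem_statement` holds the issue text and `patch` holds the gold patch as a unified diff; the error-message-check filter on tests is omitted here since it requires deeper inspection of the test patch.

```python
import re

def count_files(patch: str) -> int:
    # Each edited file introduces one "diff --git" header in a git diff.
    return len(re.findall(r"^diff --git", patch, flags=re.MULTILINE))

def count_hunks(patch: str) -> int:
    # Each hunk in a unified diff starts with an "@@ ... @@" header line.
    return len(re.findall(r"^@@", patch, flags=re.MULTILINE))

def is_lite_candidate(instance: dict) -> bool:
    """Approximate the SWE-bench Lite selection criteria for one instance."""
    text = instance["problem_statement"]
    patch = instance["patch"]

    # No images, hyperlinks, commit SHAs, or issue/PR cross-references.
    if re.search(r"https?://|\.png|\.jpe?g|\b[0-9a-f]{40}\b|#\d+", text):
        return False
    # At least 40 words in the problem statement.
    if len(text.split()) < 40:
        return False
    # Gold patch edits exactly one file, with at most 3 hunks.
    if count_files(patch) > 1 or count_hunks(patch) > 3:
        return False
    # No file creations or deletions in the gold patch.
    if "new file mode" in patch or "deleted file mode" in patch:
        return False
    return True
```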

You can download SWE-bench Lite and its baselines from Hugging Face Datasets:

🤗 SWE-bench Lite
🤗 "Oracle" Retrieval Lite
🤗 BM25 Retrieval 13K Lite
🤗 BM25 Retrieval 27K Lite
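
For example, the Lite test split can be loaded with the `datasets` library as sketched below. The `princeton-nlp/SWE-bench_Lite` dataset ID is an assumption based on where SWE-bench is hosted; check the Hugging Face links above for the exact IDs.

```python
from datasets import load_dataset

# Load the 300-instance test split of SWE-bench Lite.
# Dataset ID assumed from the Hugging Face links above; verify on the hub page.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

print(len(lite))  # expected: 300
example = lite[0]
print(example["instance_id"])           # IDs look like "<org>__<repo>-<pr_number>"
print(example["problem_statement"][:200])
```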

[Figure: SWE-bench Lite distribution across repositories. Compare to the full SWE-bench in Figure 3 of the SWE-bench paper.]

[Figure: SWE-bench Lite performance for our baselines. Compare to the full SWE-bench baseline performance in Table 5 of the SWE-bench paper.]

Suggested labels: None

ShellLM (Collaborator, Author) commented Nov 4, 2024

Related content

#933 similarity score: 0.93
#908 similarity score: 0.91
#758 similarity score: 0.91
#915 similarity score: 0.87
#812 similarity score: 0.85
#749 similarity score: 0.84
