
Reproducing swebench-lite results #25

justinchiu-cohere opened this issue Aug 19, 2024 · 5 comments

@justinchiu-cohere

Thanks for releasing the repo, as well as the trajectories for swebench-lite! I am trying to reproduce the results with gpt-4o, but I am seeing a fix rate of 59/300 (~19.7%), as opposed to the 27.33% reported.

  1. Other than the --plausible flag in rerank, are there any other possible causes for this?
  2. Did you notice a large amount of variance between runs?
  3. I changed the prompts slightly, adding a sentence before # Examples to clarify that we are giving output examples. Could this lead to large changes in the resolution rate?
@brutalsavage
Contributor

Hi Justin,

We have run Agentless multiple times ourselves, and while the results have some variance, they should not drop as low as 59/300. I would expect a range of roughly 70/300 to 80/300 (even without the --plausible flag). As a reference, OpenAI recently ran our configuration and got 24.3%, and it seems they only generated 1 sample per bug.

Please check that your configuration is correct; you can refer to the README_swebenchlite.md file to fully recover our experimental settings.

Thanks

@justinchiu-cohere
Author

justinchiu-cohere commented Aug 21, 2024

I tried a fresh clone and ran the commands in README_swebenchlite.md again. However, after the repair step, wc -l results/repair_run_1/output.jsonl reports 284 lines, as opposed to the 300 expected from the v0.1.0 release. Oddly, the same is true for results/repair_run_2/output.jsonl.

I'll debug a bit more, try evaluating the locations, and also try feeding the locations from the v0.1.0 release into the repair step instead. Could OpenAI API outages cause some of the prompts to fail midway but be labeled as completed, and thus not be run again on a subsequent repair call?
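
In case it helps with debugging that last question, here is a rough sanity check along the lines of what I plan to run. It assumes each line of output.jsonl is a JSON object with an instance_id field plus some output/patch field; the raw_output / model_patch names below are guesses, so they may need adjusting to the actual schema:

```python
import json
from datasets import load_dataset

# All 300 instance ids in the SWE-bench Lite test split.
all_ids = {ex["instance_id"] for ex in load_dataset("princeton-nlp/SWE-bench_Lite", split="test")}

done_ids, empty = set(), []
with open("results/repair_run_1/output.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        done_ids.add(rec["instance_id"])
        # Flag records that were written but carry no usable model output,
        # e.g. an API outage that returned an empty completion.
        # NOTE: "raw_output" / "model_patch" are guessed field names.
        if not rec.get("raw_output") and not rec.get("model_patch"):
            empty.append(rec["instance_id"])

missing = sorted(all_ids - done_ids)
print(f"{len(missing)} instances missing entirely: {missing}")
print(f"{len(empty)} records present but empty: {empty}")
```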

@GCVulnerability

I noticed you added a lot of new models like DeepSeek and gpt-4o-mini; do you have reference evaluation results for these models? Similar to Justin, I can't seem to fully reproduce the reported performance, but the high cost prevents me from running gpt-4o multiple times.

@hanzhou032

I am also very curious about the gpt-4o-mini results. Any follow-up updates on that?

@workworkwc3

workworkwc3 commented Dec 12, 2024

I am also having trouble reproducing the swebench-lite results with my initial run. I plan to keep working through any possible issues or oversights on my end.

The swebench README has a lot of information and many commands. Some of them (the "additional repair commands") are hidden in a collapsed section that must be expanded. Because of this repair phase alone, simply copying and pasting the commands from the README does not seem sufficient to replicate the swebench-lite results: as I understand it, repairs are generated from 4 sets of edit locations, yet the README only contains commands for running the selected regression tests on the first set of repairs/edits.

I expect that at least some of my replication issues come from my unfamiliarity with, and limited understanding of, each of the many steps, and therefore from not running the correct commands. I also run into errors in the scripts themselves; I wonder if others have the same issue.

Providing a bash script with the exact commands that produced your leaderboard results would help users like me with reproduction. Is this possible? Also, if anyone in the community can share their own script here or in a PR, that would be greatly appreciated.
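
In the meantime, a rough driver skeleton like the one below is what I have in mind. It is not the repo's official script, and the command strings and run-folder names are placeholders to be filled in with the exact invocations from README_swebenchlite.md:

```python
import subprocess

NUM_LOCATION_SETS = 4  # repairs are generated from 4 sets of edit locations

def run(cmd: str) -> None:
    """Echo and execute a shell command, failing fast so partial runs are visible."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

for i in range(1, NUM_LOCATION_SETS + 1):
    # Placeholder: substitute the real repair command for location set i
    # (e.g. writing into a hypothetical results/repair_run_{i} folder).
    run(f"echo 'repair command for location set {i} goes here'")

# Placeholder: regression-test selection and rerank commands from the README go here.
run("echo 'regression-test selection and rerank commands go here'")
```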
