
Reproducing swebench-lite results #25

justinchiu-cohere opened this issue Aug 19, 2024 · 5 comments

@justinchiu-cohere

Thanks for releasing the repo, as well as the trajectories for swebench-lite! I am trying to reproduce the results with gpt-4o, but I am seeing a fix rate of 59/300 (~19.7%), as opposed to the 27.33% reported.

  1. Other than the --plausible flag in rerank, are there any other possible causes for this?
  2. Did you notice a large amount of variance between runs?
  3. I changed the prompts slightly, adding a sentence before # Examples to clarify that we are giving output examples. Could this lead to large changes in the resolution rate?
@brutalsavage
Contributor

Hi Justin,

We have run Agentless multiple times ourselves, and while the results have some variance, they should not drop as low as 59/300. I would expect a range of roughly 70/300 to 80/300 (even without the --plausible flag). As a reference, OpenAI recently ran our configuration and got 24.3%, and it seems they only generated 1 sample per bug.

Please check that your configuration is correct; you can refer to the README_swebenchlite.md file to fully recover our experimental settings.

Thanks

@justinchiu-cohere
Author

justinchiu-cohere commented Aug 21, 2024

I tried a fresh clone and ran the commands in README_swebenchlite.md again. However, after the repair step, wc -l results/repair_run_1/output.jsonl reports 284 lines, as opposed to the 300 expected from the v0.1.0 release. Oddly, the same is true for results/repair_run_2/output.jsonl.

I'll debug a bit more, try evaluating the locations, and also try feeding the locations from the v0.1.0 release into the repair step instead. Could OpenAI API outages cause some of the prompts to fail midway but be labeled as completed, and thus not be run again on a subsequent repair call?
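
In case it helps with debugging that last question, here is a rough sanity check along the lines of what I plan to run. It assumes each line of output.jsonl is a JSON object with an instance_id field plus some output/patch field; the raw_output / model_patch names below are guesses, so they may need adjusting to the actual schema:

```python
import json
from datasets import load_dataset

# All 300 instance ids in the SWE-bench Lite test split.
all_ids = {ex["instance_id"] for ex in load_dataset("princeton-nlp/SWE-bench_Lite", split="test")}

done_ids, empty = set(), []
with open("results/repair_run_1/output.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        done_ids.add(rec["instance_id"])
        # Flag records that were written but carry no usable model output,
        # e.g. an API outage that returned an empty completion.
        # NOTE: "raw_output" / "model_patch" are guessed field names.
        if not rec.get("raw_output") and not rec.get("model_patch"):
            empty.append(rec["instance_id"])

missing = sorted(all_ids - done_ids)
print(f"{len(missing)} instances missing entirely: {missing}")
print(f"{len(empty)} records present but empty: {empty}")
```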

@GCVulnerability

I noticed you added a lot of new models like DeepSeek and gpt-4o-mini; do you have reference evaluation results for these models? Similar to Justin, I can't seem to fully reproduce the reported performance, but the high cost prevents me from running gpt-4o multiple times.

@hanzhou032

I am also very curious about the gpt-4o-mini results. Any follow-up updates on that?

@workworkwc3

workworkwc3 commented Dec 12, 2024

I am also having trouble reproducing the swebench-lite results with my initial run. I plan to keep working through any possible issues or oversights on my end.

The swebench README has a lot of information and many commands. Some of them (the "additional repair commands") are hidden in a collapsed section that must be expanded. Because of this repair phase alone, simply copying and pasting the commands from the README does not seem sufficient to replicate the swebench-lite results: as I understand it, repairs are generated from 4 sets of edit locations, yet the README only contains commands for running the selected regression tests on the first set of repairs/edits.

I expect that at least some of my replication issues come from my unfamiliarity with, and limited understanding of, each of the many steps, and therefore from not running the correct commands. I also run into errors in the scripts themselves; I wonder if others have the same issue.

Providing a bash script with the exact commands that produced your leaderboard results would help users like me with reproduction. Is this possible? Also, if anyone in the community can share their own script here or in a PR, that would be greatly appreciated.
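
In the meantime, a rough driver skeleton like the one below is what I have in mind. It is not the repo's official script, and the command strings and run-folder names are placeholders to be filled in with the exact invocations from README_swebenchlite.md:

```python
import subprocess

NUM_LOCATION_SETS = 4  # repairs are generated from 4 sets of edit locations

def run(cmd: str) -> None:
    """Echo and execute a shell command, failing fast so partial runs are visible."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

for i in range(1, NUM_LOCATION_SETS + 1):
    # Placeholder: substitute the real repair command for location set i
    # (e.g. writing into a hypothetical results/repair_run_{i} folder).
    run(f"echo 'repair command for location set {i} goes here'")

# Placeholder: regression-test selection and rerank commands from the README go here.
run("echo 'regression-test selection and rerank commands go here'")
```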
