Reproduction metrics mismatch with SWE-Bench-Lite submission: got 34.67% vs. claimed 40.67% #47

HejiaZ2023 · 2025-01-09T03:08:06Z

Hi Agentless developers!
We tried to reproduce your Agentless-1.5 + Claude-3.5 Sonnet (20241022) submission on SWE-Bench-Lite, but we only got a resolved rate of 34.67% (104/300) instead of the published 40.67% (122/300). Here is our evaluation report:

{
    "total_instances": 300,
    "submitted_instances": 300,
    "completed_instances": 298,
    "resolved_instances": 104,
    "unresolved_instances": 194,
    "empty_patch_instances": 2,
    "error_instances": 0,
    "unstopped_instances": 0,
......
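
For reference, here is a minimal sketch of how the two percentages follow from the counts. It assumes the rate is computed over all 300 instances (`total_instances`) rather than the 298 completed ones; the constant and function names are only illustrative and are not part of the Agentless or SWE-bench code.

```python
# Counts from our report above and from the published figure (122/300).
TOTAL = 300             # total_instances
OURS_RESOLVED = 104     # resolved_instances in our run
CLAIMED_RESOLVED = 122  # resolved count behind the published 40.67%


def resolved_rate(resolved: int, total: int) -> float:
    """Resolved rate as a percentage of all instances."""
    return 100.0 * resolved / total


ours = resolved_rate(OURS_RESOLVED, TOTAL)
claimed = resolved_rate(CLAIMED_RESOLVED, TOTAL)
gap_instances = CLAIMED_RESOLVED - OURS_RESOLVED

print(f"ours:    {ours:.2f}%")
print(f"claimed: {claimed:.2f}%")
print(f"gap:     {claimed - ours:.2f} points ({gap_instances} instances)")
```

So the difference is 18 resolved instances, or 6 percentage points.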

Have you also seen this much variance (±3%) between runs?
Could it be caused by a possibly outdated reproduction script (which I mentioned in "Request to Update README_swebench.md to Reflect Latest Claude Support Commit" #44)? I am confused because you published a large commit (59dc4b7) with several new features, but not a single line of the reproduction instructions in README_swebench.md was updated.

Thanks!
