The results do not match those in the paper. #54

lycfight · 2025-01-26T05:13:35Z

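My evaluation results do not match those in the paper. Below is the run on the predictions I generated (all_preds.jsonl), followed by the run on the predictions provided in the official repository (agentless_swebench_verified/all_preds.jsonl).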

(agentless) root@cpu02-2050-SWE-bench:~/Agentless# python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path all_preds.jsonl \
    --max_workers 500 \
    --run_id agentless_0126

<frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
Running 496 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Found 496 existing instance images. Will reuse them.
Running 496 instances...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 496/496 [04:52<00:00, 1.70it/s]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 125
Instances incomplete: 0
Instances resolved: 41
Instances unresolved: 84
Instances with empty patches: 4
Instances with errors: 371
Unstopped containers: 0
Unremoved images: 500
Report written to agentless.agentless_0126.json
(agentless) root@cpu02-2050-SWE-bench:/Agentless#
(agentless) root@cpu02-2050-SWE-bench:/Agentless# python -m swebench.harness.run_evaluation --dataset_name princeton-nlp/SWE-bench_Verified --predictions_path agentless_swebench_verified/all_preds.jsonl --max_workers 500 --run_id agentless_golden_0126
<frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
Running 496 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Found 496 existing instance images. Will reuse them.
Running 496 instances...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 496/496 [04:52<00:00, 1.69it/s]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 126
Instances incomplete: 0
Instances resolved: 48
Instances unresolved: 78
Instances with empty patches: 4
Instances with errors: 370
Unstopped containers: 0
Unremoved images: 500
Report written to agentless.agentless_golden_0126.json
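
Both runs end with roughly 370 "Instances with errors", so it may help to look inside the written report files before comparing resolve counts. The following is a minimal sketch, assuming only that the report is a JSON object and that some of its fields are lists (for example, lists of instance IDs); the field names are not assumed, and `list_field_sizes` is a hypothetical helper, not part of swebench.

```python
import json
from pathlib import Path


def list_field_sizes(report_path):
    """Return {field_name: number_of_entries} for every list-valued field.

    Assumption: the report file is a single JSON object. No specific
    field names are assumed; whatever list fields exist are counted.
    """
    report = json.loads(Path(report_path).read_text())
    return {key: len(value) for key, value in report.items()
            if isinstance(value, list)}


if __name__ == "__main__":
    # File name taken from the log line "Report written to ...".
    path = Path("agentless.agentless_golden_0126.json")
    if path.exists():
        for field, size in sorted(list_field_sizes(path).items()):
            print(f"{field}: {size}")
```

If one of the list fields turns out to hold the errored instance IDs, those IDs are the ones to check in the per-instance evaluation logs.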

The paper, by contrast, reports 194 resolved instances (38.80%).

How was this 194 (38.80%) calculated?
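
For reference, 194 out of 500 is exactly 38.80%. The sketch below parses the summary lines from the second run above and computes the resolve rate two ways: over all 500 instances, and over only the instances that completed. The helper names `parse_summary` and `resolve_rates` are hypothetical; the counts are copied from the log.

```python
import re

# Summary lines copied from the run on the official repository's predictions.
SUMMARY = """\
Total instances: 500
Instances submitted: 500
Instances completed: 126
Instances resolved: 48
Instances unresolved: 78
Instances with errors: 370
"""


def parse_summary(text):
    """Parse 'Label: N' lines into a {label: int} dictionary."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def resolve_rates(counts):
    """Return percentages: resolved/total, resolved/completed, errors/total."""
    total = counts["Total instances"]
    completed = counts["Instances completed"]
    resolved = counts["Instances resolved"]
    errors = counts["Instances with errors"]
    return {
        "resolved_of_total": 100 * resolved / total,
        "resolved_of_completed": 100 * resolved / completed if completed else 0.0,
        "errors_of_total": 100 * errors / total,
    }


if __name__ == "__main__":
    for name, value in resolve_rates(parse_summary(SUMMARY)).items():
        print(f"{name}: {value:.2f}%")
```

With these counts, 48 resolved is 9.60% of 500 but 38.10% of the 126 completed instances, and 74.00% of instances ended in an error. The gap to the paper's number is therefore dominated by instances that never finished evaluation, not by instances that ran and failed.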

francistotle · 2025-02-03T14:34:58Z

Was this ever resolved? Were you able to debug the performance gap?
