The results do not match those in the paper. #54

Open
lycfight opened this issue Jan 26, 2025 · 1 comment

Comments

@lycfight

The results do not match those in the paper. Below are the results I obtained with my own predictions, followed by the results I obtained with the predictions provided in the official repository.

(agentless) root@cpu02-2050-SWE-bench:~/Agentless# python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path all_preds.jsonl \
    --max_workers 500 \
    --run_id agentless_0126

<frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
Running 496 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Found 496 existing instance images. Will reuse them.
Running 496 instances...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 496/496 [04:52<00:00, 1.70it/s]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 125
Instances incomplete: 0
Instances resolved: 41
Instances unresolved: 84
Instances with empty patches: 4
Instances with errors: 371
Unstopped containers: 0
Unremoved images: 500
Report written to agentless.agentless_0126.json
(agentless) root@cpu02-2050-SWE-bench:~/Agentless#
(agentless) root@cpu02-2050-SWE-bench:~/Agentless# python -m swebench.harness.run_evaluation --dataset_name princeton-nlp/SWE-bench_Verified --predictions_path agentless_swebench_verified/all_preds.jsonl --max_workers 500 --run_id agentless_golden_0126
<frozen runpy>:128: RuntimeWarning: 'swebench.harness.run_evaluation' found in sys.modules after import of package 'swebench.harness', but prior to execution of 'swebench.harness.run_evaluation'; this may result in unpredictable behaviour
Running 496 unevaluated instances...
Base image sweb.base.x86_64:latest already exists, skipping build.
Base images built successfully.
No environment images need to be built.
Found 496 existing instance images. Will reuse them.
Running 496 instances...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 496/496 [04:52<00:00, 1.69it/s]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 500
Instances submitted: 500
Instances completed: 126
Instances incomplete: 0
Instances resolved: 48
Instances unresolved: 78
Instances with empty patches: 4
Instances with errors: 370
Unstopped containers: 0
Unremoved images: 500
Report written to agentless.agentless_golden_0126.json
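
For reference, the counts in each summary can be cross-checked with a few lines of Python. This is only a sketch: `parse_summary` is a hypothetical helper written for this issue, not part of swebench, and it reads nothing but the `Label: N` lines printed above. For the first run, completed + empty patches + errors adds up to the 500 submitted instances, which suggests that most of the gap sits in the instances that errored rather than in the ones that were evaluated.

```python
import re

# Summary lines copied from the first run in this issue.
SUMMARY = """\
Total instances: 500
Instances submitted: 500
Instances completed: 125
Instances incomplete: 0
Instances resolved: 41
Instances unresolved: 84
Instances with empty patches: 4
Instances with errors: 371
"""


def parse_summary(text):
    """Hypothetical helper: collect 'Label: <int>' lines into a dict."""
    counts = {}
    for line in text.splitlines():
        match = re.fullmatch(r"([A-Za-z ]+):\s*(\d+)", line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_summary(SUMMARY)

# Every submitted instance lands in one of these three buckets in this run.
accounted = (
    counts["Instances completed"]
    + counts["Instances with empty patches"]
    + counts["Instances with errors"]
)
error_share = 100.0 * counts["Instances with errors"] / counts["Instances submitted"]

print("accounted for:", accounted, "of", counts["Instances submitted"])
print(f"share of submitted instances that errored: {error_share:.1f}%")
```

The same check on the second run gives 126 + 4 + 370 = 500.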

This is the result from the paper.

[Image: results reported in the paper]

How was the 194 (38.80%) figure calculated?
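
To make the comparison concrete, here is the arithmetic as a small sketch. The only thing certain is that 194 / 500 = 38.80%, so the paper's figure is consistent with dividing by all 500 instances; whether that is how the authors computed it is my assumption. The `resolve_rate` helper and the run labels are hypothetical names used for illustration.

```python
def resolve_rate(resolved, denominator):
    """Percentage of instances resolved out of the chosen denominator."""
    return 100.0 * resolved / denominator


# (resolved, total submitted, completed); the paper row has no completed count here.
runs = {
    "paper": (194, 500, None),
    "my predictions": (41, 500, 125),
    "official predictions": (48, 500, 126),
}

rates = {}
for name, (resolved, total, completed) in runs.items():
    over_total = resolve_rate(resolved, total)
    rates[name] = over_total
    line = f"{name}: {resolved}/{total} = {over_total:.2f}%"
    if completed is not None:
        # Rate restricted to instances that actually finished evaluation.
        line += f", {resolved}/{completed} completed = {resolve_rate(resolved, completed):.2f}%"
    print(line)
```

Measured over all 500 instances my two runs come out at 8.20% and 9.60%, but measured over completed instances only they are 32.80% and 38.10%, which is much closer to the paper.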

@francistotle

Was this ever resolved? Were you able to debug the performance gap?
