[eval] add evaluation workflow #4489

Open: wants to merge 72 commits into base: main

Conversation

@xingyaoww xingyaoww commented Oct 18, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Add nightly evaluation workflow to monitor the quality of OpenHands during development


Give a summary of what the PR does, explaining any non-trivial design decisions

  • Add nightly evaluation workflow to monitor the quality of OpenHands during development
  • Fix minor issues with parsing DeepSeek output for the new editing action.

Link of any specific issues this addresses

@xingyaoww xingyaoww changed the title [eval] add nightly swe bench evaluation workflow [eval] add nightly evaluation workflow Oct 18, 2024
@xingyaoww xingyaoww changed the title [eval] add nightly evaluation workflow [eval] add evaluation workflow Oct 19, 2024

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

Evaluation outputs:

----------------------------------------------------------------------------------------------------
# of resolved: 2 / 10 (20.00%)
# of empty patch: 3 / 10 (30.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 3 / 10 (30.00%)
----------------------------------------------------------------------------------------------------
Detailed error breakdown:
Agent got stuck in a loop: 3 (30.00%)

You can download the full evaluation outputs here.
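
For anyone wanting to reproduce the summary lines above from raw results, here is a minimal sketch. It assumes a hypothetical per-instance record (`InstanceResult` and its field names are illustrative, not taken from the OpenHands codebase) and only covers the counting and percentage formatting.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    # Hypothetical per-instance record; field names are assumptions.
    resolved: bool
    empty_patch: bool
    error: bool
    stuck_in_loop: bool


def format_count(label: str, count: int, total: int) -> str:
    """Format one summary line, e.g. '# of resolved: 2 / 10 (20.00%)'."""
    pct = 100.0 * count / total if total else 0.0
    return f"# of {label}: {count} / {total} ({pct:.2f}%)"


def summarize(results: list) -> list:
    """Return the four summary lines for a list of InstanceResult records."""
    total = len(results)
    return [
        format_count("resolved", sum(r.resolved for r in results), total),
        format_count("empty patch", sum(r.empty_patch for r in results), total),
        format_count("error lines", sum(r.error for r in results), total),
        format_count("loop", sum(r.stuck_in_loop for r in results), total),
    ]
```

Note that the categories are counted independently, so they do not need to add up to the number of instances.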

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

Triggered by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report


Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

| instance_id | success | reason |
|---|---|---|
| t03_jupyter_write_file | True | |
| t01_fix_simple_typo | True | |
| t05_simple_browsing | True | |
| t04_git_staging | True | |
| t02_add_bash_hello | False | Failed to execute /workspace/hello.sh: hello<br>/workspace/hello.sh: line 3: /file_edit: No such file or directory<br>/workspace/hello.sh: line 4: /file_edit: No such file or directory. |

You can download the full evaluation outputs here.
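
As a rough illustration of how the "Success rate" line above relates to the per-test table, here is a small sketch. The row layout `(instance_id, success, reason)` mirrors the table columns; the function name `success_rate_line` is made up for this example and is not claimed to exist in the workflow.

```python
def success_rate_line(rows):
    """Build the 'Success rate' line from (instance_id, success, reason) rows."""
    total = len(rows)
    passed = sum(1 for _, success, _ in rows if success)
    pct = 100.0 * passed / total if total else 0.0
    return f"Success rate: {pct:.2f}% ({passed}/{total})"


# Rows taken from the integration test table above (reason shortened).
rows = [
    ("t03_jupyter_write_file", True, ""),
    ("t01_fix_simple_typo", True, ""),
    ("t05_simple_browsing", True, ""),
    ("t04_git_staging", True, ""),
    ("t02_add_bash_hello", False, "Failed to execute /workspace/hello.sh"),
]
print(success_rate_line(rows))
```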

@xingyaoww xingyaoww marked this pull request as ready for review October 23, 2024 02:18

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

@xingyaoww xingyaoww marked this pull request as draft October 23, 2024 03:17

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

Triggered by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report

# of resolved: 0 / 10 (0.00%)
# of empty patch: 1 / 10 (10.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 6 / 10 (60.00%)

Avg. num of turns per instance: 31.90
Avg. agent cost per instance: 0.11 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.11 USD

Detailed error breakdown:
Agent got stuck in a loop: 6 (60.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)


Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

| instance_id | success | reason |
|---|---|---|
| t03_jupyter_write_file | True | |
| t01_fix_simple_typo | True | |
| t04_git_staging | True | |
| t05_simple_browsing | True | |
| t02_add_bash_hello | False | Failed to execute /workspace/hello.sh: hello<br>/workspace/hello.sh: line 3: /file_edit: No such file or directory<br>/workspace/hello.sh: line 4: /file_edit: No such file or directory. |

You can download the full evaluation outputs here.
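
The "Avg." statistics and the "Detailed error breakdown" in the report above can be derived from per-instance data roughly as follows. This is only a sketch under assumptions: the helper names `error_breakdown` and `average_line` are hypothetical, and the percentages are taken relative to all evaluated instances, which matches the report (6 of 10 instances gives 60.00%).

```python
from collections import Counter


def error_breakdown(errors, total):
    """errors: one message per affected instance; total: all evaluated instances."""
    lines = []
    # most_common() lists the most frequent error message first.
    for message, count in Counter(errors).most_common():
        pct = 100.0 * count / total if total else 0.0
        lines.append(f"{message}: {count} ({pct:.2f}%)")
    return lines


def average_line(label, values, unit=""):
    """Format a per-instance average, e.g. 'Avg. agent cost per instance: 0.11 USD'."""
    avg = sum(values) / len(values) if values else 0.0
    suffix = f" {unit}" if unit else ""
    return f"Avg. {label} per instance: {avg:.2f}{suffix}"
```

Since one instance can carry at most one error message here, the breakdown percentages may sum to less than 100% when some instances finished without an error.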

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

Triggered by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report

# of resolved: 0 / 10 (0.00%)
# of empty patch: 0 / 10 (0.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 7 / 10 (70.00%)

Avg. num of turns per instance: 28.90
Avg. agent cost per instance: 0.10 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.10 USD

Detailed error breakdown:
Agent got stuck in a loop: 7 (70.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)


Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

| instance_id | success | reason |
|---|---|---|
| t04_git_staging | True | |
| t03_jupyter_write_file | True | |
| t01_fix_simple_typo | True | |
| t02_add_bash_hello | True | |
| t05_simple_browsing | False | The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)] |

You can download the full evaluation outputs here.

Triggered by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report
Number of resolved: 1 / 10 (10.00%)
Number of empty patch: 0 / 10 (0.00%)
Number of error lines: 0 / 10 (0.00%)
Number of agent stuck in loop: 8 / 10 (80.00%)

Statistics

Avg. num of turns per instance: 33.30
Avg. agent cost per instance: 0.11 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.11 USD

Detailed error breakdown:

Agent got stuck in a loop: 8 (80.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)


Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

| instance_id | success | reason |
|---|---|---|
| t03_jupyter_write_file | True | |
| t02_add_bash_hello | True | |
| t04_git_staging | True | |
| t05_simple_browsing | False | The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)] |
| t01_fix_simple_typo | True | |

You can download the full evaluation outputs here.

@xingyaoww xingyaoww marked this pull request as ready for review October 23, 2024 13:33