[eval] add evaluation workflow #4489

xingyaoww · 2024-10-18T20:49:06Z

End-user friendly description of the problem this fixes or functionality that this introduces

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Add nightly evaluation workflow to monitor the quality of OpenHands during development

Give a summary of what the PR does, explaining any non-trivial design decisions

Add nightly evaluation workflow to monitor the quality of OpenHands during development
Fix some minor issues in parsing DeepSeek output for the new editing action.

Link of any specific issues this addresses

…workflow

github-actions · 2024-10-19T23:23:37Z

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

github-actions · 2024-10-19T23:48:40Z

Evaluation outputs:

----------------------------------------------------------------------------------------------------
# of resolved: 2 / 10 (20.00%)
# of empty patch: 3 / 10 (30.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 3 / 10 (30.00%)
----------------------------------------------------------------------------------------------------
Detailed error breakdown:
Agent got stuck in a loop: 3 (30.00%)

You can download the full evaluation outputs here.
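
For anyone who wants to check the numbers in a report like the one above, here is a minimal sketch that parses the `# of ...: N / M (P%)` lines and verifies that each percentage matches its count. It is not the summarize script used by this workflow; `SUMMARY_LINE` and `parse_summary` are hypothetical names made up for illustration, and the line format is assumed from the comment above.

```python
import re

# Assumed line format, taken from the bot comment: "# of resolved: 2 / 10 (20.00%)".
SUMMARY_LINE = re.compile(
    r"^# of (?P<name>.+?): (?P<count>\d+) / (?P<total>\d+) \((?P<pct>\d+\.\d+)%\)$"
)


def parse_summary(text: str) -> dict[str, tuple[int, int]]:
    """Map each metric name to (count, total); hypothetical helper, not the repo's code."""
    results: dict[str, tuple[int, int]] = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue  # skip separators and free-text lines
        count, total = int(match["count"]), int(match["total"])
        reported = float(match["pct"])
        # Reject the line if the reported percentage disagrees with count / total.
        if total == 0 or abs(100 * count / total - reported) > 0.01:
            raise ValueError(f"inconsistent summary line: {line!r}")
        results[match["name"]] = (count, total)
    return results


report = """\
# of resolved: 2 / 10 (20.00%)
# of empty patch: 3 / 10 (30.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 3 / 10 (30.00%)
"""
summary = parse_summary(report)
for name, (count, total) in summary.items():
    print(f"{name}: {count}/{total}")
```

The metrics are reported independently of one another, so the sketch does not assume the counts add up to the total.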

github-actions · 2024-10-23T02:07:07Z

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

github-actions · 2024-10-23T02:10:46Z

Trigger by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report

Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

instance_id	success	reason
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t05_simple_browsing	True
t04_git_staging	True
t02_add_bash_hello	False	Failed to execute /workspace/hello.sh: hello
		/workspace/hello.sh: line 3: /file_edit: No such file or directory
		/workspace/hello.sh: line 4: /file_edit: No such file or directory.

You can download the full evaluation outputs here.
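
The success rate in the integration test report can be recomputed from the table itself. The following is a rough sketch, assuming the table is tab-separated and that a row with an empty `instance_id` continues the `reason` of the row before it (as in the `t02_add_bash_hello` failure above). `parse_results` is a hypothetical helper, not part of the workflow in this PR.

```python
def parse_results(tsv: str) -> list[dict[str, str]]:
    """Parse the tab-separated report; hypothetical helper for illustration."""
    lines = [ln for ln in tsv.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    rows: list[dict[str, str]] = []
    for line in lines[1:]:
        fields = line.split("\t")
        # Passing rows omit the reason column, so pad to the header width.
        fields += [""] * (len(header) - len(fields))
        if not fields[0] and rows:
            # Assumed: an empty instance_id means a continued reason line.
            rows[-1]["reason"] += "\n" + fields[2]
        else:
            rows.append(dict(zip(header, fields)))
    return rows


table = "\n".join([
    "instance_id\tsuccess\treason",
    "t03_jupyter_write_file\tTrue",
    "t01_fix_simple_typo\tTrue",
    "t05_simple_browsing\tTrue",
    "t04_git_staging\tTrue",
    "t02_add_bash_hello\tFalse\tFailed to execute /workspace/hello.sh: hello",
    "\t\t/workspace/hello.sh: line 3: /file_edit: No such file or directory",
    "\t\t/workspace/hello.sh: line 4: /file_edit: No such file or directory.",
])

rows = parse_results(table)
passed = sum(row["success"] == "True" for row in rows)
rate = 100 * passed / len(rows)
print(f"Success rate: {rate:.2f}% ({passed}/{len(rows)})")
```

Folding continuation lines into the previous row keeps the instance count at five, so the multi-line error message is not miscounted as extra failed instances.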

github-actions · 2024-10-23T02:21:00Z

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

github-actions · 2024-10-23T03:19:50Z

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

github-actions · 2024-10-23T03:32:31Z

Trigger by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report

# of resolved: 0 / 10 (0.00%)
# of empty patch: 1 / 10 (10.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 6 / 10 (60.00%)

Avg. num of turns per instance: 31.90
Avg. agent cost per instance: 0.11 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.11 USD

Detailed error breakdown:
Agent got stuck in a loop: 6 (60.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)

Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

instance_id	success	reason
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t04_git_staging	True
t05_simple_browsing	True
t02_add_bash_hello	False	Failed to execute /workspace/hello.sh: hello
		/workspace/hello.sh: line 3: /file_edit: No such file or directory
		/workspace/hello.sh: line 4: /file_edit: No such file or directory.

You can download the full evaluation outputs here.

github-actions · 2024-10-23T03:40:34Z

Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.

github-actions · 2024-10-23T04:32:20Z

Trigger by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report

# of resolved: 0 / 10 (0.00%)
# of empty patch: 0 / 10 (0.00%)
# of error lines: 0 / 10 (0.00%)
# of loop: 7 / 10 (70.00%)

Avg. num of turns per instance: 28.90
Avg. agent cost per instance: 0.10 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.10 USD

Detailed error breakdown:
Agent got stuck in a loop: 7 (70.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)

Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

instance_id	success	reason
t04_git_staging	True
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t02_add_bash_hello	True
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)]

You can download the full evaluation outputs here.

github-actions · 2024-10-23T04:48:54Z

Trigger by: Pull Request (eval-this label on PR #4489)
SWE-Bench Evaluation Report
Number of resolved: 1 / 10 (10.00%)
Number of empty patch: 0 / 10 (0.00%)
Number of error lines: 0 / 10 (0.00%)
Number of agent stuck in loop: 8 / 10 (80.00%)

Statistics

Avg. num of turns per instance: 33.30
Avg. agent cost per instance: 0.11 USD
Avg. editor cost per instance: 0.00 USD
Avg. total cost per instance: 0.11 USD

Detailed error breakdown:

Agent got stuck in a loop: 8 (80.00%)
Agent reached maximum iteration in headless mode, task stopped. Current iteration: 50.00, max iteration: 50.00: 1 (10.00%)

Integration Tests Evaluation Report
Success rate: 80.00% (4/5)

instance_id	success	reason
t03_jupyter_write_file	True
t02_add_bash_hello	True
t04_git_staging	True
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)]
t01_fix_simple_typo	True

You can download the full evaluation outputs here.

openhands-agent and others added 8 commits October 18, 2024 19:04

Add SWE-Bench evaluation workflow

b02bb82

Update SWE-Bench evaluation workflow with correct Poetry setup

0527535

Add Google Cloud Storage upload step to SWE-Bench evaluation workflow

28b14b9

tweak dispatch

1727479

support remote runtime for integration tests

54134db

comment out & debug

c78afdc

fix out-dated workflow

09428f6

tweak key

c85301e

xingyaoww mentioned this pull request Oct 18, 2024

[eval] add nightly swe bench evaluation workflow #4488

Closed

1 task

xingyaoww changed the title ~~[eval] add nightly swe bench evaluation workflow~~ [eval] add nightly evaluation workflow Oct 18, 2024

xingyaoww added 5 commits October 18, 2024 21:06

debug: add swebench to run

1e0fee4

fix eval infer remote

04e8ee2

try fix eval runner

8182326

support eval-this tag and comment the result

0826c55

Merge commit '5cc16cb82a94dfdff519dd8d74c363e7caa94ec4' into xw/eval-…

62a3738

…workflow

xingyaoww changed the title ~~[eval] add nightly evaluation workflow~~ [eval] add evaluation workflow Oct 19, 2024

xingyaoww added the eval-this label Oct 19, 2024

try fix comment

409dcb0

xingyaoww added eval-this and removed eval-this labels Oct 19, 2024

fix comment

9985ca0

xingyaoww added eval-this and removed eval-this labels Oct 19, 2024

add permission to write pr

b82bef0

xingyaoww added eval-this and removed eval-this labels Oct 19, 2024

remove label

60fc822

xingyaoww removed the eval-this label Oct 20, 2024

xingyaoww added the eval-this label Oct 23, 2024

re-enable swebench eval

9f9fd9a

xingyaoww marked this pull request as ready for review October 23, 2024 02:18

also add action parser fix

83217e3

xingyaoww added eval-this and removed eval-this labels Oct 23, 2024

stop run swebench to debug

b1ee458

xingyaoww marked this pull request as draft October 23, 2024 03:17

xingyaoww added 2 commits October 22, 2024 23:18

fix dpsk

81ed1f3

get swebench eval back

39357eb

xingyaoww added eval-this and removed eval-this labels Oct 23, 2024

update summarize output to display better in github comment

8eb2f06

xingyaoww added eval-this and removed eval-this labels Oct 23, 2024

run on full swebench during CI

2f0985d

xingyaoww marked this pull request as ready for review October 23, 2024 13:33

xingyaoww added 2 commits October 23, 2024 09:34

remove end ---

36e6f6c

add commit info

9ab72c4

xingyaoww requested review from neubig, enyst and rbren October 23, 2024 13:45
