[eval] add evaluation workflow #4489
base: main
Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.
Evaluation outputs:
You can download the full evaluation outputs here.
Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.
Trigger by: Pull Request (eval-this label on PR #4489)
Integration Tests Evaluation Report
You can download the full evaluation outputs here.
Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.
Trigger by: Pull Request (eval-this label on PR #4489)
| instance_id | success | reason |
|---|---|---|
| t03_jupyter_write_file | True | |
| t01_fix_simple_typo | True | |
| t04_git_staging | True | |
| t05_simple_browsing | True | |
| t02_add_bash_hello | False | Failed to execute /workspace/hello.sh: hello<br>/workspace/hello.sh: line 3: /file_edit: No such file or directory<br>/workspace/hello.sh: line 4: /file_edit: No such file or directory. |
You can download the full evaluation outputs here.
Hi! I started running the evaluation on your PR. You will receive a comment with the results shortly.
Trigger by: Pull Request (eval-this label on PR #4489)
| instance_id | success | reason |
|---|---|---|
| t04_git_staging | True | |
| t03_jupyter_write_file | True | |
| t01_fix_simple_typo | True | |
| t02_add_bash_hello | True | |
| t05_simple_browsing | False | The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)] |
You can download the full evaluation outputs here.
Trigger by: Pull Request (eval-this label on PR #4489)
Statistics
Avg. num of turns per instance: 33.30
Detailed error breakdown:
Agent got stuck in a loop: 8 (80.00%)
| instance_id | success | reason |
|---|---|---|
| t03_jupyter_write_file | True | |
| t02_add_bash_hello | True | |
| t04_git_staging | True | |
| t05_simple_browsing | False | The answer is not found in any message. Total messages: 1. Messages: [MessageAction(content='Browse localhost:8000, and tell me the ultimate answer to life.', images_urls=None, wait_for_response=False, action='message', security_risk=None)] |
| t01_fix_simple_typo | True | |
You can download the full evaluation outputs here.
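For anyone comparing runs by hand, the report tables above can be summarized with a few lines of Python. This is only a sketch, not part of the PR: `parse_report` and `REPORT` are hypothetical names, the sample rows are copied from the last report with the long failure reason abbreviated, and it assumes the `reason` cell itself contains no `|` characters.

```python
# Hypothetical helper (not part of the PR): summarize a pipe-delimited
# evaluation report such as the ones posted in this conversation.

# Rows copied from the last report above; the failure reason is abbreviated.
REPORT = """
instance_id | success | reason |
---|---|---|
t03_jupyter_write_file | True | |
t02_add_bash_hello | True | |
t04_git_staging | True | |
t05_simple_browsing | False | The answer is not found in any message. |
t01_fix_simple_typo | True |
"""


def parse_report(table: str) -> dict[str, bool]:
    """Map instance_id -> success for every data row in the table."""
    results: dict[str, bool] = {}
    for line in table.strip().splitlines():
        # Drop optional outer pipes, then split the row into cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header, the separator row, and any continuation lines.
        if len(cells) < 2 or cells[1] not in ("True", "False"):
            continue
        results[cells[0]] = cells[1] == "True"
    return results


if __name__ == "__main__":
    results = parse_report(REPORT)
    passed = sum(results.values())
    failed = [name for name, ok in results.items() if not ok]
    print(f"{passed}/{len(results)} instances passed")
    print("failed:", ", ".join(failed) or "none")
```

Running it on each report makes it easier to see that the failing instance moved from `t02_add_bash_hello` to `t05_simple_browsing` between runs.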
End-user friendly description of the problem this fixes or functionality that this introduces
Add a nightly evaluation workflow to monitor the quality of OpenHands during development.
Give a summary of what the PR does, explaining any non-trivial design decisions
Link of any specific issues this addresses