
feat(workflow): Implement a simplified CoAct workflow #3770

Open

wants to merge 53 commits into main
Conversation

ryanhoangt (Contributor) commented Sep 7, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

  • This PR implements a simplified multi-agent workflow inspired by the CoAct paper.
  • Currently, in the swe-bench eval, there are complex instances that OpenHands fails, especially ones where a single CodeActAgent overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.
  • An overkill-ish successful trajectory with replanning can be found here.
  • A task which CoActPlannerAgent finished but CodeActAgent failed on (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

  • Modify CodeAct to make it accept delegated tasks.
  • Implement 2 new agents, a planner and an executor, with the same abilities as CodeAct but different system prompts and additional action parsers (a minimal sketch follows this list).
  • Nit: adjust the delegate message shown in the UI.
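A rough, self-contained sketch of the kind of planner-side action parser meant here, assuming a regex over the <execute_global_plan> tag seen in the trajectories; the AgentDelegateAction stand-in below is simplified and is not the actual OpenHands class:

import re
from dataclasses import dataclass, field


@dataclass
class AgentDelegateAction:
    # simplified stand-in for the framework's delegate action
    agent: str
    inputs: dict = field(default_factory=dict)


GLOBAL_PLAN_PATTERN = re.compile(
    r'<execute_global_plan>(.*?)</execute_global_plan>', re.DOTALL
)


def parse_planner_response(llm_response: str) -> AgentDelegateAction | None:
    """If the planner emitted a global plan, delegate it to the executor."""
    match = GLOBAL_PLAN_PATTERN.search(llm_response)
    if match is None:
        return None  # fall back to the regular CodeAct action parsers
    plan = match.group(1).strip()
    return AgentDelegateAction(agent='CoActExecutorAgent', inputs={'task': plan})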

Some next steps to improve this may be:

  • Try eval on some swe-bench-lite instances.
  • Adjust the system/user prompts and few-shot examples to further specialize the two agents so they can cooperate to finish a swe-bench issue. Also define the structure of the plan (e.g., its components).
  • Use a meta prompt to reinforce the agents' actions, to make sure they follow the workflow.
  • Experiment with the ability for the global planner to refuse a replan request from the executor.
  • Implement the ability for the delegated agent (e.g., BrowsingAgent or CoActExecutorAgent) to persist its history across the multi-turn conversation.

Link of any specific issues this addresses

#3077

tobitege (Collaborator) commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

ryanhoangt (Contributor, Author) commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

Thanks, I'm still waiting for reviews on it; if it's good to go I will look into the tests.

neubig self-requested a review on September 8, 2024

neubig (Contributor) left a comment


Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

ryanhoangt (Contributor, Author)

Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

Thanks Prof., I'll do that and update on how it goes soon.

tobitege (Collaborator)

It might be in the paper(s), but I don't quite like that the prompts now talk of an agent, while everywhere else it is assistant. 🤔

enyst (Collaborator) commented Sep 16, 2024

@ryanhoangt @ketan1741
I'm debugging this a bit, and I don't yet have all the answers we need, but just so you know:

  • The PR is merged. Please see here for delegate trajectories on a simple test: HuggingFace or the attached CoActPlannerAgent.zip

  • IMHO django__django-10914 looks like a fluke. It's funny... the agent has to find FILE_UPLOAD_PERMISSIONS; in the task, FILE_UPLOAD_PERMISSION is used interchangeably with FILE_UPLOAD_PERMISSIONS, and the agent read the right file, but 1) on main it concluded that it has FILE_UPLOAD_PERMISSIONS and modifies it, 2) on the branch it concluded it cannot find FILE_UPLOAD_PERMISSION, so it has to add it and fails the task.
    In the example test above it succeeded (repeatedly) on the PR branch. 🤷

  • In part, we are making CodeAct a delegate and it's not yet ready for that. 😅 For example, one thing that happens is that it uses get_user_message() here, which has to return a MessageAction. But a delegate doesn't have any MessageAction, and the method has no protection for delegates (it doesn't stop at start_id or AgentDelegateAction), so it hijacks a MessageAction from the parent agent. This shouldn't happen (fix coming up; a simplified sketch of the kind of guard needed follows this list).
    It doesn't influence anything else, but there might be other issues of this kind...?
    FWIW other delegates use something like this.
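For readers following along, a minimal sketch of the kind of protection meant here, using hypothetical, simplified event types rather than the real get_user_message() implementation:

from dataclasses import dataclass


@dataclass
class Event:
    id: int
    source: str  # 'user' or 'agent'


@dataclass
class MessageAction(Event):
    content: str


def get_latest_user_message(history: list[Event], delegate_start_id: int) -> MessageAction | None:
    """Walk the history backwards, but never past the event that started this
    delegate, so a delegate cannot hijack a MessageAction from its parent."""
    for event in reversed(history):
        if event.id < delegate_start_id:
            break  # earlier events belong to the parent agent
        if isinstance(event, MessageAction) and event.source == 'user':
            return event
    return None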

ryanhoangt (Contributor, Author) commented Sep 17, 2024

@enyst thanks for the quick fix and the insightful comments.

Re the django__django-10914 instance, it's interesting to know there are some unclear specs here; I don't know which behavior I should advocate for 😅. But I think there isn't much we can do; it has more to do with the LLM than with the agent implementation.

we are making CodeAct a delegate and it's not yet ready for it.

Yeah I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform.

Tbh this is not the version I initially expected (i.e., using the grounding test case to help the agent resolve the issue), nor one I'm entirely happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help. Should I close this PR for now?


Hi @jatinganhotra, regarding the full evaluation, I think it will likely lower the score further, as the remaining instances are not included in swe-bench-verified. IMO it would be more beneficial to run the full eval after validating that the agent gets good scores on this "verified" subset.

On All Hands' Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.

enyst (Collaborator) commented Sep 17, 2024

Tbh this is not the version I initially expected, nor one I'm entirely happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help.

I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).

Should I close this PR for now?

Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing things around what happens. It is quite amazing that it solves 6 more than main, but for some reason we don't know yet, it breaks on 9. FWIW the 2 tests I worked with were from those 9 that pass on main but failed on the branch, and they passed every time.

Two minor thoughts:

  • the PR sends the new actions serialized as JSON, but CodeAct doesn't usually do that; it sends strings, as in a dialogue with the LLM. It may be less confusing to the LLM if we are consistent and add the new ones around here in a similar format, extracting the useful info (see the sketch after this list). Small note: Claude doesn't do as well with JSON as GPT-4o in my experience, not because it can't, but it seems to need some prompting or gets confused at times.
  • what do you think: what if we give Claude the prompt pieces and ask it to rephrase them for clarity, precision, and all the info needed?
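To illustrate the "similar format, extracting the useful info" idea, a hypothetical helper (not code from the PR) that renders a global plan as dialogue text instead of the default JSON serialization, using the phase fields (description, reason, expected_state) that appear in the trajectories:

def plan_to_dialogue_text(phases: list[dict]) -> str:
    """Render the planner's phases as plain dialogue text for the LLM,
    instead of serializing the whole action to JSON."""
    lines = ['I propose the following global plan:']
    for i, phase in enumerate(phases, start=1):
        lines.append(f'Phase {i}: {phase["description"]}')
        if phase.get('reason'):
            lines.append(f'  Reason: {phase["reason"]}')
        if phase.get('expected_state'):
            lines.append(f'  Expected state: {phase["expected_state"]}')
    return '\n'.join(lines)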

One other thing that may help in merging this PR is if we can keep the changes to the default agent to a minimum, or show that they don't affect its performance. Just my thought, not sure how others think about it?

ryanhoangt (Contributor, Author) commented Sep 17, 2024

Sounds good, I can do the debugging in the next few days and try running a new eval to obtain the trajectory of the executor. In the previous eval, bugs such as the missing </execute...> tag in the stop sequences also negatively affected performance, as they made the planner execute the plan itself. Also happy to hear other ideas on improving this!


Also, just to clarify a bit: the 6/9 diff above is for instances at the medium (15m-1h) level. There's also a diagram for instances at the easy (<15m) level (the 2 instances you ran are in the red subset):
(Venn diagram of resolved easy instance IDs)

I looked into the trajectories at this easy level; one observation that makes CoAct fail on many instances is that the planner very often came up with plans that had more steps than needed (often the plan contained 3 phases but only 1 of them was actually required to resolve the issue).


Some questions:

the PR sends the new actions as serialized in json

Can you clarify this? I'm not sure I did any JSON serialization in the PR...

One other thing that may help merging this PR is if we can keep to a minimum the changes to the default agent

I think there are no changes to CodeAct execution in this PR, except the message property of AgentDelegateAction (which CodeAct should not use in swe-bench). Can you point out the changes you're concerned about?

what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?

Can you give an example of this?

enyst (Collaborator) commented Sep 17, 2024

Can you clarify this? I'm not sure I did any json serialization in the PR...

🤔 I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it falls back to the default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above. I'm looking now at the first of those 2, and step 9 is like:

Action
{
"id":20
"timestamp":"2024-09-16T19:46:29.172006"
"source":"agent"
"message":"The new test method has been added successfully. Now all three phases of the task are complete. Let's summarize what we've done:

1. Modified the FILE_UPLOAD_PERMISSIONS setting in global_settings.py to 0o644.
....

enyst (Collaborator) commented Sep 17, 2024

I'm attaching a couple of files that look very strange to me. They're a few of the prompts sent to the LLM during django__django-11964.
prompts.zip

prompt_039.py is the first prompt to the Planner after the Executor returns with success. It has an observation using JSON. The last lines:

...
</expected_content>
</execute_global_plan>

----------

OBSERVATION:
{'content': "Thank you for the update. The changes for all three phases have been successfully implemented. Let's review the modifications:\n\n1. We added the `__str__` method to the `TextChoices` class.\n2. We added the `__str__` method to the `IntegerChoices` class.\n3. We added the `__str__` method to the base `Choices` class.\n\nThese changes should address the issue of differing types for TextChoices and IntegerChoices fields. The `__str__` method in each class now ensures that the string representation of the enum value is returned.\n\nTo summarize the changes made:\n\n1. In the `Choices` base class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n2. In the `IntegerChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n3. In the `TextChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\nThese modifications should resolve the issue of differing types for TextChoices and IntegerChoices fields. The value returned by the getter of the field will now be of the same type as the one created by enum.Enum (enum value).\n\nThe task is now complete."}


ENVIRONMENT REMINDER: You have 21 turns left to complete the task. When finished reply with <finish></finish>.

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt:

prompt_040.py

You are an autonomous intelligent programming agent playing the role of a subordinate employee...
...
NOW, LET'S START!
The user message is: "Now, let's come up with 2 global plans sequentially.
- First, examine the codebase and locate the relevant code for the issue. Then we'll come up with the FIRST detailed plan with all the edits to resolve it.
- After the local agent finishes executing the first plan, navigate the codebase again and come up with the SECOND detailed plan to create exactly ONE unit test at the correct location to verify the change has actually resolved the issue. As the LAST phase, explicitly tell the executor to execute it after creating that test case. If the test failed and after debugging it the local executor believes the previous fixes are incorrect, request for a new plan and include the error with explanation for that request."

# Phases
## Phase 1
- description: Create a new test file for enum-related tests.
 ...
## Phase 3
- description: Execute the newly created test case.
- reason: We need to verify that our changes have resolved the issue and that the test passes.
- expected_state: The test case is executed and passes successfully.


ENVIRONMENT REMINDER: You have 20 turns left to complete the task. When finished reply with <finish></finish>.

Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

ryanhoangt (Contributor, Author) commented Sep 18, 2024

I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:

Re the JSON in the visualizer, it seems to be because we don't format the finish action yet.

prompt_039.log - It has an observation using JSON.

Good catch, this seems to be another bug. Might be because this action is not handled properly:

return AgentFinishAction(thought=thought, outputs={'content': thought})
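One possible way to handle it, sketched here as a hypothetical helper (not a fix that exists in the PR): unpack the delegate's outputs into plain text before building the observation, rather than letting the raw dict be serialized into the prompt.

def delegate_outputs_to_text(outputs: dict) -> str:
    # unpack the delegate's outputs into plain text for the observation,
    # instead of letting the {'content': ...} dict leak into the prompt as-is
    if 'content' in outputs:
        return str(outputs['content'])
    return '\n'.join(f'{key}: {value}' for key, value in outputs.items())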

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt

Yeah, I also noticed this issue. My intention was to make the Planner include the full user message (hence the full problem statement in swe-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." snippet as you saw, which is problematic.

I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to CodeActSWEAgent below, it can be used to steer the agent a bit to be better at a specific task, but I'm not sure the current "2 global plans" is the optimal way to go. In CodeActAgent there're many cases where the agent just fixed the issues without creating any tests.

if agent_class == 'CodeActSWEAgent':
    instruction = (
        'We are currently solving the following issue within our repository. Here is the issue text:\n'
        '--- BEGIN ISSUE ---\n'
        f'{instance.problem_statement}\n'
        '--- END ISSUE ---\n\n'
    )
    if USE_HINT_TEXT and instance.hints_text:
        instruction += (
            f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
        )
    instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issues discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
else:
    # Testing general agents
    instruction = (
enyst (Collaborator) commented Sep 18, 2024

I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
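A sketch of that idea, with hypothetical names, assuming the harness already has the original problem statement in hand when the delegation happens:

def build_executor_task(original_user_task: str, plan: str) -> str:
    # compose the delegate's task directly from the known user message,
    # rather than relying on the planner's LLM to copy it into the plan
    return (
        'The original user request was:\n'
        f'{original_user_task}\n\n'
        'Execute the following global plan:\n'
        f'{plan}'
    )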

ryanhoangt (Contributor, Author)

Yeah, that makes sense, I can try doing that in the next run.

ryanhoangt (Contributor, Author)

Okay, finally the score is converging to what we want; thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93.

Some plots:

Comparing instances resolved in each category, it seems like CoAct doesn't perform very well on easy-level instances:

I'm going to upload the trajectories to HuggingFace shortly.

@@ -257,6 +265,10 @@ def _get_messages(self, state: State) -> list[Message]:
            else:
                raise ValueError(f'Unknown event type: {type(event)}')

            if message and message.role == 'user' and not self.initial_task_str[0]:
                # first user message
                self.initial_task_str[0] = message.content[0].text
A Collaborator left a review comment on this hunk:

Just wondering, do we still need this?

enyst (Collaborator) commented Sep 26, 2024

Cheers! This is great news. ❤️

The reason I suggested we take a look at the default agent changes was just to make sure they don't change its normal behavior. Give or take some details that I'm guessing the integration tests will be unhappy with (and which we can see and fix if so), I think it shouldn't be a problem.

ryanhoangt (Contributor, Author)

The reason I suggested we take a look at the default agent changes was just to make sure they don't change its normal behavior. Give or take some details that I'm guessing the integration tests will be unhappy with (and which we can see and fix if so), I think it shouldn't be a problem.

The trajectory is uploaded to the visualizer here. I'm going to run the evaluation on all 300 instances with the remote runtime to see how it goes, and also clean up the code a bit and fix the tests.

Labels: agent framework (Strategies for prompting, agent, etc) · enhancement (New feature or request)