
feat(workflow): Implement a simplified CoAct workflow #3770

Open

wants to merge 53 commits into main
Conversation

ryanhoangt (Contributor) commented Sep 7, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

  • This PR implements a simplified multi-agent workflow inspired by the CoAct paper.
  • Currently, in the swe-bench eval, there are complex instances that OpenHands fails, especially ones where a single CodeActAgent overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.
  • An overkill-ish successful trajectory with replanning can be found here.
  • A task which CoActPlannerAgent finished but CodeActAgent failed on (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

  • Modify CodeAct to make it accept delegated tasks.
  • Implement 2 new agents, a planner and an executor, with the same abilities as CodeAct but different system prompts and additional action parsers (a minimal sketch follows this list).
  • Nit: adjust the delegate message shown in the UI.
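A rough, self-contained sketch of the kind of planner-side action parser meant here, assuming a regex over the <execute_global_plan> tag seen in the trajectories; the AgentDelegateAction stand-in below is simplified and is not the actual OpenHands class:

import re
from dataclasses import dataclass, field


@dataclass
class AgentDelegateAction:
    # simplified stand-in for the framework's delegate action
    agent: str
    inputs: dict = field(default_factory=dict)


GLOBAL_PLAN_PATTERN = re.compile(
    r'<execute_global_plan>(.*?)</execute_global_plan>', re.DOTALL
)


def parse_planner_response(llm_response: str) -> AgentDelegateAction | None:
    """If the planner emitted a global plan, delegate it to the executor."""
    match = GLOBAL_PLAN_PATTERN.search(llm_response)
    if match is None:
        return None  # fall back to the regular CodeAct action parsers
    plan = match.group(1).strip()
    return AgentDelegateAction(agent='CoActExecutorAgent', inputs={'task': plan})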

Some next steps to improve this may be:

  • Try eval on some swe-bench-lite instances.
  • Adjust the system/user prompts and few-shot examples to further specialize the two agents so they can cooperate to finish a swe-bench issue. Also define the structure of the plan (e.g., its components).
  • Use a meta prompt to reinforce the agents' actions, to make sure they follow the workflow.
  • Experiment with the ability for the global planner to refuse a replan request from the executor.
  • Implement the ability for the delegated agent (e.g., BrowsingAgent or CoActExecutorAgent) to persist its history across the multi-turn conversation.

Link of any specific issues this addresses

#3077

tobitege (Collaborator) commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

ryanhoangt (Contributor, Author) commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

Thanks, I'm still waiting for reviews on it; if it's good to go I will look into the tests.

neubig self-requested a review on September 8, 2024

neubig (Contributor) left a comment


Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

ryanhoangt (Contributor, Author)

Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

Thanks Prof., I'll do that and update on how it goes soon.

tobitege (Collaborator)

It might be in the paper(s), but I don't quite like that the prompts now talk of an agent, while everywhere else it is assistant. 🤔

enyst (Collaborator) commented Sep 16, 2024

@ryanhoangt @ketan1741
I'm debugging this a bit, and I don't yet have all the answers we need, but just so you know:

  • The PR is merged. Please see here for delegate trajectories on a simple test: HuggingFace or the attached CoActPlannerAgent.zip

  • IMHO django__django-10914 looks like a fluke. It's funny... the agent has to find FILE_UPLOAD_PERMISSIONS; in the task, FILE_UPLOAD_PERMISSION is used interchangeably with FILE_UPLOAD_PERMISSIONS, and the agent read the right file, but 1) on main it concluded that it has FILE_UPLOAD_PERMISSIONS and modifies it, 2) on the branch it concluded it cannot find FILE_UPLOAD_PERMISSION, so it has to add it and fails the task.
    In the example test above it succeeded (repeatedly) on the PR branch. 🤷

  • In part, we are making CodeAct a delegate and it's not yet ready for that. 😅 For example, one thing that happens is that it uses get_user_message() here, which has to return a MessageAction. But a delegate doesn't have any MessageAction, and the method has no protection for delegates (it doesn't stop at start_id or AgentDelegateAction), so it hijacks a MessageAction from the parent agent. This shouldn't happen (fix coming up; a simplified sketch of the kind of guard needed follows this list).
    It doesn't influence anything else, but there might be other issues of this kind...?
    FWIW other delegates use something like this.
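For readers following along, a minimal sketch of the kind of protection meant here, using hypothetical, simplified event types rather than the real get_user_message() implementation:

from dataclasses import dataclass


@dataclass
class Event:
    id: int
    source: str  # 'user' or 'agent'


@dataclass
class MessageAction(Event):
    content: str


def get_latest_user_message(history: list[Event], delegate_start_id: int) -> MessageAction | None:
    """Walk the history backwards, but never past the event that started this
    delegate, so a delegate cannot hijack a MessageAction from its parent."""
    for event in reversed(history):
        if event.id < delegate_start_id:
            break  # earlier events belong to the parent agent
        if isinstance(event, MessageAction) and event.source == 'user':
            return event
    return None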

ryanhoangt (Contributor, Author) commented Sep 17, 2024

@enyst thanks for the quick fix and the insightful comments.

Re the django__django-10914 instance, it's interesting to know there are some unclear specs here; I don't know which behavior I should advocate for 😅. But I think there isn't much we can do; it has more to do with the LLM than with the agent implementation.

we are making CodeAct a delegate and it's not yet ready for it.

Yeah I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform.

Tbh this is not the version I initially expected (i.e., using the grounding test case to help the agent resolve the issue), nor one I'm entirely happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help. Should I close this PR for now?


Hi @jatinganhotra, regarding the full evaluation, I think it will likely lower the score further, as the remaining instances are not included in swe-bench-verified. IMO it would be more beneficial to run the full eval after validating that the agent gets good scores on this "verified" subset.

On All Hands' Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.

enyst (Collaborator) commented Sep 17, 2024

Tbh this is not the version I initially expected, nor one I'm entirely happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help.

I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).

Should I close this PR for now?

Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing things around what happens. It is quite amazing that it solves 6 more than main, but for some reason we don't know yet, it breaks on 9. FWIW the 2 tests I worked with were from those 9 that pass on main but failed on the branch, and they passed every time.

Two minor thoughts:

  • the PR sends the new actions serialized as JSON, but CodeAct doesn't usually do that; it sends strings, as in a dialogue with the LLM. It may be less confusing to the LLM if we are consistent and add the new ones around here in a similar format, extracting the useful info (see the sketch after this list). Small note: Claude doesn't do as well with JSON as GPT-4o in my experience, not because it can't, but it seems to need some prompting or gets confused at times.
  • what do you think: what if we give Claude the prompt pieces and ask it to rephrase them for clarity, precision, and all the info needed?
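To illustrate the "similar format, extracting the useful info" idea, a hypothetical helper (not code from the PR) that renders a global plan as dialogue text instead of the default JSON serialization, using the phase fields (description, reason, expected_state) that appear in the trajectories:

def plan_to_dialogue_text(phases: list[dict]) -> str:
    """Render the planner's phases as plain dialogue text for the LLM,
    instead of serializing the whole action to JSON."""
    lines = ['I propose the following global plan:']
    for i, phase in enumerate(phases, start=1):
        lines.append(f'Phase {i}: {phase["description"]}')
        if phase.get('reason'):
            lines.append(f'  Reason: {phase["reason"]}')
        if phase.get('expected_state'):
            lines.append(f'  Expected state: {phase["expected_state"]}')
    return '\n'.join(lines)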

One other thing that may help in merging this PR is if we can keep the changes to the default agent to a minimum, or show that they don't affect its performance. Just my thought, not sure how others think about it?

ryanhoangt (Contributor, Author) commented Sep 17, 2024

Sounds good, I can do the debugging in the next few days and try running a new eval to obtain the trajectory of the executor. In the previous eval, bugs such as the missing </execute...> tag in the stop sequences also negatively affected performance, as they made the planner execute the plan itself. Also happy to hear other ideas on improving this!


Also, just to clarify a bit: the 6/9 diff above is for instances at the medium (15m-1h) level. There's also a diagram for instances at the easy (<15m) level (the 2 instances you ran are in the red subset):
(Venn diagram of resolved easy instance IDs)

I looked into the trajectories at this easy level; one observation that makes CoAct fail on many instances is that the planner very often came up with plans that had more steps than needed (often the plan contained 3 phases but only 1 of them was actually required to resolve the issue).


Some questions:

the PR sends the new actions as serialized in json

Can you clarify this? I'm not sure I did any JSON serialization in the PR...

One other thing that may help merging this PR is if we can keep to a minimum the changes to the default agent

I think there are no changes to CodeAct execution in this PR, except the message property of AgentDelegateAction (which CodeAct should not use in swe-bench). Can you point out the changes you're concerned about?

what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?

Can you give an example of this?

enyst (Collaborator) commented Sep 17, 2024

Can you clarify this? I'm not sure I did any json serialization in the PR...

🤔 I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it falls back to the default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above. I'm looking now at the first of those 2, and step 9 is like:

Action
{
"id":20
"timestamp":"2024-09-16T19:46:29.172006"
"source":"agent"
"message":"The new test method has been added successfully. Now all three phases of the task are complete. Let's summarize what we've done:

1. Modified the FILE_UPLOAD_PERMISSIONS setting in global_settings.py to 0o644.
....

enyst (Collaborator) commented Sep 17, 2024

I'm attaching a couple of files that look very strange to me. They're a few of the prompts sent to the LLM during django__django-11964.
prompts.zip

prompt_039.py is the first prompt to the Planner after the Executor returns with success. It has an observation using JSON. The last lines:

...
</expected_content>
</execute_global_plan>

----------

OBSERVATION:
{'content': "Thank you for the update. The changes for all three phases have been successfully implemented. Let's review the modifications:\n\n1. We added the `__str__` method to the `TextChoices` class.\n2. We added the `__str__` method to the `IntegerChoices` class.\n3. We added the `__str__` method to the base `Choices` class.\n\nThese changes should address the issue of differing types for TextChoices and IntegerChoices fields. The `__str__` method in each class now ensures that the string representation of the enum value is returned.\n\nTo summarize the changes made:\n\n1. In the `Choices` base class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n2. In the `IntegerChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n3. In the `TextChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\nThese modifications should resolve the issue of differing types for TextChoices and IntegerChoices fields. The value returned by the getter of the field will now be of the same type as the one created by enum.Enum (enum value).\n\nThe task is now complete."}


ENVIRONMENT REMINDER: You have 21 turns left to complete the task. When finished reply with <finish></finish>.

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt:

prompt_040.py

You are an autonomous intelligent programming agent playing the role of a subordinate employee...
...
NOW, LET'S START!
The user message is: "Now, let's come up with 2 global plans sequentially.
- First, examine the codebase and locate the relevant code for the issue. Then we'll come up with the FIRST detailed plan with all the edits to resolve it.
- After the local agent finishes executing the first plan, navigate the codebase again and come up with the SECOND detailed plan to create exactly ONE unit test at the correct location to verify the change has actually resolved the issue. As the LAST phase, explicitly tell the executor to execute it after creating that test case. If the test failed and after debugging it the local executor believes the previous fixes are incorrect, request for a new plan and include the error with explanation for that request."

# Phases
## Phase 1
- description: Create a new test file for enum-related tests.
 ...
## Phase 3
- description: Execute the newly created test case.
- reason: We need to verify that our changes have resolved the issue and that the test passes.
- expected_state: The test case is executed and passes successfully.


ENVIRONMENT REMINDER: You have 20 turns left to complete the task. When finished reply with <finish></finish>.

Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

ryanhoangt (Contributor, Author) commented Sep 18, 2024

I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:

Re the JSON in the visualizer, it seems to be because we don't format the finish action yet.

prompt_039.log - It has an observation using JSON.

Good catch, this seems to be another bug. Might be because this action is not handled properly:

return AgentFinishAction(thought=thought, outputs={'content': thought})
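One possible way to handle it, sketched here as a hypothetical helper (not a fix that exists in the PR): unpack the delegate's outputs into plain text before building the observation, rather than letting the raw dict be serialized into the prompt.

def delegate_outputs_to_text(outputs: dict) -> str:
    # unpack the delegate's outputs into plain text for the observation,
    # instead of letting the {'content': ...} dict leak into the prompt as-is
    if 'content' in outputs:
        return str(outputs['content'])
    return '\n'.join(f'{key}: {value}' for key, value in outputs.items())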

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt

Yeah, I also noticed this issue. My intention was to make the Planner include the full user message (hence the full problem statement in swe-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." snippet as you saw, which is problematic.

I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to CodeActSWEAgent below, it can be used to steer the agent a bit to be better at a specific task, but I'm not sure the current "2 global plans" is the optimal way to go. In CodeActAgent there're many cases where the agent just fixed the issues without creating any tests.

if agent_class == 'CodeActSWEAgent':
    instruction = (
        'We are currently solving the following issue within our repository. Here is the issue text:\n'
        '--- BEGIN ISSUE ---\n'
        f'{instance.problem_statement}\n'
        '--- END ISSUE ---\n\n'
    )
    if USE_HINT_TEXT and instance.hints_text:
        instruction += (
            f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
        )
    instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issues discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
else:
    # Testing general agents
    instruction = (
enyst (Collaborator) commented Sep 18, 2024

I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
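A sketch of that idea, with hypothetical names, assuming the harness already has the original problem statement in hand when the delegation happens:

def build_executor_task(original_user_task: str, plan: str) -> str:
    # compose the delegate's task directly from the known user message,
    # rather than relying on the planner's LLM to copy it into the plan
    return (
        'The original user request was:\n'
        f'{original_user_task}\n\n'
        'Execute the following global plan:\n'
        f'{plan}'
    )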

ryanhoangt (Contributor, Author)

Yeah, that makes sense, I can try doing that in the next run.

ryanhoangt (Contributor, Author)

Okay, finally the score is converging to what we want; thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93.

Some plots:

Comparing instances resolved in each category, it seems like CoAct doesn't perform very well on easy-level instances:

I'm going to upload the trajectories to HuggingFace shortly.

@@ -257,6 +265,10 @@ def _get_messages(self, state: State) -> list[Message]:
            else:
                raise ValueError(f'Unknown event type: {type(event)}')

            if message and message.role == 'user' and not self.initial_task_str[0]:
                # first user message
                self.initial_task_str[0] = message.content[0].text
A Collaborator left a review comment on this hunk:

Just wondering, do we still need this?

enyst (Collaborator) commented Sep 26, 2024

Cheers! This is great news. ❤️

The reason I suggested we take a look at the default agent changes was just to make sure they don't change its normal behavior. Give or take some details that I'm guessing the integration tests will be unhappy with (and which we can see and fix if so), I think it shouldn't be a problem.

ryanhoangt (Contributor, Author)

The reason I suggested we take a look at the default agent changes was just to make sure they don't change its normal behavior. Give or take some details that I'm guessing the integration tests will be unhappy with (and which we can see and fix if so), I think it shouldn't be a problem.

The trajectory is uploaded to the visualizer here. I'm going to run the evaluation on all 300 instances with the remote runtime to see how it goes, and also clean up the code a bit and fix the tests.

Labels: agent framework (Strategies for prompting, agent, etc) · enhancement (New feature or request)