feat(workflow): Implement a simplified CoAct workflow #3770
base: main
Conversation
just fyi, the integration tests seem to fail because of some … in some results.

Thanks, still waiting for reviews on it; if it is good to go I will look into the tests.
(force-pushed from 2890cc5 to fc44e17)
Hey, thanks a bunch for this, @ryanhoangt!
I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.
Thanks Prof., I'll do that and post an update on how it goes soon.
It might be in the paper(s), but I don't quite like that the prompts now talk of
@ryanhoangt @ketan1741
@enyst thanks for the quick fix and the insightful comments. Re the
Yeah, I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform. Tbh this is also not a version that I initially expected or am fully happy with (i.e., using the grounding test case to help the agent resolve the issue), but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if that helps. Should I close this PR for now?

Hi @jatinganhotra, regarding the full evaluation, I think it will likely further lower the score, as the remaining instances are not included in

On All Hands's Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.
I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).
Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing what we find along the way. It is quite amazing that it solves 6 more than

Two minor thoughts:
One other thing that may help in merging this PR is if we can keep the changes to the default agent to a minimum, or show that they don't affect its performance. Just my thought; not sure how others think about it?
🤔 I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it takes it by default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:
I attach a couple of files that are very strange to me. They're a few of the prompts sent to the LLM during django__django-11964. prompt_039.py is the first prompt to Planner after the Executor returns with success. It has an observation using JSON. The last lines:
There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt: prompt_040.py
Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
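One way the Executor's prompt could pick up Planner-specific text is if both agents share a single unfiltered event history. A minimal sketch of keeping per-agent views separate is below; the event dicts and their `agent` field are illustrative assumptions, not the real OpenHands event schema.

```python
# Hypothetical sketch: filter a shared event history per agent, so that
# Planner-only instructions don't leak into the Executor's prompt.
def events_for_agent(events: list[dict], agent: str) -> list[dict]:
    """Keep user events plus events belonging to the given agent."""
    return [e for e in events if e.get('agent') in (agent, 'user')]


history = [
    {'agent': 'user', 'text': 'Fix the bug in utils.py'},
    {'agent': 'CoActPlannerAgent',
     'text': "Now, let's come up with 2 global plans sequentially."},
    {'agent': 'CoActExecutorAgent', 'text': 'Running the grounding test...'},
]

# The Planner-specific instruction is excluded from the Executor's view.
executor_view = events_for_agent(history, 'CoActExecutorAgent')
```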
Re the JSON in the visualizer: it seems to be because we don't format the finish action yet.
Good catch, this seems to be another bug. Might be because this action is not handled properly: `return AgentFinishAction(thought=thought, outputs={'content': thought})`
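A minimal sketch of what "formatting the finish action" could look like, so the raw JSON serialization doesn't surface in the visualizer. The `AgentFinishAction` stand-in and `format_finish_action` helper are simplified assumptions, not the actual OpenHands types.

```python
from dataclasses import dataclass, field


# Simplified stand-in for the real AgentFinishAction.
@dataclass
class AgentFinishAction:
    thought: str = ''
    outputs: dict = field(default_factory=dict)


def format_finish_action(action: AgentFinishAction) -> str:
    """Render the finish action as plain text instead of falling back
    to the default serialization (which shows raw JSON)."""
    return action.outputs.get('content') or action.thought


a = AgentFinishAction(thought='All tests pass.',
                      outputs={'content': 'All tests pass.'})
print(format_finish_action(a))  # All tests pass.
```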
Yeah, I also noticed this issue. My intention is to have the Planner include the full user message (hence the full problem statement in swe-bench) to give the Executor some more context, but sometimes it includes the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." line you saw, which is problematic.
"let's come up with 2 global plans sequentially" is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner, similar to `OpenHands/evaluation/swe_bench/run_infer.py` lines 248 to 290 at 41ddba8.
I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
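The suggestion above can be sketched as a deterministic step: since the original user request is already known, prepend it to the delegate's task programmatically instead of asking the Planner's LLM to copy it. The function and argument names here are hypothetical.

```python
# Hypothetical sketch: build the Executor's task string ourselves from
# the verbatim user request, rather than nudging the Planner to copy it.
def build_executor_task(initial_task_str: str, phase_plan: str) -> str:
    """Prepend the verbatim user request to the Planner's phase plan."""
    return (
        f'Original user request:\n{initial_task_str}\n\n'
        f'Current phase to execute:\n{phase_plan}'
    )


task = build_executor_task(
    'Fix the queryset deferral bug described in the issue.',
    'Phase 1: write a failing grounding test.',
)
```

This removes the failure mode where the LLM copies text from few-shot examples instead of the real request.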
Yeah, that makes sense. I can try doing that in the next run.
Okay, finally the score is converging to what we want, thanks @enyst for all the improvement suggestions! On the subset of 93 verified instances, CoAct resolved 33/93 while CodeAct resolved 39/93.

Some plots: [plots omitted] Comparing instances resolved in each category, it seems like CoAct doesn't perform very well on easy-level instances.

I'm going to upload the trajectories to Huggingface shortly.
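A per-category comparison like the one above can be reproduced by diffing the resolved-instance sets of the two runs. This is a rough sketch: the report format assumed here (a JSON object mapping `instance_id` to a `resolved` flag) is an assumption, and real swe-bench reports may differ.

```python
# Rough sketch (assumed report format) for comparing which instances
# each agent resolved that the other missed.
import json


def resolved_ids(report_path: str) -> set:
    """Return the set of instance ids marked resolved in a report."""
    with open(report_path) as f:
        report = json.load(f)
    return {iid for iid, entry in report.items() if entry.get('resolved')}


def diff_runs(coact_path: str, codeact_path: str) -> dict:
    """Partition instances into CoAct-only, CodeAct-only, and both."""
    coact, codeact = resolved_ids(coact_path), resolved_ids(codeact_path)
    return {
        'coact_only': coact - codeact,
        'codeact_only': codeact - coact,
        'both': coact & codeact,
    }
```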
```diff
@@ -257,6 +265,10 @@ def _get_messages(self, state: State) -> list[Message]:
             else:
                 raise ValueError(f'Unknown event type: {type(event)}')

+            if message and message.role == 'user' and not self.initial_task_str[0]:
+                # first user message
+                self.initial_task_str[0] = message.content[0].text
```
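For context on the hunk above, here is a standalone sketch of the first-user-message capture it adds; `Message` and `TextContent` are simplified stand-ins for the real OpenHands types.

```python
# Simplified stand-ins for the real OpenHands message types.
from dataclasses import dataclass


@dataclass
class TextContent:
    text: str


@dataclass
class Message:
    role: str
    content: list


class AgentSketch:
    def __init__(self):
        # A one-element list, so any shared reference (e.g. one handed
        # to a delegate) observes the in-place update.
        self.initial_task_str = ['']

    def record(self, message: Message) -> None:
        # Capture only the first user message; later ones are ignored.
        if message and message.role == 'user' and not self.initial_task_str[0]:
            self.initial_task_str[0] = message.content[0].text
```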
Just wondering, do we still need this?
Cheers! This is great news. ❤️ The reason I suggested we take a look at the default agent changes was just to make sure it doesn't change its normal behavior. Give or take some details that I'm guessing integration tests will be unhappy with (we can see and fix them if so), I think it shouldn't be a problem.
The trajectory is uploaded to the visualizer here. I'm going to run evaluation on all 300 instances with the remote runtime to see how it goes, and also clean up the code a bit and fix tests.
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

`CodeActAgent` overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help. `CoActPlannerAgent` finished but `CodeActAgent` failed (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

Some next steps to improve this may be:
- … `swe-bench-lite` instances.
- … (`BrowsingAgent` or `CoActExecutorAgent`) to persist its history through the multi-turn conversation.

Link of any specific issues this addresses

#3077