Integrating MemGPT-like Functionality #2937
Conversation
```python
        return self.action_parser.parse(response)

    def condense(
        self,
        state: State,
```
If you move this `.condense` function to the `LLM` class, you can make `messages` the input argument. We might need to make each `Message` a Pydantic class with an additional attribute (e.g., `condensable`). This may also benefit the ongoing effort to add vision capability in #2848 (comment) (cc @Kaushikdkrikhanu)
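A minimal sketch of what that Pydantic `Message` could look like; `condensable` is the attribute proposed above, and everything else is illustrative:

```python
from pydantic import BaseModel

class Message(BaseModel):
    role: str
    content: str
    # system and example messages would set this to False,
    # telling the condenser to leave them alone
    condensable: bool = True
```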
This can be done. But on moving `.condense` to the `LLM` class, it won't be able to add the summary to `state.history` directly. I might also need to pass `state` as an arg to `llm.completion` along with the messages.

If we are trying to make it cleaner (#2937 (comment)), I was instead thinking of moving `.condense` to the `MemoryCondenser` class, after making each `Message` a Pydantic class. That's where I think `.condense` should belong 😅.

WDYT? Which approach seems better?
Huh, good point... maybe the caller of `LLM` should make sure the summary gets added to the `State` - like what we are doing now?
On second thought: maybe `State` is not the place to add these new summarizing actions - `EventStream` should be: https://github.com/OpenDevin/OpenDevin/blob/c555fb68400bb117011cda3e5d3d95beb5169000/opendevin/controller/agent_controller.py#L395

Maybe you could add an optional argument to the `LLM` class, like `condense_callback_fn`, which will be called with the condensed `Summarize` action whenever a condensation happens. That callback would then add the `Summarize` action to the event stream?
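A rough sketch of that wiring, assuming a stripped-down `LLM` class; only `condense_callback_fn` is the name proposed here, and a stub dict stands in for the real LLM summarization call:

```python
from typing import Callable, Optional

class LLM:
    def __init__(self, condense_callback_fn: Optional[Callable[[dict], None]] = None):
        # called with the Summarize action whenever condensation happens
        self.condense_callback_fn = condense_callback_fn

    def condense(self, messages: list) -> dict:
        # stub: a real implementation would call the model to produce the summary
        summary = {'action': 'summarize', 'content': f'condensed {len(messages)} messages'}
        if self.condense_callback_fn is not None:
            # caller-supplied hook, e.g. appending the action to the event stream
            self.condense_callback_fn(summary)
        return summary
```

The caller could then pass a function that appends the action to the event stream, keeping the `LLM` class unaware of `State`.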
Can you elaborate on this? How does adding summarizing actions to `EventStream` help? MemGPT stores only the summary of events from the start up to some `event.id`. We only need to store one summary in the history. A summary event remains useful until we create a new summary, after which it becomes useless.

Btw, I've moved the `.condense` functions to the `LLM` class. The `MemoryCondenser` class used methods of the `LLM` class, so I completely removed it and moved one of its methods into the `LLM` class. This also achieves the target (#2937 (comment))
Basically, going forward, we hope to make `EventStream` a single source of truth for all our past messages / history (i.e., you get all the history messages from the event stream), so ideally all your condensation actions should also go into the event stream.

But based on my understanding, your current implementation already emits `AgentSummarizeAction`? I think we just need to make sure that action gets stored inside the event stream, and we should be good to go on this bullet.
@xingyaoww Oh interesting, I played with summaries in the event stream at some point. A couple of thoughts:
- Are they 'real events'? That is, if emitted with the same mechanism as the others (which they currently were not), would they be propagated downstream to listening components?
- Or, since they replace at runtime the real events they summarize, are they more like 'shadow events'? In which case they're not the truth, but a fleeting version of it, necessary only due to some runtime constraints; the source of truth holds the real events and can retrieve them when needed.
- They need to be saved, I assume, even if only one at a time, so they can be retrieved on a disconnect or a restart/restore. Do we save them along with the other events, or do we save them in `State`? If we save them in the event stream, then we may want to consider moving some filtering logic too.

On a side note, allowing only one per run simplifies the logic either way.
Looks better overall! But I think there are still a few places that we can clean up :)

`opendevin/condenser/condenser.py` (outdated)
```python
                ]
            )
        ] + message_sequence_to_summarize[cutoff:]
```
This doesn't seem to work well; we want to convert all messages to strings, but got:

```
[AgentSummarizeAction(summarized_actions='I asked the user about the task, provided a concise description of the Moon, and wrote a Bash script to solve the Three Queens problem.', summarized_observations='The user requested a concise description of the Moon and a Bash script for the Three Queens problem. The script execution encountered issues, resulting in exit codes -1 and 1.', action='summarize'), <opendevin.llm.messages.Message object at 0x17a962030>, <opendevin.llm.messages.Message object at 0x17a961ac0>, <opendevin.llm.messages.Message object at 0x17a962330>, <opendevin.llm.messages.Message object at 0x17a962000>, <opendevin.llm.messages.Message object at 0x17a961b80>, <opendevin.llm.messages.Message object at 0x17a961f10>, <opendevin.llm.messages.Message object at 0x17a962270>, <opendevin.llm.messages.Message object at 0x17a961e20>, <opendevin.llm.messages.Message object at 0x17a9621b0>, <opendevin.llm.messages.Message object at 0x17a961e80>, <opendevin.llm.messages.Message object at 0x17a961be0>, <opendevin.llm.messages.Message object at 0x17a961c40>, <opendevin.llm.messages.Message object at 0x17a961ee0>, <opendevin.llm.messages.Message object at 0x17a961e50>, <opendevin.llm.messages.Message object at 0x17a9615b0>, <opendevin.llm.messages.Message object at 0x17a961d60>, <opendevin.llm.messages.Message object at 0x17a9611f0>, <opendevin.llm.messages.Message object at 0x17a961f40>, <opendevin.llm.messages.Message object at 0x17a962e40>]
```

I think it happens when it's called a second time, recursively.
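One way to avoid such a mixed list, sketched with hypothetical names (not the PR's actual code): normalize every item to a string before any recursive pass.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

    def __str__(self) -> str:
        return f'{self.role}: {self.content}'

def serialize_for_summary(items: list) -> list:
    """Convert summary actions and Message objects alike to plain strings,
    so a second, recursive condensation sees a uniform list."""
    return [item if isinstance(item, str) else str(item) for item in items]
```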
Yes, it happens in the second recursion. I made some corrections. When I try it out, though, I am not able to trigger this if statement. How are you doing it? 😅
I have a large context (11-12k tokens) compared to the `max_input_tokens=4000` value. I suspect that's why it works for me, so you may want to play with the agent, keep adding to the context, then switch to a much lower setting. This 12k is triple the 4k limit. (So the 0.75 factor in use in this PR, even if we subtract the initial prompt, is left waaaay behind. 😅)

If it helps, here is my event stream. It's not great for testing in general, TBH, but I suspect it's good enough for this issue simply because of that size difference.

eventstream_main.zip

You may want to use all of these settings, like filestore (to save the events in the stream locally on disk; it will create a `main` sub-directory for it) and CLI session enabled. Then replace the 'main' sub-directory in the filestore with the contents here.

I should note, I didn't use the UI. If you're using the UI, you can do the same; you just need to set the JWT token and use the sub-directory named with the right session ID, instead of the 'main' sub-directory. Frankly, I've never tried to restore a CLI session into a UI session, but it should work(tm). Most events are the same, obviously; the UI might just have some extra INIT events, which shouldn't matter for a restore.
```python
        candidate_messages_to_summarize = []
        tokens_so_far = 0
        for message in messages:
```
We gave up using `State`, but then we still need to keep the user message that is likely to be the task. That is computed here: https://github.com/OpenDevin/OpenDevin/blob/84a6e90dc2e8b1e096af33f7545fa1969853c7d4/opendevin/controller/state/state.py#L169

Maybe we can set it as non-condensable in the agent, before we get here? Otherwise we're losing relevant information, I think.
Yes, something needs to be done about user messages; we're currently losing relevant information. But only marking user messages as non-condensable is not sufficient, I guess? Some user messages might not contribute much relevant information about the final task.

Currently `summarize_messages()` has `summarized_actions` & `summarized_observations` as args. I was thinking that maybe adding one more arg, `summarized_user_messages`, might also help. This would summarize all the user messages so far.
> It is currently losing relevant information. But only marking user messages as non-condensable is not sufficient ig? Some user messages might not contribute much to the relevant information about the final task.

I suggest not doing it for all user messages, but for the most likely task. The method linked above, `get_current_user_intent()`, returns only one user message: what is likely to be the current task. If we set that one as non-condensable, it will already improve the behavior. For SWE-bench, that method will return the initial user message, which is definitely the task itself.

For interactive user sessions, you're right, it's more complicated. The method tries to guess what the best user message is, and returns the last one after a `FinishAction`, so it's the user message just after a task ended and a new one started. That's a best guess, of course, that the new task is in that message, but IMHO it will do for now. Do you think it's likely it won't work?

> I was thinking, maybe adding one more arg, summarized_user_messages might also help. This will summarize all the user messages till now.

I see what you mean. In the previous version (PR 2021), we were summarizing the user messages in a summary of their own, as you suggest, when there were no more agent messages, so at a later stage, but again except the current user intent. The intent as detected (or attempted 😅) remained uncondensed as long as possible. Doing something like this might still be a good option, but I don't think it's necessary for this PR.

The important part, it seems to me, is that if we miss the initial user message, we risk the agent not having enough information about the task to perform correctly.
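The marking step could be sketched like this; `get_current_user_intent()` is the real OpenDevin method linked above, while the dataclass and helper here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    condensable: bool = True

def mark_task_non_condensable(messages: list, task_text: str) -> list:
    # task_text would come from State.get_current_user_intent() in OpenDevin
    for msg in messages:
        if msg.role == 'user' and msg.content == task_text:
            msg.condensable = False  # the condenser will skip this message
    return messages
```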
@khushvind This is amazing work, thank you! We definitely need something like this, and it's a smart algo. I'm sorry I didn't get to look into it very well yet, but I did give it a try, and it seems there are a few issues.

If you want, for replicating, I can give you the event stream I used. Or, my scenario was something like: use a model with a relatively large context window (I used gpt-4o with 128k), make sure sessions are enabled [1], do stuff, then when the history gets to some 11k, stop it, and config a 4k limit. This will provoke it not only to apply summarize, but also to do it consecutively before getting to a regular prompt, and it seems to behave unexpectedly in part; it loses information.

Also, it doesn't seem to save the summary (my run ended with an error, but I'm not sure I see in the code where it saves them, even if it ends well?). I can play with it more tomorrow, I hope. Thank you for this!

[1] If you use opendevin via the UI, save ...
Some thoughts, since we're at it: I think there are two major areas that have negatively affected performance in previous work.
FWIW, in the SWE-bench evals on PR 2021, I saw some tasks where the LLM was just restarting, as if it knew nothing. It wasn't using the summary! I think this was due to both prompt and recall. Since this PR condenses 75% of the messages, leaving a few at the end, I think we won't see that anymore. Nevertheless, TBH, if we still don't include recall, we do need to fix the prompt very well... I found some decent tasks on SWE-bench for this; I'm going to retrieve them and run them on this PR. I think I'll also add some recall and see how it goes then.
@khushvind Thanks for your amazing work -- let me know if this is ready for review again! I have a couple of ideas regarding how we can develop this:
```
    }
}
Make sure to include in observations any relevant information that the agent needs to remember.
%(events)s
```
This is a placeholder in the template string, for a key `events` to replace it. The code can use it like:

```python
events_dict = {'events': 'the data in summary input'}
formatted_string = SUMMARY_PROMPT_SYSTEM % events_dict
```

We can do something similar here, or we can remove this placeholder.
@xingyaoww I still have to run some evaluations on SWE-bench to check whether it works fine on different examples, and I also need to incorporate some changes that @enyst suggested. I'll let you know once it's done. I'll take a look at these benchmarks by the weekend.
MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep the info about which files shouldn't be checked. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.
Just to note, I played a little with it on some tasks on SWE-bench, and so far:

I'm traveling today and tomorrow; we'll see soon, though.
Maybe, but AFAIK the version in PR 2021 was almost identical without a word limit, and it didn't seem enough on SWE-bench tasks (otherwise it works in interactive sessions, since the user will nudge it). Worth trying, though.
That's an interestingly simple prompt! To clarify, the reason I mention files is that on SWE-bench, the LLM appeared to repeat attempts to work with the wrong files, as it had done before, even after it figured out the right files. 🤔 Maybe it was a fluke...
```
- "action": "summarize"
- args:
  - "summarized_actions": A precise sentence summarizing all the provided actions, written in the first person.
  - "summarized_observations": A few precise sentences summarizing all the provided observations, written in the third person.
```
Just curious: what's the motivation behind "first person" for actions and "third person" for observations?
We still need to do some work on improving the prompt; this was adapted from #2021. From what I felt, actions are performed by the agent, so it made more sense to use the first person for the action summary, as it gets passed back to the agent in the prompt; similarly, the third person for observations.

I didn't pay much attention to it at the time, but now that you've pointed it out, maybe we can avoid specifying the person in the new prompt, as this summary message has `role='user'` when passed to the LLM.
My intuition is that it doesn't matter at all, hence the question.
The CodeAct prompt is structured as a dialogue, alternating one message from the user and one from the agent. The former version split this result into two messages sent to the LLM, following the same pattern. It went like this:

```
role='user'
...The task is ...

role='agent'
For this task, I did this and I did that and the other...   # action-like

role='user'
The code was ... and the assistant had...   # obs-like
```

In this PR, we're sending them both in the same `role='user'` message, so it should probably all read like an observation. It will also be merged into the user task, e.g. it will end up like:

```
role='user'
The task is...
The assistant did this and that...
The code was... and...
```
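The merge described above could look roughly like this (a hypothetical helper, not the PR's code):

```python
def build_condensed_user_message(task: str,
                                 summarized_actions: str,
                                 summarized_observations: str) -> dict:
    # both summary halves land in the single role='user' message,
    # appended after the original task text
    content = '\n'.join([task, summarized_actions, summarized_observations])
    return {'role': 'user', 'content': content}
```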
What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Fix #1748
This PR is proposing an implementation integrating MemGPT-like functionality to OpenDevin's ShortTermMemory inspired by issue #2487. It also uses ideas from #2021.
Give a brief summary of what the PR does, explaining any non-trivial design decisions
The memory is summarized (condensed) when the token count approaches the context window limit but hasn't yet hit it. All events in the event stream are considered summarizable, except the system and example messages. When the condenser function is called, it tries to summarize events such that at least the first 0.75 fraction of the total tokens gets summarized. After this, the summary is added to the history. Events are summarized in order of occurrence, i.e., the earliest events are summarized first. The word limit for the summary defaults to <=200.
Messages are passed to the LLM in this order → (system message + example message + summary + all events after the last summarized event)
Currently, this functionality is only implemented for the CodeAct agent.
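The 0.75-fraction selection described above can be sketched as follows (illustrative only; the filtering of non-condensable system/example messages is omitted for brevity):

```python
def select_messages_to_condense(messages: list, token_counts: list,
                                fraction: float = 0.75) -> list:
    """Pick the oldest messages until at least `fraction` of the
    total token count is covered, so they can be summarized."""
    target = fraction * sum(token_counts)
    picked, so_far = [], 0
    for msg, tokens in zip(messages, token_counts):
        if so_far >= target:
            break
        picked.append(msg)
        so_far += tokens
    return picked
```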