
Integrating MemGPT-like Functionality #2937

Draft: khushvind wants to merge 42 commits into base main

Conversation

@khushvind khushvind commented Jul 15, 2024

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Fix #1748

This PR proposes an implementation that integrates MemGPT-like functionality into OpenDevin's ShortTermMemory, inspired by issue #2487. It also uses ideas from #2021.

Give a brief summary of what the PR does, explaining any non-trivial design decisions
The memory is summarized (condensed) when the token count approaches the context window limit but has not yet hit it. All events in the event stream are considered summarizable, except the system and example messages. When the condenser function is called, it tries to summarize events such that at least the first 75% of the total tokens get summarized. After this, the summary is added to the history. Events are summarized in order of occurrence, i.e., the earliest events are summarized first. The word limit for the summary is set to <=200 by default.

Messages are passed to the LLM in this order → (system message + example message + summary + all events after the last summarized event)

Currently, this functionality is only implemented for the CodeAct agent.
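
For concreteness, here is a minimal sketch of the cutoff logic described above; every name in it (Message, count_tokens, pick_cutoff) is an illustrative stand-in, not an identifier actually used in this PR.

```python
# Minimal, self-contained sketch of the 75%-cutoff idea; names are illustrative only.
from dataclasses import dataclass

@dataclass
class Message:
    content: str
    condensable: bool = True   # system / example messages would be created with False

def count_tokens(text: str) -> int:
    # Crude stand-in for the real tokenizer-based token count.
    return len(text.split())

def pick_cutoff(messages: list[Message], fraction: float = 0.75) -> int:
    """Index up to which the oldest condensable messages cover at least ~75% of tokens."""
    condensable = [m for m in messages if m.condensable]
    total = sum(count_tokens(m.content) for m in condensable)
    covered, cutoff = 0, 0
    for i, m in enumerate(condensable):
        covered += count_tokens(m.content)
        cutoff = i + 1
        if covered >= fraction * total:
            break
    return cutoff

# Everything before `cutoff` would be summarized (word limit <= 200) and replaced by a
# single summary; the prompt becomes: system + examples + summary + remaining events.
```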

return self.action_parser.parse(response)

def condense(
self,
state: State,
Contributor:

If you move this .condense function to the LLM class, you can make messages the input argument. We might need to make each Message a Pydantic class with an additional attribute (e.g., condensable). This may also benefit the ongoing effort to add vision capability in #2848 (comment) (cc @Kaushikdkrikhanu).
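
A rough sketch of what such a Pydantic Message could look like; only the condensable attribute comes from the suggestion above, the other fields and the helper method are assumptions for illustration.

```python
from pydantic import BaseModel

class Message(BaseModel):
    role: str                  # 'system' | 'user' | 'assistant' (assumed field)
    content: str               # assumed field
    condensable: bool = True   # system and example messages would set this to False

    def to_llm_dict(self) -> dict:
        # Shape actually sent to the completion API; extra fields stay internal.
        return {'role': self.role, 'content': self.content}
```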

Author:

This can be done. But on moving .condense to the LLM class, it won't be able to add the summary to state.history directly. I might also need to pass state as an arg to llm.completion along with messages.

If we are trying to make it cleaner (#2937 (comment)), I was instead thinking of moving .condense to the MemoryCondenser class, after making each Message a Pydantic class. That's where I think .condense should belong 😅.
WDYT? Which approach seems better?

Contributor:

Huh, good point... maybe the caller of the LLM should make sure the summary gets added to the State, like what we are doing now?

Contributor @xingyaoww commented Jul 19, 2024:

On second thought: maybe State is not the place to add these new summarizing actions; EventStream should be: https://github.com/OpenDevin/OpenDevin/blob/c555fb68400bb117011cda3e5d3d95beb5169000/opendevin/controller/agent_controller.py#L395

Maybe you could add an optional argument to the LLM class, like condense_callback_fn, which will be called with the condensed Summarize action whenever a condensation happens. That callback would then add the Summarize action to the event stream?
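
One possible shape of that wiring, as a hypothetical sketch; the real LLM and EventStream APIs may differ.

```python
# Hypothetical sketch of the suggested callback wiring; not the actual LLM class API.
from typing import Callable, Optional

class LLM:
    def __init__(self, condense_callback_fn: Optional[Callable] = None):
        # Called with the AgentSummarizeAction each time a condensation happens.
        self.condense_callback_fn = condense_callback_fn

    def condense(self, messages: list) -> object:
        summary_action = self._summarize(messages)   # builds an AgentSummarizeAction
        if self.condense_callback_fn is not None:
            self.condense_callback_fn(summary_action)
        return summary_action

    def _summarize(self, messages: list) -> object:
        ...  # call the model with the summarization prompt

# Controller side (sketch): the callback is what adds the action to the event stream,
# e.g. LLM(condense_callback_fn=lambda action: event_stream.add_event(action, source)).
```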

Author:

Can you elaborate on this? How does adding summarizing actions to the EventStream help?
MemGPT stores only the summary of events from the start up to some event.id. We only need to store one summary in the history. A summary event remains useful until we create a new summary, after which it becomes useless.

Btw, I've moved the .condense functions to the LLM class. The MemoryCondenser class used methods of the LLM class, so I removed it entirely and moved one of its methods to the LLM class. This also achieves the target (#2937 (comment)).

Contributor:

Basically, going forward, we hope to make EventStream a single source of truth for all our past messages / history (i.e., you get all the history messages from the event stream), so ideally all your condensation actions should also go into the event stream.

But based on my understanding, your current implementation already emits "AgentSummarizeAction"? I think we just need to make sure that action gets stored inside the event stream, and we should be good to go on this bullet.

Collaborator:

@xingyaoww Oh interesting, I played with summaries in the event stream at some point. A couple of thoughts:

  • are they 'real events'? That is, if emitted with the same mechanism as the others (which they currently are not), would they be propagated downstream to listening components?
  • or, since they replace at runtime the real events they summarize, are they more like 'shadow events'? In which case they're not the truth, but a fleeting version of it, necessary only due to some runtime constraints; the source of truth holds the real events and is able to retrieve them when needed.

Ref: https://opendevin.slack.com/archives/C06QKSD9UBA/p1716737811950479?thread_ts=1716668475.888269&channel=C06QKSD9UBA&message_ts=1716737811.950479

  • they need to be saved, I assume, even if it's one at a time, so they can be retrieved on a disconnect or restart/restore etc. Do we save them along with the other events, or do we save them in State? If we save them in the event stream, then we may want to consider moving some filtering logic too.

On a side note, making it only one per run simplifies the logic either way.

Contributor @xingyaoww left a review comment:

Looks better overall! But I think there are still a few places that we can clean up :)

]
)
]
+ message_sequence_to_summarize[cutoff:]
Collaborator:

This doesn't seem to work well; we want to convert all messages to strings, but got:

[AgentSummarizeAction(summarized_actions='I asked the user about the task, provided a concise description of the Moon, and wrote a Bash script to solve the Three Queens problem.', summarized_observations='The user requested a concise description of the Moon and a Bash script for the Three Queens problem. The script execution encountered issues, resulting in exit codes -1 and 1.', action='summarize'), <opendevin.llm.messages.Message object at 0x17a962030>, <opendevin.llm.messages.Message object at 0x17a961ac0>, <opendevin.llm.messages.Message object at 0x17a962330>, <opendevin.llm.messages.Message object at 0x17a962000>, <opendevin.llm.messages.Message object at 0x17a961b80>, <opendevin.llm.messages.Message object at 0x17a961f10>, <opendevin.llm.messages.Message object at 0x17a962270>, <opendevin.llm.messages.Message object at 0x17a961e20>, <opendevin.llm.messages.Message object at 0x17a9621b0>, <opendevin.llm.messages.Message object at 0x17a961e80>, <opendevin.llm.messages.Message object at 0x17a961be0>, <opendevin.llm.messages.Message object at 0x17a961c40>, <opendevin.llm.messages.Message object at 0x17a961ee0>, <opendevin.llm.messages.Message object at 0x17a961e50>, <opendevin.llm.messages.Message object at 0x17a9615b0>, <opendevin.llm.messages.Message object at 0x17a961d60>, <opendevin.llm.messages.Message object at 0x17a9611f0>, <opendevin.llm.messages.Message object at 0x17a961f40>, <opendevin.llm.messages.Message object at 0x17a962e40>]

I think it happens when it's called a second time, recursively.
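
In case it helps, a hypothetical normalization step along these lines would keep the second pass uniform; the Message stand-in below is illustrative, and only the summarized_actions / summarized_observations field names are taken from the repr above.

```python
from collections import namedtuple

# Illustrative stand-in; the real Message / AgentSummarizeAction classes live in opendevin.
Message = namedtuple('Message', ['role', 'content'])

def to_message(item):
    """Fold an AgentSummarizeAction into a plain Message so a second, recursive
    summarization pass only ever sees objects it knows how to stringify."""
    if isinstance(item, Message):
        return item
    text = (f'Summary of earlier actions: {item.summarized_actions}\n'
            f'Summary of earlier observations: {item.summarized_observations}')
    return Message(role='user', content=text)

# Usage sketch against the reviewed code's variable:
# message_sequence_to_summarize = [to_message(m) for m in message_sequence_to_summarize]
```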

Author:

Yes, it happens in the second recursion. I made some corrections.
When I try it out, I am not able to trigger this if statement. How are you doing it? 😅

Collaborator:

I have a large context (11-12k tokens) compared to the max_input_tokens=4000 value. I suspect that's why it triggers for me, so you may want to play with the agent, keep adding to the context, then switch to a much lower setting. This 12k is triple the 4k limit. (So the 0.75 factor used in this PR, even if we subtract the initial prompt, is left waaaay behind. 😅)

If it helps, here is my event stream. It's not great for testing in general, TBH, but I suspect it's good enough for this issue simply because of that size difference.
eventstream_main.zip

You may want to use all of these settings, like the file store (to save the events in the stream locally on disk; it will create a "main" sub-directory for them) and CLI sessions enabled. Then replace the "main" sub-directory in the file store with the contents here.

I should note, I didn't use the UI. If you're using the UI, you can do the same; you just need to set the JWT token and use the sub-directory named with the right session id instead of the "main" sub-directory. Frankly, I've never tried to restore a CLI session into a UI session, but it should work(tm). Most events are the same, obviously; the UI might just have some extra INIT events, which shouldn't matter for restore.


candidate_messages_to_summarize = []
tokens_so_far = 0
for message in messages:
Collaborator @enyst commented Jul 29, 2024:

We gave up on using State, but then we still need to keep the user message that is most likely to be the task. That is computed here: https://github.com/OpenDevin/OpenDevin/blob/84a6e90dc2e8b1e096af33f7545fa1969853c7d4/opendevin/controller/state/state.py#L169

Maybe we can set it as non-condensable in the agent, before we get here? Otherwise we're losing relevant information, I think.
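
For illustration, a rough sketch of pinning the likely task message, assuming a condensable flag on messages; get_current_user_intent() is the State method linked above, the rest is hypothetical.

```python
# Hypothetical helper; assumes messages carry role/content/condensable attributes.
def protect_task_message(state, messages) -> None:
    task = state.get_current_user_intent()   # the user message most likely to be the task
    for message in messages:
        if message.role == 'user' and message.content == task:
            message.condensable = False      # never fold the task itself into a summary
```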

Author:

Yes, something needs to be done about user messages. It is currently losing relevant information. But only marking user messages as non-condensable is not sufficient, I guess? Some user messages might not contribute much relevant information about the final task.

Currently summarize_messages() has summarized_actions & summarized_observations as args. I was thinking, maybe adding one more arg, summarized_user_messages, might also help. This would summarize all the user messages so far.

Collaborator @enyst commented Jul 30, 2024:

> It is currently losing relevant information. But only marking user messages as non-condensable is not sufficient, I guess? Some user messages might not contribute much relevant information about the final task.

I suggest not doing it for all user messages, but only for the most likely task. The method linked above, get_current_user_intent(), returns only one user message: the one most likely to be the current task. If we set that one as non-condensable, it will already improve the behavior. For SWE-bench, that method will return the initial user message, which is definitely the task itself.

For interactive user sessions, you're right, it's more complicated. The method tries to guess the best user message and returns the last one after a FinishAction, so it's the user message just after a task ended and a new one started. That's a best guess, of course, that the new task is in that message, but IMHO it will do for now. Do you think it's likely it won't work?

> I was thinking, maybe adding one more arg, summarized_user_messages, might also help. This would summarize all the user messages so far.

I see what you mean. In the previous version (PR 2021), we were summarizing the user messages in a summary of their own, as you suggest, when there were no more agent messages, so at a later stage, but again except the current user intent. The current intent as detected (or attempted 😅) remained uncondensed as long as possible. Doing something like this might still be a good option, but I don't think it's necessary for this PR.

The important part, it seems to me, is that if we miss the initial user message, then we risk that the agent will not have enough information about the task to perform correctly.

Collaborator @enyst commented Jul 30, 2024:

@khushvind This is amazing work, thank you! We definitely need something like this, and it's a smart algo.

I'm sorry I didn't get to look into it very thoroughly yet, but I did give it a try, and it seems there are a few issues. If you want, for replicating, I can give you the event stream I used; or, my scenario was something like: use a model with a relatively large context window (I used gpt-4o with 128k), make sure sessions are enabled [1], do stuff, then when the history gets to some 11k, stop it, configure max_input_tokens to 4k, then restart (or just change max_input_tokens to 4k in the debugger).

This provokes it not only to apply summarization, but also to do it consecutively before getting to a regular prompt, and it seems to behave unexpectedly in part: it loses information. Also, it doesn't seem to save the summary (my run ended with an error, but I'm not sure I see in the code where it saves them, even if it ends well?).

I can play with it more tomorrow I hope. Thank you for this!

[1] If you use OpenDevin via the UI, saving JWT_TOKEN=something in the env should do; just in case, enable file_store="local" with some file_store_path too in config.toml. If you use it with the CLI, set enable_cli_sessions=true. I find that enabling/restoring sessions is a huge time saver for this kind of stuff, for... reasons. 😅

Collaborator @enyst commented Jul 30, 2024:

Some thoughts, since we're at it: I think there are two major areas which have negatively affected performance in previous work.

  • the prompt. It didn't always result in detailed enough information to guide the LLM for the next steps. This PR mostly uses the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give the LLM clear info about which files it shouldn't check anymore and which files it should continue working on. I don't remember if this exact version was doing any of that reliably, but anyway, I wonder if we could adapt something more from MemGPT here? I'd favor completely ripping it out and replacing it, if we can.
  • missing Recall action. This PR, like 2021, never retrieves the original events, so it must repeat them if it needs them. This behavior in PR 2021 happened because that PR had gotten too large; I had to stop somewhere and left recall/long-term memory for follow-up PRs. It might not have been the best idea ever, because this feature is incomplete without it. Also, MemGPT obviously relies on retrievals all the time (99% of all its memory work is recalling originals). That's totally as it should be, and we do plan to do that, but the question is: is it possible to be good enough without it? Cc: @xingyaoww

FWIW, in the SWE-bench evals on PR 2021, I saw some tasks where the LLM was just restarting, as if it knew nothing. It wasn't using the summary! I think this was due to both the prompt and recall. Since this PR is condensing 75% of the messages, leaving a few at the end, I think we won't see that anymore. Nevertheless, TBH, if we still don't include recall, we do need to get the prompt very right...

I found some decent tasks on swe-bench for this, I'm going to retrieve them and run them on this PR. I think I'll also add some recall and see how it goes then.

@khushvind khushvind marked this pull request as draft July 30, 2024 18:11
Contributor @xingyaoww:
@khushvind Thanks for your amazing work -- Let me know if this is ready for review again!

I have a couple of ideas regarding how we can develop this:

  1. We should develop a benchmark / a way to evaluate a "memory condensation algo" quantitatively.

    • it could be something like a repurposed version of Needle in a Haystack. For example, the input is a very long context prompt (128k tokens) with an instruction of "summarize everything, with a goal of XXX (that is related to the needle)" (see the sketch after this list)
    • re-use benchmarks from MemGPT (https://arxiv.org/pdf/2310.08560) to make sure we are getting comparable QA performance (using our condensation algo)
    • running SWE-bench with condensation enabled (sort of an ultimate metric); I think getting the first two for eval is probably easier
  2. I think this PR is probably ready to merge (but not enabled by default) once we can get performance comparable to MemGPT using the benchmarks in their paper.

  3. Once we achieve comparable or better performance with condensation on SWE-bench, I think we can enable it by default.
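
A purely hypothetical harness for the Needle-in-a-Haystack idea from the first bullet; condense_fn stands in for whatever summarization entry point this PR ends up exposing.

```python
# Hypothetical eval harness; condense_fn and the eval-set shape are assumptions.
def needle_recall(condense_fn, needle: str, haystack: list[str]) -> bool:
    """Summarize a long history and check whether the needle survives in the summary."""
    summary_text = condense_fn(haystack)      # expected to return the summary as plain text
    return needle.lower() in summary_text.lower()

# Score over an eval set (sketch):
# recall = sum(needle_recall(condense_fn, n, h) for n, h in eval_set) / len(eval_set)
```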

}
}
Make sure to include in observations any relevant information that the agent needs to remember.
%(events)s
Collaborator:

This is a placeholder in the template string, for an events key to fill in. The code can use it like:

events_dict = {'events': 'the data in summary input'}
formatted_string = SUMMARY_PROMPT_SYSTEM % events_dict

For example https://github.com/OpenDevin/OpenDevin/blob/21ea9953b3026223f65a04cf5699d6ad0915bfa7/agenthub/planner_agent/prompt.py#L167

We can do something similar here, or we can remove this placeholder.

Author @khushvind:
@xingyaoww I still have to run some evaluations on SWE-bench to check whether it works fine on different examples. I also need to incorporate some changes that @enyst suggested. I'll let you know once it's done.
I am not able to run evaluations because, for some reason, my WSL started crashing yesterday 😅. I guess I'll have to do the setup again, which might take some time.

I'll take a look at these benchmarks by the weekend.

Author @khushvind:
@enyst

> It didn't always result in detailed enough information to guide the LLM for the next steps.

The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that's not enough to convey sufficient information in the summary after a lot of events. I need to check this.

> This PR mostly uses the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give the LLM clear info about which files it shouldn't check anymore and which files it should continue working on. I don't remember if this exact version was doing any of that reliably, but anyway, I wonder if we could adapt something more from MemGPT here? I'd favor completely ripping it out and replacing it, if we can.

MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.

Collaborator @enyst commented Aug 1, 2024:

> @xingyaoww I still have to run some evaluations on SWE-bench to check whether it works fine on different examples. I also need to incorporate some changes that @enyst suggested. I'll let you know once it's done. I am not able to run evaluations because, for some reason, my WSL started crashing yesterday 😅. I guess I'll have to do the setup again, which might take some time.
>
> I'll take a look at these benchmarks by the weekend.

Just to note, I played a little with it on some tasks on SWE-bench, and so far:

  • as we already know, it needs some work around the task and the summary (on gpt-4o, which at least tries to work)
  • more surprisingly, Claude doesn't work with our current prompt: it refuses to summarize and continues the task instead (which is a little rude, if you ask me 🫢 😅)

I'm traveling today and tomorrow, we'll see soon though.

> The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that's not enough to convey sufficient information in the summary after a lot of events. I need to check this.

Maybe, but AFAIK the version in PR 2021 was almost identical without a word limit, and it didn't seem enough on SWE-bench tasks (otherwise it works in interactive sessions, since the user will nudge it). Worth trying though.

> MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.

That's an interestingly simple prompt! To clarify, the reason I mention files is that on SWE-bench, the LLM appeared to repeat its attempts to work with the wrong files, like it had done before, even after it had figured out the right files. 🤔 Maybe it was a fluke...

- "action": "summarize"
- args:
- "summarized_actions": A precise sentence summarizing all the provided actions, written in the first person.
- "summarized_observations": A few precise sentences summarizing all the provided observations, written in the third person.
Collaborator:

Just curious: what's the motivation behind "first person" for action and "third person" for observations?

Author:

We still need to do some work on improving the prompt. This was adapted from #2021. My feeling was that actions are performed by the agent, so it would make more sense to use "first person" for the action summary, since this gets passed to the agent in the prompt, and similarly "third person" for the observations.
I didn't pay much attention to it at the time, but now that you've pointed it out, maybe we can avoid specifying the person in the new prompt, as this summary message has role = 'user' when passed to the LLM.

Collaborator:

My intuition is it doesn't matter at all, thus the ask.

Collaborator:

The CodeAct prompt is structured as a dialogue, following a one-message-from-the-user, one-from-the-agent pattern. The former version split this result into two messages sent to the LLM, following the same pattern. It went like this:

role='user'
...The task is ... 

role='agent'
For this task, I did this and I did that and the other...
# action-like

role= 'user'
The code was ... and the assistant had....
# obs-like

In this PR, we're sending them both in the same role='user' message, so it should probably all read like an observation. It will also be merged into the user task, e.g. it will end up like:

role='user'
The task is... 
The assistant did this and that...
The code was... and...

SmartManoj added a commit to SmartManoj/Kevin that referenced this pull request Aug 15, 2024
SmartManoj added a commit to SmartManoj/Kevin that referenced this pull request Aug 15, 2024
github-actions bot:

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 30, 2024
@enyst enyst removed the Stale Inactive for 30 days label Sep 30, 2024
Successfully merging this pull request may close these issues.

Memory Management & Context Condense for CodeAct Agent