
Integrating MemGPT-like Functionality #2937

Draft: khushvind wants to merge 42 commits into base main

Conversation

@khushvind khushvind commented Jul 15, 2024

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Fix #1748

This PR proposes an implementation that integrates MemGPT-like functionality into OpenDevin's ShortTermMemory, inspired by issue #2487. It also uses ideas from #2021.

Give a brief summary of what the PR does, explaining any non-trivial design decisions
The memory is summarized (condensed) when the token count approaches the context window limit but has not yet hit it. All events in the event stream are considered summarizable, except the system and example messages. When the condenser function is called, it tries to summarize events such that at least the first 75% of the total tokens get summarized. After this, the summary is added to the history. Events are summarized in order of occurrence, i.e., the earliest events are summarized first. The word limit for the summary is set to <=200 by default.

Messages are passed to the LLM in this order → (system message + example message + summary + all events after the last summarized event)

Currently, this functionality is only implemented for the CodeAct agent.
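
For concreteness, here is a minimal sketch of the cutoff logic described above; every name in it (Message, count_tokens, pick_cutoff) is an illustrative stand-in, not an identifier actually used in this PR.

```python
# Minimal, self-contained sketch of the 75%-cutoff idea; names are illustrative only.
from dataclasses import dataclass

@dataclass
class Message:
    content: str
    condensable: bool = True   # system / example messages would be created with False

def count_tokens(text: str) -> int:
    # Crude stand-in for the real tokenizer-based token count.
    return len(text.split())

def pick_cutoff(messages: list[Message], fraction: float = 0.75) -> int:
    """Index up to which the oldest condensable messages cover at least ~75% of tokens."""
    condensable = [m for m in messages if m.condensable]
    total = sum(count_tokens(m.content) for m in condensable)
    covered, cutoff = 0, 0
    for i, m in enumerate(condensable):
        covered += count_tokens(m.content)
        cutoff = i + 1
        if covered >= fraction * total:
            break
    return cutoff

# Everything before `cutoff` would be summarized (word limit <= 200) and replaced by a
# single summary; the prompt becomes: system + examples + summary + remaining events.
```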

return self.action_parser.parse(response)

def condense(
self,
state: State,
Contributor:

If you move this .condense function to the LLM class, you can make messages the input argument. We might need to make each Message a Pydantic class with an additional attribute (e.g., condensable). This may also benefit the ongoing effort to add vision capability in #2848 (comment) (cc @Kaushikdkrikhanu).
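
A rough sketch of what such a Pydantic Message could look like; only the condensable attribute comes from the suggestion above, the other fields and the helper method are assumptions for illustration.

```python
from pydantic import BaseModel

class Message(BaseModel):
    role: str                  # 'system' | 'user' | 'assistant' (assumed field)
    content: str               # assumed field
    condensable: bool = True   # system and example messages would set this to False

    def to_llm_dict(self) -> dict:
        # Shape actually sent to the completion API; extra fields stay internal.
        return {'role': self.role, 'content': self.content}
```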

Author:

This can be done. But on moving .condense to the LLM class, it won't be able to add the summary to state.history directly. I might also need to pass state as an arg to llm.completion along with messages.

If we are trying to make it cleaner (#2937 (comment)), I was instead thinking of moving .condense to the MemoryCondenser class, after making each Message a Pydantic class. That's where I think .condense should belong 😅.
WDYT? Which approach seems better?

Contributor:

Huh, good point... maybe the caller of the LLM should make sure the summary gets added to the State, like what we are doing now?

Contributor @xingyaoww commented Jul 19, 2024:

On second thought: maybe State is not the place to add these new summarizing actions; EventStream should be: https://github.com/OpenDevin/OpenDevin/blob/c555fb68400bb117011cda3e5d3d95beb5169000/opendevin/controller/agent_controller.py#L395

Maybe you could add an optional argument to the LLM class, like condense_callback_fn, which will be called with the condensed Summarize action whenever a condensation happens. That callback would then add the Summarize action to the event stream?
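
One possible shape of that wiring, as a hypothetical sketch; the real LLM and EventStream APIs may differ.

```python
# Hypothetical sketch of the suggested callback wiring; not the actual LLM class API.
from typing import Callable, Optional

class LLM:
    def __init__(self, condense_callback_fn: Optional[Callable] = None):
        # Called with the AgentSummarizeAction each time a condensation happens.
        self.condense_callback_fn = condense_callback_fn

    def condense(self, messages: list) -> object:
        summary_action = self._summarize(messages)   # builds an AgentSummarizeAction
        if self.condense_callback_fn is not None:
            self.condense_callback_fn(summary_action)
        return summary_action

    def _summarize(self, messages: list) -> object:
        ...  # call the model with the summarization prompt

# Controller side (sketch): the callback is what adds the action to the event stream,
# e.g. LLM(condense_callback_fn=lambda action: event_stream.add_event(action, source)).
```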

Author:

Can you elaborate on this? How does adding summarizing actions to the EventStream help?
MemGPT stores only the summary of events from the start up to some event.id. We only need to store one summary in the history. A summary event remains useful until we create a new summary, after which it becomes useless.

Btw, I've moved the .condense functions to the LLM class. The MemoryCondenser class used methods of the LLM class, so I removed it entirely and moved one of its methods to the LLM class. This also achieves the target (#2937 (comment)).

Contributor:

Basically, going forward, we hope to make EventStream a single source of truth for all our past messages / history (i.e., you get all the history messages from the event stream), so ideally all your condensation actions should also go into the event stream.

But based on my understanding, your current implementation already emits "AgentSummarizeAction"? I think we just need to make sure that action gets stored inside the event stream, and we should be good to go on this bullet.

Collaborator:

@xingyaoww Oh interesting, I played with summaries in the event stream at some point. A couple of thoughts:

  • are they 'real events'? That is, if emitted with the same mechanism as the others (which they currently are not), would they be propagated downstream to listening components?
  • or, since they replace at runtime the real events they summarize, are they more like 'shadow events'? In which case they're not the truth, but a fleeting version of it, necessary only due to some runtime constraints; the source of truth holds the real events and is able to retrieve them when needed.

Ref: https://opendevin.slack.com/archives/C06QKSD9UBA/p1716737811950479?thread_ts=1716668475.888269&channel=C06QKSD9UBA&message_ts=1716737811.950479

  • they need to be saved, I assume, even if it's one at a time, so they can be retrieved on a disconnect or restart/restore etc. Do we save them along with the other events, or do we save them in State? If we save them in the event stream, then we may want to consider moving some filtering logic too.

On a side note, making it only one per run simplifies the logic either way.

Contributor @xingyaoww left a review comment:

Looks better overall! But I think there are still a few places that we can clean up :)

]
)
]
+ message_sequence_to_summarize[cutoff:]
Collaborator:

This doesn't seem to work well; we want to convert all messages to strings, but got:

[AgentSummarizeAction(summarized_actions='I asked the user about the task, provided a concise description of the Moon, and wrote a Bash script to solve the Three Queens problem.', summarized_observations='The user requested a concise description of the Moon and a Bash script for the Three Queens problem. The script execution encountered issues, resulting in exit codes -1 and 1.', action='summarize'), <opendevin.llm.messages.Message object at 0x17a962030>, <opendevin.llm.messages.Message object at 0x17a961ac0>, <opendevin.llm.messages.Message object at 0x17a962330>, <opendevin.llm.messages.Message object at 0x17a962000>, <opendevin.llm.messages.Message object at 0x17a961b80>, <opendevin.llm.messages.Message object at 0x17a961f10>, <opendevin.llm.messages.Message object at 0x17a962270>, <opendevin.llm.messages.Message object at 0x17a961e20>, <opendevin.llm.messages.Message object at 0x17a9621b0>, <opendevin.llm.messages.Message object at 0x17a961e80>, <opendevin.llm.messages.Message object at 0x17a961be0>, <opendevin.llm.messages.Message object at 0x17a961c40>, <opendevin.llm.messages.Message object at 0x17a961ee0>, <opendevin.llm.messages.Message object at 0x17a961e50>, <opendevin.llm.messages.Message object at 0x17a9615b0>, <opendevin.llm.messages.Message object at 0x17a961d60>, <opendevin.llm.messages.Message object at 0x17a9611f0>, <opendevin.llm.messages.Message object at 0x17a961f40>, <opendevin.llm.messages.Message object at 0x17a962e40>]

I think it happens when it's called a second time, recursively.
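
In case it helps, a hypothetical normalization step along these lines would keep the second pass uniform; the Message stand-in below is illustrative, and only the summarized_actions / summarized_observations field names are taken from the repr above.

```python
from collections import namedtuple

# Illustrative stand-in; the real Message / AgentSummarizeAction classes live in opendevin.
Message = namedtuple('Message', ['role', 'content'])

def to_message(item):
    """Fold an AgentSummarizeAction into a plain Message so a second, recursive
    summarization pass only ever sees objects it knows how to stringify."""
    if isinstance(item, Message):
        return item
    text = (f'Summary of earlier actions: {item.summarized_actions}\n'
            f'Summary of earlier observations: {item.summarized_observations}')
    return Message(role='user', content=text)

# Usage sketch against the reviewed code's variable:
# message_sequence_to_summarize = [to_message(m) for m in message_sequence_to_summarize]
```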

Author:

Yes, it happens in the second recursion. I made some corrections.
When I try it out, I am not able to trigger this if statement. How are you doing it? 😅

Collaborator:

I have a large context (11-12k tokens) compared to the max_input_tokens=4000 value. I suspect that's why it triggers for me, so you may want to play with the agent, keep adding to the context, then switch to a much lower setting. This 12k is triple the 4k limit. (So the 0.75 factor used in this PR, even if we subtract the initial prompt, is left waaaay behind. 😅)

If it helps, here is my event stream. It's not great for testing in general, TBH, but I suspect it's good enough for this issue simply because of that size difference.
eventstream_main.zip

You may want to use all of these settings, like the file store (to save the events in the stream locally on disk; it will create a "main" sub-directory for them) and CLI sessions enabled. Then replace the "main" sub-directory in the file store with the contents here.

I should note, I didn't use the UI. If you're using the UI, you can do the same; you just need to set the JWT token and use the sub-directory named with the right session id instead of the "main" sub-directory. Frankly, I've never tried to restore a CLI session into a UI session, but it should work(tm). Most events are the same, obviously; the UI might just have some extra INIT events, which shouldn't matter for restore.


candidate_messages_to_summarize = []
tokens_so_far = 0
for message in messages:
Collaborator @enyst commented Jul 29, 2024:

We gave up on using State, but then we still need to keep the user message that is most likely to be the task. That is computed here: https://github.com/OpenDevin/OpenDevin/blob/84a6e90dc2e8b1e096af33f7545fa1969853c7d4/opendevin/controller/state/state.py#L169

Maybe we can set it as non-condensable in the agent, before we get here? Otherwise we're losing relevant information, I think.
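
For illustration, a rough sketch of pinning the likely task message, assuming a condensable flag on messages; get_current_user_intent() is the State method linked above, the rest is hypothetical.

```python
# Hypothetical helper; assumes messages carry role/content/condensable attributes.
def protect_task_message(state, messages) -> None:
    task = state.get_current_user_intent()   # the user message most likely to be the task
    for message in messages:
        if message.role == 'user' and message.content == task:
            message.condensable = False      # never fold the task itself into a summary
```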

Author:

Yes, something needs to be done about user messages. It is currently losing relevant information. But only marking user messages as non-condensable is not sufficient, I guess? Some user messages might not contribute much relevant information about the final task.

Currently summarize_messages() has summarized_actions & summarized_observations as args. I was thinking, maybe adding one more arg, summarized_user_messages, might also help. This would summarize all the user messages so far.

Collaborator @enyst commented Jul 30, 2024:

> It is currently losing relevant information. But only marking user messages as non-condensable is not sufficient, I guess? Some user messages might not contribute much relevant information about the final task.

I suggest not doing it for all user messages, but only for the most likely task. The method linked above, get_current_user_intent(), returns only one user message: the one most likely to be the current task. If we set that one as non-condensable, it will already improve the behavior. For SWE-bench, that method will return the initial user message, which is definitely the task itself.

For interactive user sessions, you're right, it's more complicated. The method tries to guess the best user message and returns the last one after a FinishAction, so it's the user message just after a task ended and a new one started. That's a best guess, of course, that the new task is in that message, but IMHO it will do for now. Do you think it's likely it won't work?

> I was thinking, maybe adding one more arg, summarized_user_messages, might also help. This would summarize all the user messages so far.

I see what you mean. In the previous version (PR 2021), we were summarizing the user messages in a summary of their own, as you suggest, when there were no more agent messages, so at a later stage, but again except the current user intent. The current intent as detected (or attempted 😅) remained uncondensed as long as possible. Doing something like this might still be a good option, but I don't think it's necessary for this PR.

The important part, it seems to me, is that if we miss the initial user message, then we risk that the agent will not have enough information about the task to perform correctly.

Collaborator @enyst commented Jul 30, 2024:

@khushvind This is amazing work, thank you! We definitely need something like this, and it's a smart algo.

I'm sorry I didn't get to look into it very thoroughly yet, but I did give it a try, and it seems there are a few issues. If you want, for replicating, I can give you the event stream I used; or, my scenario was something like: use a model with a relatively large context window (I used gpt-4o with 128k), make sure sessions are enabled [1], do stuff, then when the history gets to some 11k, stop it, configure max_input_tokens to 4k, then restart (or just change max_input_tokens to 4k in the debugger).

This provokes it not only to apply summarization, but also to do it consecutively before getting to a regular prompt, and it seems to behave unexpectedly in part: it loses information. Also, it doesn't seem to save the summary (my run ended with an error, but I'm not sure I see in the code where it saves them, even if it ends well?).

I can play with it more tomorrow I hope. Thank you for this!

[1] If you use OpenDevin via the UI, saving JWT_TOKEN=something in the env should do; just in case, enable file_store="local" with some file_store_path too in config.toml. If you use it with the CLI, set enable_cli_sessions=true. I find that enabling/restoring sessions is a huge time saver for this kind of stuff, for... reasons. 😅

Collaborator @enyst commented Jul 30, 2024:

Some thoughts, since we're at it: I think there are two major areas which have negatively affected performance in previous work.

  • the prompt. It didn't always result in detailed enough information to guide the LLM for the next steps. This PR mostly uses the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give the LLM clear info about which files it shouldn't check anymore and which files it should continue working on. I don't remember if this exact version was doing any of that reliably, but anyway, I wonder if we could adapt something more from MemGPT here? I'd favor completely ripping it out and replacing it, if we can.
  • missing Recall action. This PR, like 2021, never retrieves the original events, so it must repeat them if it needs them. This behavior in PR 2021 happened because that PR had gotten too large; I had to stop somewhere and left recall/long-term memory for follow-up PRs. It might not have been the best idea ever, because this feature is incomplete without it. Also, MemGPT obviously relies on retrievals all the time (99% of all its memory work is recalling originals). That's totally as it should be, and we do plan to do that, but the question is: is it possible to be good enough without it? Cc: @xingyaoww

FWIW, in the SWE-bench evals on PR 2021, I saw some tasks where the LLM was just restarting, as if it knew nothing. It wasn't using the summary! I think this was due to both the prompt and recall. Since this PR is condensing 75% of the messages, leaving a few at the end, I think we won't see that anymore. Nevertheless, TBH, if we still don't include recall, we do need to get the prompt very right...

I found some decent tasks on swe-bench for this, I'm going to retrieve them and run them on this PR. I think I'll also add some recall and see how it goes then.

@khushvind khushvind marked this pull request as draft July 30, 2024 18:11
Contributor @xingyaoww:
@khushvind Thanks for your amazing work -- Let me know if this is ready for review again!

I have a couple of ideas regarding how we can develop this:

  1. We should develop a benchmark / a way to evaluate a "memory condensation algo" quantitatively.

    • it could be something like a repurposed version of Needle in a Haystack. For example, the input is a very long context prompt (128k tokens) with an instruction of "summarize everything, with a goal of XXX (that is related to the needle)" (see the sketch after this list)
    • re-use benchmarks from MemGPT (https://arxiv.org/pdf/2310.08560) to make sure we are getting comparable QA performance (using our condensation algo)
    • running SWE-bench with condensation enabled (sort of an ultimate metric); I think getting the first two for eval is probably easier
  2. I think this PR is probably ready to merge (but not enabled by default) once we can get performance comparable to MemGPT using the benchmarks in their paper.

  3. Once we achieve comparable or better performance with condensation on SWE-bench, I think we can enable it by default.
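
A purely hypothetical harness for the Needle-in-a-Haystack idea from the first bullet; condense_fn stands in for whatever summarization entry point this PR ends up exposing.

```python
# Hypothetical eval harness; condense_fn and the eval-set shape are assumptions.
def needle_recall(condense_fn, needle: str, haystack: list[str]) -> bool:
    """Summarize a long history and check whether the needle survives in the summary."""
    summary_text = condense_fn(haystack)      # expected to return the summary as plain text
    return needle.lower() in summary_text.lower()

# Score over an eval set (sketch):
# recall = sum(needle_recall(condense_fn, n, h) for n, h in eval_set) / len(eval_set)
```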

}
}
Make sure to include in observations any relevant information that the agent needs to remember.
%(events)s
Collaborator:

This is a placeholder in the template string, for an events key to fill in. The code can use it like:

events_dict = {'events': 'the data in summary input'}
formatted_string = SUMMARY_PROMPT_SYSTEM % events_dict

For example https://github.com/OpenDevin/OpenDevin/blob/21ea9953b3026223f65a04cf5699d6ad0915bfa7/agenthub/planner_agent/prompt.py#L167

We can do something similar here, or we can remove this placeholder.

Author @khushvind:
@xingyaoww I still have to run some evaluations on SWE-bench to check whether it works fine on different examples. I also need to incorporate some changes that @enyst suggested. I'll let you know once it's done.
I am not able to run evaluations because, for some reason, my WSL started crashing yesterday 😅. I guess I'll have to do the setup again, which might take some time.

I'll take a look at these benchmarks by the weekend.

Author @khushvind:
@enyst

> It didn't always result in detailed enough information to guide the LLM for the next steps.

The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that's not enough to convey sufficient information in the summary after a lot of events. I need to check this.

> This PR mostly uses the old prompt from PR 2021, and I'm hoping we can figure out something better. At least, I think we need it to give the LLM clear info about which files it shouldn't check anymore and which files it should continue working on. I don't remember if this exact version was doing any of that reliably, but anyway, I wonder if we could adapt something more from MemGPT here? I'd favor completely ripping it out and replacing it, if we can.

MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.

Collaborator @enyst commented Aug 1, 2024:

> @xingyaoww I still have to run some evaluations on SWE-bench to check whether it works fine on different examples. I also need to incorporate some changes that @enyst suggested. I'll let you know once it's done. I am not able to run evaluations because, for some reason, my WSL started crashing yesterday 😅. I guess I'll have to do the setup again, which might take some time.
>
> I'll take a look at these benchmarks by the weekend.

Just to note, I played a little with it on some tasks on SWE-bench, and so far:

  • as we already know, it needs some work around the task and the summary (on gpt-4o, which at least tries to work)
  • more surprisingly, Claude doesn't work with our current prompt: it refuses to summarize and continues the task instead (which is a little rude, if you ask me 🫢 😅)

I'm traveling today and tomorrow, we'll see soon though.

> The WORD_LIMIT might be the reason for it. It's currently set to 200, and it seems that's not enough to convey sufficient information in the summary after a lot of events. I need to check this.

Maybe, but AFAIK the version in PR 2021 was almost identical without a word limit, and it didn't seem enough on SWE-bench tasks (otherwise it works in interactive sessions, since the user will nudge it). Worth trying though.

> MemGPT's prompt also just asks the LLM to summarize the messages, without specifically asking it to keep info about which files shouldn't be checked. Maybe we can refine ours. I already included the parts of MemGPT's prompt that seemed useful in the current implementation.

That's an interestingly simple prompt! To clarify, the reason I mention files is that on SWE-bench, the LLM appeared to repeat its attempts to work with the wrong files, like it had done before, even after it had figured out the right files. 🤔 Maybe it was a fluke...

- "action": "summarize"
- args:
- "summarized_actions": A precise sentence summarizing all the provided actions, written in the first person.
- "summarized_observations": A few precise sentences summarizing all the provided observations, written in the third person.
Collaborator:

Just curious: what's the motivation behind "first person" for action and "third person" for observations?

Author:

We still need to do some work on improving the prompt. This was adapted from #2021. My feeling was that actions are performed by the agent, so it would make more sense to use "first person" for the action summary, since this gets passed to the agent in the prompt, and similarly "third person" for the observations.
I didn't pay much attention to it at the time, but now that you've pointed it out, maybe we can avoid specifying the person in the new prompt, as this summary message has role = 'user' when passed to the LLM.

Collaborator:

My intuition is it doesn't matter at all, thus the ask.

Collaborator:

The CodeAct prompt is structured as a dialogue, following a one-message-from-the-user, one-from-the-agent pattern. The former version split this result into two messages sent to the LLM, following the same pattern. It went like this:

role='user'
...The task is ... 

role='agent'
For this task, I did this and I did that and the other...
# action-like

role= 'user'
The code was ... and the assistant had....
# obs-like

In this PR, we're sending them both in the same role='user' message, so it should probably all read like an observation. It will also be merged into the user task, e.g. it will end up like:

role='user'
The task is... 
The assistant did this and that...
The code was... and...

SmartManoj added a commit to SmartManoj/Kevin that referenced this pull request Aug 15, 2024
SmartManoj added a commit to SmartManoj/Kevin that referenced this pull request Aug 15, 2024
github-actions bot:

This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 30, 2024
@enyst enyst removed the Stale Inactive for 30 days label Sep 30, 2024
Successfully merging this pull request may close these issues.

Memory Management & Context Condense for CodeAct Agent