SaveState / LoadState not working on 8-bit quantized gguf models #260
Not sure if it's working for other model types; I'm only testing on 8-bit models right now, so it might be a wider bug. (Specifically, this happens for me with openchat_3.5.Q8_0.gguf.)

I'm using the following parameters:

Calling InteractiveExecutor.SaveState produces a JSON file with the correct tokens (you can pass them to the tokenizer to see them), among other values. Then calling InteractiveExecutor.LoadState on a new instance just causes it to spit out random garbled text that is not even coherent sentences. The same problem happens with GetStateData() and LoadState as well.

Btw, I'm using LLamaSharp 0.5.1 and the Cuda11 backend.
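For reference, a minimal sketch of the reported repro, assuming the file-based executor state API named in this issue (InteractiveExecutor.SaveState / LoadState); the surrounding setup calls are illustrative and their exact signatures may differ across LLamaSharp versions:

```csharp
using LLama;
using LLama.Common;

// Hypothetical setup; the path and parameters are placeholders.
var parameters = new ModelParams("openchat_3.5.Q8_0.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);

// First executor: run a conversation, then save its state to a file.
var executor = new InteractiveExecutor(weights.CreateContext(parameters));
// ... run some chat turns ...
executor.SaveState("state.json"); // reportedly contains the correct tokens

// Second executor on a fresh context: load the saved state.
var executor2 = new InteractiveExecutor(weights.CreateContext(parameters));
executor2.LoadState("state.json"); // reportedly produces garbled output afterwards
```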
Comments

Could you try this with the newer 0.7.0 release, to confirm if it's still an issue? Thanks.
Yes, it still reproduces in 0.7.0. But also, inference is about 10x slower on my machine than in 0.5.1 with the Cuda11 backend (using 0.7.0 for both LLamaSharp and LLamaSharp.Backend.Cuda).
The speed problem is a known issue; fortunately, we already have a fix merged into master for that! We'll be making a release soon, I expect. As for the state problem, I'll have a look into it. There's been a huge change in the llama.cpp internals which has probably broken state handling somehow.
Hi, I'm working on this issue but I cannot reproduce it with openchat_3.5.Q8_0.gguf. Could you please provide a piece of code and some tips to reproduce it with the master branch? Note that the model may have been updated since you opened this issue; please update the model: https://huggingface.co/TheBloke/openchat_3.5-GGUF/blob/main/openchat_3.5.Q8_0.gguf
I'll attempt to repro this tonight if I can.
Looks like it's working for saving to and loading from a file, but not for GetStateData() / LoadState(). It seems some parameter in there is not getting updated during the state load, so old memory hangs around.
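For context, the in-memory variant being described might look like this sketch, continuing the setup from the sketch above (method names as used in this thread; exact signatures vary between LLamaSharp releases):

```csharp
// Capture the executor state in memory rather than on disk.
var state = executor.GetStateData();

// Restore it onto a new executor over a fresh context.
var executor2 = new InteractiveExecutor(weights.CreateContext(parameters));
executor2.LoadState(state); // old memory reportedly "hangs around" here
```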
Thanks, I'll try to reproduce it with GetStateData() / LoadState(). :)
Actually, I spoke too soon... I had an hour- or two-long conversation using SaveState. But when I tried to load a save file from earlier in the session using LoadState, the model still seemed to have memory from later in the conversation.

I would expect each state to be self-contained and to not bleed through and contaminate other states, so that when you load a state, it's the only thing loaded into the model and it doesn't have any other "memory" in it. Is that assumption incorrect? (If so, how can I achieve isolated behavior?)

For example, if I ran the program from scratch and loaded that state, everything was fine and it only had the conversation up to that point (but loading it later sometimes left other info in its memory somehow). Not sure if that makes any sense. It might be a different bug, I'm not sure. At any rate, this is a huge improvement over before, as it's at least kind of working now... sometimes... but it's still not 100% working IMHO.
Sorry for the late reply, I didn't notice your message at the time. Is the following case what you mean?

model.Chat("xxx");         // first chat
model.SaveState("state1"); // save the state with some chat history
model.Chat("xxx");         // second chat
model.SaveState("state2"); // save the state again
model.LoadState("state2"); // you only want the memory during the second chat
Besides, the latest version is 0.10.0 now. :)
Closing this issue as inactive. Please feel free to comment here if the problem still reproduces.