
SaveState / LoadState not working on 8-bit quantized gguf models #260

Closed
BrainSlugs83 opened this issue Nov 6, 2023 · 11 comments
Labels: bug (Something isn't working)

BrainSlugs83 commented Nov 6, 2023

I'm not sure whether this affects other model types; I'm only testing 8-bit models right now, so it might be a wider bug. (Specifically, this happens for me with openchat_3.5.Q8_0.gguf.)

I'm using the following parameters:

var parameters = new ModelParams(@"C:\models\openchat_3.5.Q8_0.gguf")
{
    ContextSize = 8 * 1024,
    Seed = 1337,
    GpuLayerCount = 15
};

Calling InteractiveExecutor.SaveState produces a JSON file containing the correct tokens (you can pass them to the tokenizer to verify), among other values.
But then calling InteractiveExecutor.LoadState on a new instance just causes it to spit out random garbled text that doesn't even form coherent sentences.

The same problem happens with GetStateData() and LoadState as well.
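
For reference, here's roughly the flow (a trimmed sketch, not my exact code; it assumes the 0.7-era LLamaWeights / CreateContext / InteractiveExecutor API, and the paths and skipped inference step are placeholders):

using LLama;
using LLama.Common;

var parameters = new ModelParams(@"C:\models\openchat_3.5.Q8_0.gguf")
{
    ContextSize = 8 * 1024,
    Seed = 1337,
    GpuLayerCount = 15
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// ... run some inference here, then snapshot the executor to disk ...
executor.SaveState(@"C:\states\session.json"); // writes a JSON file with the tokens, etc.

// Later: a fresh executor on a fresh context, restored from the file.
using var context2 = weights.CreateContext(parameters);
var executor2 = new InteractiveExecutor(context2);
executor2.LoadState(@"C:\states\session.json");
// Expected: inference continues the saved session.
// Observed: the output is random garbled text.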

Btw, I'm using LLamaSharp 0.5.1 and the Cuda11 backend.

@martindevans (Member)

Could you try this with the newer 0.7.0 release to confirm whether it's still an issue? Thanks.

martindevans added the bug label Nov 6, 2023
@BrainSlugs83 (Author)

Yes, it still reproduces in 0.7.0. But also, inference is about 10x slower on my machine than it was in 0.5.1 with the Cuda11 backend (using 0.7.0 for both LLamaSharp and LLamaSharp.Backend.Cuda).

martindevans (Member) commented Nov 7, 2023

The speed problem is a known issue; fortunately, we already have a fix for that merged into master! I expect we'll be making a release soon.

As for the state problem, I'll look into it. There's been a huge change in the llama.cpp internals which has probably broken state handling somehow.

AsakusaRinne moved this to 📋 TODO in LLamaSharp Dev Nov 9, 2023
AsakusaRinne mentioned this issue Nov 13, 2023
AsakusaRinne moved this from 📋 TODO to 🏗 In progress in LLamaSharp Dev Nov 24, 2023
@AsakusaRinne (Collaborator)

Hi, I'm working on this issue, but I cannot reproduce it with openchat_3.5.Q8_0.gguf. Could you please provide a piece of code and some tips to reproduce it on the master branch? Note that the model may have been updated since you opened this issue, so please update to the latest file: https://huggingface.co/TheBloke/openchat_3.5-GGUF/blob/main/openchat_3.5.Q8_0.gguf

@BrainSlugs83 (Author)

I'll attempt to repro this tonight if I can.

@BrainSlugs83 (Author)

Looks like it's working for saving to and loading from a file, but not for GetStateData() / LoadState(). It seems some parameter in there is not getting updated during the state load, so old memory hangs around.
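
Roughly, the in-memory path that misbehaves (a sketch; GetStateData() / LoadState() are the members mentioned above, everything else is illustrative):

// Snapshot the executor's state into memory instead of a file.
var state = executor.GetStateData();

// ... more inference happens here ...

// Restore the snapshot onto the executor.
executor.LoadState(state);
// Expected: the executor behaves as if nothing after the snapshot happened.
// Observed: something isn't reset, so old memory hangs around.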

@AsakusaRinne (Collaborator)

Thanks, I'll try to reproduce it with GetStateData() / LoadState(). :)

BrainSlugs83 (Author) commented Dec 1, 2023

Actually, I spoke too soon...

I had a conversation lasting an hour or two using neural-chat-7b-v3-1.Q4_K_M.gguf (with a 4k context and the InteractiveExecutor)... I maxed out the tokens probably half an hour in, but it stayed coherent (is it just using a ring buffer?).

But when I tried to load a save file from earlier in the session using LoadState... well, it stayed coherent... but it still had all of the recent conversation in memory. -- So that seems like a fail to me.

I would expect each state to be self-contained, not to bleed through and contaminate other states -- so that when you load a state, it's the only thing loaded into the model, and the model has no other "memory" in it. Is that assumption incorrect? (If so, how can I achieve isolated behavior?)
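
In other words, these are the semantics I'd expect (a sketch; the file name is a placeholder):

executor.SaveState("early.json");   // snapshot taken early in the session

// ... half an hour more conversation; the context fills up ...

executor.LoadState("early.json");   // restore the early snapshot
// Expectation: only the conversation up to the snapshot is in memory now.
// Observed: the recent conversation sometimes still bleeds into replies.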

For example, if I run the program from scratch and load that state, everything is fine, and it only has the conversation up to that point (but loading it later in a session sometimes left other info in its memory somehow). Not sure if that makes any sense. It might be a different bug, I'm not sure.

At any rate, this is a huge improvement over previous versions, as it's at least kind of working now... sometimes... but it's still not 100% working, IMHO.

@AsakusaRinne (Collaborator)

Sorry for the late reply; I didn't notice your message at the time.

Is the following case what you mean?

model.Chat("xxx"); // first chat
model.SaveState("state1"); // save the state for some chat histories
model.Chat("xxx"); // second chat
model.SaveState("state2"); // save the state again

model.Load("state2") // You only want the memory during the second chat.

@AsakusaRinne (Collaborator)

By the way, the latest version is 0.10.0 now. :)

AsakusaRinne moved this from 🏗 In progress to ✅ Done in LLamaSharp Dev May 13, 2024
@AsakusaRinne (Collaborator)

Closing this issue as inactive. Please feel free to comment here if the problem still reproduces.
