
server: stop generation at n_ctx_train if n_predict is not set #6638

Merged
merged 7 commits into master from hp/server/avoid-infinite-loop
Apr 26, 2024

Conversation

@phymbert
Collaborator

phymbert commented Apr 12, 2024

Context

If the model hallucinates (EOS-less generation), the server will go into an infinite loop when n_predict is not set.

It is a wrong usage of the server or of the model.

But as it brings confusion, I propose to stop the generation at the context size the model was trained with (n_ctx_train) when self-extend context is disabled.
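
For illustration only, a minimal sketch of the proposed guard (the struct and function names here are simplified stand-ins, not the actual server.cpp code): if the request does not set n_predict and group-attention self-extend is disabled (ga_n == 1), decoding stops once the slot has generated n_ctx_train tokens.

#include <cstdint>
#include <cstdio>

// Simplified stand-ins for the server slot state (hypothetical names).
struct server_slot {
    int32_t n_predict = -1; // -1: not set in the request
    int32_t n_decoded = 0;  // tokens generated so far for this slot
};

// ga_n == 1 means group-attention self-extend is disabled.
bool should_stop_generation(const server_slot & slot, int32_t n_ctx_train, int32_t ga_n) {
    if (slot.n_predict < 0 && ga_n == 1 && slot.n_decoded >= n_ctx_train) {
        fprintf(stderr, "warn: n_predict is not set, limiting generation to n_ctx_train = %d tokens\n",
                (int) n_ctx_train);
        return true;
    }
    return false;
}

int main() {
    server_slot slot;
    slot.n_decoded = 2048;
    // Mirrors the warning in the test log below: n_ctx_train = 2048, ga_n = 1.
    return should_stop_generation(slot, /*n_ctx_train=*/2048, /*ga_n=*/1) ? 0 : 1;
}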

Tests

server --hf-repo ggml-org/models --hf-file phi-2/ggml-model-q4_0.gguf --model phi-2-q4_0.gguf --parallel 1 --ctx-size 4096 --log-format text -ngl 33
curl http://localhost:8080/completion --data '{"prompt": "hallucinate"}'

WARN [ update_slots] n_predict is not set and self-context extend is disabled. Limiting generated tokens to n_ctx_train to avoid EOS-less generation infinite loop | tid="127864424759296" timestamp=1713526370 params.n_predict=-1 slot.n_predict=-1 slot.n_decoded=2048 n_slots=1 n_ctx=4096 n_ctx_train=2048 ga_n=1

@ggerganov
Owner

Before we make the change, we should see if the update_slots error is really reproducible. If it is, it's a bug that we first need to fix. If we merge the change now, we might hide the underlying issue.

@FSSRepo
Collaborator

FSSRepo commented Apr 13, 2024

@ggerganov I have created a web application to stress-test the server and see how it handles multiple clients sending random questions and documents simultaneously. I tested it with four clients using Mixtral 8x7B q8_0 on 3x RTX 3090 for one hour, and the server didn't encounter any issues.

master:

[Screenshot 2024-04-12 154055]

But the gg/flash-attn branch sometimes generates NaNs in the query tensor, which are not produced by the flash attention kernel itself (they appear before the flash attention kernel runs).

[Screenshot 2024-04-09 222833]

I will keep investigating in the meantime.

@phymbert
Collaborator Author

Agreed. I have been running performance and capacity tests for over two months, and there is no such bug. The server is stable and production ready.

@ochafik
Collaborator

ochafik commented Apr 13, 2024

Alongside an optional cap, I think we should make the server stop generating when the connection is closed for whatever reason (clients may well have a timeout or interrupt things manually, but the server keeps going / stays busy needlessly). Maybe an interrupt check callback to call before generating each token?
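
A rough sketch of that idea (hypothetical names, not an existing server API): the HTTP layer flips a flag when the client disconnects, and the generation loop polls a callback before producing each token.

#include <atomic>
#include <functional>

using interrupt_cb = std::function<bool()>;

// Generation loop that gives up as soon as the callback reports the client
// is gone, instead of keeping the slot busy.
int generate(int n_max_tokens, const interrupt_cb & is_cancelled) {
    int n_generated = 0;
    for (int i = 0; i < n_max_tokens; ++i) {
        if (is_cancelled && is_cancelled()) {
            break; // connection closed: stop decoding and free the slot
        }
        // ... sample and emit the next token here ...
        ++n_generated;
    }
    return n_generated;
}

int main() {
    std::atomic<bool> disconnected{false};
    // The HTTP layer would set `disconnected` when the socket closes.
    return generate(16, [&] { return disconnected.load(); }) == 16 ? 0 : 1;
}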

@phymbert
Collaborator Author

Yeah, it is identified in:

@phymbert
Collaborator Author

Before we make the change, we should see if the update_slots error is really reproducible.

We can conclude that the user was using an old version. That's it.

@phymbert
Collaborator Author

phymbert commented Apr 19, 2024

@ggerganov, in the end I would prefer not to go this way, but rather to stop the generation at n_ctx with a warning, instead of printing a warning on every request when n_predict is not set.

@phymbert phymbert marked this pull request as draft April 19, 2024 07:10
@ggerganov
Owner

Ok. I'm not sure we have to do anything at this point - it seems the latest version works OK.

@slaren
Collaborator

slaren commented Apr 19, 2024

There should still be some limit to avoid getting into an infinite loop in the server.

@phymbert phymbert changed the title from "server: cap n_predict if not set to avoid infinite loop" to "server: stop generation at n_ctx_train is n_predict is not set" on Apr 19, 2024
@phymbert phymbert marked this pull request as ready for review April 19, 2024 11:34
@phymbert phymbert requested a review from slaren April 19, 2024 11:35
@phymbert
Collaborator Author

phymbert commented Apr 19, 2024

@ggerganov @slaren please have a look at this proposal.

@phymbert phymbert changed the title from "server: stop generation at n_ctx_train is n_predict is not set" to "server: stop generation at n_ctx_train if n_predict is not set" on Apr 19, 2024
@slaren
Collaborator

slaren commented Apr 19, 2024

When this happens, the response of /completion has these fields:

  "truncated": true,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": false,

I am not familiar with the meaning of each of these flags. Should this be different? Maybe stopped_limit should be true to indicate that a limit was hit?
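
A small sketch of that suggestion (simplified, hypothetical names), in line with the follow-up commit below that sets the stop limit to true: when a token limit ends the generation, surface it via stopped_limit so a client can tell it apart from an EOS or stop-word termination.

struct slot_result {
    bool truncated     = false;
    bool stopped_eos   = false;
    bool stopped_word  = false;
    bool stopped_limit = false;
};

enum class stop_reason { eos, stop_word, limit };

// When the generation ends because a token limit was hit (n_predict or the
// n_ctx_train cap), report it through stopped_limit instead of leaving all
// of the stop flags false.
void record_stop(slot_result & res, stop_reason reason) {
    switch (reason) {
        case stop_reason::eos:       res.stopped_eos   = true; break;
        case stop_reason::stop_word: res.stopped_word  = true; break;
        case stop_reason::limit:     res.stopped_limit = true; break;
    }
}

int main() {
    slot_result res;
    record_stop(res, stop_reason::limit);
    return res.stopped_limit ? 0 : 1;
}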

server: infinite loop: set stop limit to true
@phymbert phymbert force-pushed the hp/server/avoid-infinite-loop branch from 21702de to 1423fce on April 19, 2024 11:58
examples/server/server.cpp (review comment, outdated and resolved)
@slaren
Collaborator

slaren commented Apr 19, 2024

Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.

@phymbert
Collaborator Author

Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.

Yeah, that was the first version, but I feel it is noisy to log this warning on each request: 6fd5ad5

@slaren
Collaborator

slaren commented Apr 19, 2024

That's not exactly what I mean. Basically I would just change the default to n_ctx_train (or other value) in the line that sets slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);. No need to print any warnings, just document it. The user can check stopped_limit or set a different limit if needed.
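
For illustration, a self-contained sketch of that one-line default change; the request type and the json_value helper here are simplified stand-ins for the server's own.

#include <cstdint>

// Minimal stand-in for the server's json_value(data, key, fallback) helper.
struct request_data { int32_t n_predict = 0; bool has_n_predict = false; };

int32_t json_value(const request_data & data, const char * /*key*/, int32_t fallback) {
    return data.has_n_predict ? data.n_predict : fallback;
}

int32_t pick_n_predict(const request_data & data, int32_t n_ctx_train) {
    // Before: the fallback was default_params.n_predict (-1, i.e. no limit),
    // which lets an EOS-less generation run forever.
    // Suggested: fall back to n_ctx_train so there is always a hard limit.
    return json_value(data, "n_predict", n_ctx_train);
}

int main() {
    request_data data; // request without an explicit n_predict
    return pick_n_predict(data, /*n_ctx_train=*/2048) == 2048 ? 0 : 1;
}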

@phymbert
Collaborator Author

I see, I am OK with both solutions, even if setting n_predict all the time will be a bit of a breaking change.

AFAIK not all models hallucinate, and not on every completion; plus, a model should normally always emit the EOS token if the chat template it was trained with is used on the chat completion endpoint.

@ggerganov up to you, but we need to put this recurring infinite-loop concern to rest somehow.


@slaren
Collaborator

slaren commented Apr 19, 2024

This would be simple if context shifting was opt-in, then there would always be a hard limit of n_ctx tokens. I am not sure that enabling context shift by default and without a way to disable it is a good idea, it can result in very poor generation quality and it is not the normal behavior for LLM providers, as far as I know.

@phymbert
Collaborator Author

This would be simple if context shifting was opt-in, then there would always be a hard limit of n_ctx tokens. I am not sure that enabling context shift by default and without a way to disable it is a good idea, it can result in very poor generation quality and it is not the normal behavior for LLM providers, as far as I know.

Oh yes, and it is so slow in the current implementation, blocking the whole server.

@phymbert phymbert marked this pull request as draft April 21, 2024 18:28
@phymbert phymbert marked this pull request as ready for review April 21, 2024 18:29
@phymbert
Collaborator Author

@ggerganov up to you, but we need to put this recurring infinite-loop concern to rest somehow.

@ggerganov I think that with the removal of hard-coded stop tokens, this PR is becoming more important.

@ggerganov
Owner

Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.

Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR)

@phymbert
Collaborator Author

Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.

Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR)

@ggerganov @slaren In the end I prefer to keep checking at each token that we do not exceed n_ctx_train, because one can simply pass n_predict = -1 in the request payload and the server would still go into an infinite loop. I feel this approach is safer, with a proper warning.
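
Going by the commit messages referenced below ("move in process_token", "include prompt tokens in the EOS limit", "set stop limit to true"), the merged behaviour amounts to a per-token check roughly along these lines; the names are simplified and this is a sketch, not the exact server.cpp code.

#include <cstdint>
#include <cstdio>

struct server_slot {
    int32_t n_predict       = -1; // -1: not set (or explicitly -1) in the request
    int32_t n_prompt_tokens = 0;
    int32_t n_decoded       = 0;
    bool    stopped_limit   = false;
};

// Called for every generated token: even if the client passed n_predict = -1,
// the total of prompt plus generated tokens is capped at n_ctx_train and the
// stop is reported via stopped_limit.
bool process_token_limit(server_slot & slot, int32_t n_ctx_train) {
    if (slot.n_predict < 0 && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) {
        slot.stopped_limit = true;
        fprintf(stderr, "warn: n_predict is not set, stopping at n_ctx_train = %d to avoid an infinite loop\n",
                (int) n_ctx_train);
        return false; // stop generating
    }
    return true; // keep generating
}

int main() {
    server_slot slot;
    slot.n_prompt_tokens = 8;
    slot.n_decoded       = 2040;
    return process_token_limit(slot, /*n_ctx_train=*/2048) ? 1 : 0;
}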

@phymbert phymbert merged commit 7f5ff55 into master Apr 26, 2024
61 of 64 checks passed
@phymbert phymbert deleted the hp/server/avoid-infinite-loop branch April 26, 2024 10:15
phymbert added a commit that referenced this pull request Apr 26, 2024
phymbert added a commit to phymbert/llama.cpp that referenced this pull request Apr 26, 2024
phymbert added a commit that referenced this pull request Apr 27, 2024
…n_predict (#6935)

* ci: server: fix python env

* ci: server: fix server tests after #6638

* ci: server: fix windows is not building PR branch
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
…gerganov#6638)

* server: cap n_predict if not set to n_ctx_train

* server: fix infinite loop

* server: infinite loop, move in process_token
server: infinite loop: set stop limit to true

* minor: spaces

* minor: spaces

* server: include prompt tokens in the EOS limit
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
…n_predict (ggerganov#6935)

* ci: server: fix python env

* ci: server: fix server tests after ggerganov#6638

* ci: server: fix windows is not building PR branch