
Server: Don't ignore llama.cpp params #8754

Merged
merged 2 commits into ggerganov:master on Aug 4, 2024

Conversation

@ardfork (Contributor) commented Jul 29, 2024

I was wondering why I had high variance using the server API after launching llama-server with --temp 0.2. It turns out the llama.cpp server ignores that flag and even lies in the /props endpoint (showing the temp that I had set).

I found out that this was caused by PR #4668, which makes the API endpoints use OAI default values instead of llama.cpp's. I don't understand why anyone would want that; if there is a problem with llama.cpp's default parameters, they should be changed, not ignored.

@JohannesGaessler (Collaborator)

I don't understand why #4668 is linked. That PR changed the default values used, but it did not change the logic for how and when these values are set, right? The temperature was already being set to a hard-coded value, just a different one.

In any case, I think the correct way to fix this would be to still use the same default parameters as OAI but to override them with CLI arguments if the user sets them. But I'm not sure whether that can be easily implemented with the current structure.

@ardfork (Contributor, Author) commented Jul 29, 2024

> I don't understand why #4668 is linked. That PR changed the default values used, but it did not change the logic for how and when these values are set, right? The temperature was already being set to a hard-coded value, just a different one.

Yes, but it added that comment, which probably kept those OAI default values in place when the way params are handled in the server was later changed.

> In any case, I think the correct way to fix this would be to still use the same default parameters as OAI but to override them with CLI arguments if the user sets them. But I'm not sure whether that can be easily implemented with the current structure.

The default server params are set here, and they are correctly overridden by CLI args:

const params = signal({
n_predict: 400,
temperature: 0.7,
repeat_last_n: 256, // 0 = disable penalty, -1 = context size
repeat_penalty: 1.18, // 1.0 = disabled
penalize_nl: false,
top_k: 40, // <= 0 to use vocab size
top_p: 0.95, // 1.0 = disabled
min_p: 0.05, // 0 = disabled
tfs_z: 1.0, // 1.0 = disabled
typical_p: 1.0, // 1.0 = disabled
presence_penalty: 0.0, // 0.0 = disabled
frequency_penalty: 0.0, // 0.0 = disabled
mirostat: 0, // 0/1/2
mirostat_tau: 5, // target entropy
mirostat_eta: 0.1, // learning rate
grammar: '',
n_probs: 0, // no completion_probabilities,
min_keep: 0, // min probs from each sampler,
image_data: [],
cache_prompt: true,
api_key: ''
})

On an API request, they are used if no value is provided in the request:

slot.params.stream = json_value(data, "stream", false);
slot.params.cache_prompt = json_value(data, "cache_prompt", false);
slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);
slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
slot.sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
slot.sparams.tfs_z = json_value(data, "tfs_z", default_sparams.tfs_z);
slot.sparams.typical_p = json_value(data, "typical_p", default_sparams.typical_p);
slot.sparams.temp = json_value(data, "temperature", default_sparams.temp);
slot.sparams.dynatemp_range = json_value(data, "dynatemp_range", default_sparams.dynatemp_range);
slot.sparams.dynatemp_exponent = json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
slot.sparams.penalty_last_n = json_value(data, "repeat_last_n", default_sparams.penalty_last_n);
slot.sparams.penalty_repeat = json_value(data, "repeat_penalty", default_sparams.penalty_repeat);
slot.sparams.penalty_freq = json_value(data, "frequency_penalty", default_sparams.penalty_freq);
slot.sparams.penalty_present = json_value(data, "presence_penalty", default_sparams.penalty_present);
slot.sparams.mirostat = json_value(data, "mirostat", default_sparams.mirostat);
slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
slot.params.n_discard = json_value(data, "n_discard", default_params.n_discard);
slot.sparams.seed = json_value(data, "seed", default_sparams.seed);
slot.sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
slot.sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);
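
For context, json_value here is essentially a keyed lookup with a fallback: the default (coming from the CLI-configured default_sparams) is only used when the request does not provide the key. Below is a minimal sketch of such a helper, assuming nlohmann::json as the JSON library; the actual helper in the server code may differ in details:

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Return body[key] if it is present and non-null, otherwise default_value.
template <typename T>
static T json_value(const json & body, const std::string & key, const T & default_value) {
    if (body.contains(key) && !body.at(key).is_null()) {
        try {
            return body.at(key).get<T>();
        } catch (const json::exception &) {
            // The request provided a value of the wrong type: fall back to the default.
            return default_value;
        }
    }
    return default_value;
}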

The problem was that the lines I removed were putting default values into the JSON data; I don't understand why.
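
To illustrate the pattern being described (a hypothetical reconstruction, not the exact removed lines; the function name is made up for the example): the OAI-compat handler was writing hard-coded defaults into the request JSON before it reached the code above, so the json_value calls always found a value and never fell back to the CLI-configured defaults:

// Hypothetical illustration of the kind of code this PR removes: the
// OAI-compat layer pre-fills the body with hard-coded OAI defaults, so the
// per-slot parsing above never sees a missing key and default_sparams
// (which reflects --temp and friends) is effectively ignored.
static json oaicompat_to_llama_params(const json & body) {
    json llama_params;
    llama_params["temperature"] = json_value(body, "temperature", 1.0); // OAI default, not --temp
    llama_params["top_p"]       = json_value(body, "top_p",       1.0); // OAI default
    // ... other sampling keys handled the same way ...
    return llama_params;
}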

@JohannesGaessler (Collaborator)

How about this: create two copies of gpt_params, change the default values in one of the copies to OAI values, then call gpt_params_parse for both copies, and choose which params to use dynamically based on the endpoint.
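
For illustration only, the idea might look roughly like this. This is a sketch under the assumption that gpt_params, gpt_params_parse, and a sparams member with temp and top_p fields are available as in the common library of that period; it is not code from this PR:

#include "common.h" // gpt_params, gpt_params_parse (assumed common-library API)

// Sketch: parse the CLI twice, once on top of llama.cpp defaults and once on
// top of OAI-style defaults, so user-supplied flags override both copies.
static bool init_server_defaults(int argc, char ** argv,
                                 gpt_params & params_native,
                                 gpt_params & params_oai) {
    // Copy 1: llama.cpp's own defaults, overridden by CLI arguments.
    if (!gpt_params_parse(argc, argv, params_native)) {
        return false;
    }
    // Copy 2: seed with OAI-style defaults first, then apply the same CLI arguments.
    params_oai.sparams.temp  = 1.0f; // OAI default (assumption)
    params_oai.sparams.top_p = 1.0f; // OAI default (assumption)
    if (!gpt_params_parse(argc, argv, params_oai)) {
        return false;
    }
    return true;
}

// Per request, the server would then pick the defaults based on the endpoint:
//   const gpt_params & defaults = is_oai_endpoint ? params_oai : params_native;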

@slaren (Collaborator) commented Jul 29, 2024

Is there any reason to use the OAI default values? Presumably these values are tuned for their models, which have nothing to do with the models llama.cpp can use.

@JohannesGaessler (Collaborator)

Good point. Also I'm noticing that the current values on master aren't actually the ones described in the linked documentation. The documentation says that the default temperature value is 0, but here it is 1.

@JohannesGaessler (Collaborator)

And regardless of what the documentation says, stream=false and n_predict=-1 kind of seem like bad default values to me, since if for some reason the server never generates an EOS token it will just hang, right?

@slaren (Collaborator) commented Jul 29, 2024

I think that without passing a value for n_predict it should stop at n_ctx_train since #6638, but I am not sure what the current state of the server is.

@ardfork (Contributor, Author) commented Jul 29, 2024

The multi-user OAI completions compatibility test is failing: Assertion Failed: invalid number of tokens predicted: 75 <> 128.

It seems to work correctly on my end; n_tokens is respected (on the v1/chat/completions endpoint).

@ardfork (Contributor, Author) commented Jul 29, 2024

So, before, it was using max_tokens for n_predict (and then checking and overriding with n_params later), but with my change it was only checking n_params for it.

I added a fallback for max_tokens to imitate the previous behavior.
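
A rough sketch of what such a fallback can look like, reusing the json_value pattern quoted earlier; the exact field names and the merged code may differ:

// Hypothetical shape of the fallback: prefer the OAI-style "max_tokens" key
// if the request provides it, otherwise "n_predict", otherwise the
// CLI-configured default.
slot.params.n_predict = json_value(data, "max_tokens",
                        json_value(data, "n_predict", default_params.n_predict));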

@ngxson (Collaborator) commented Jul 30, 2024

> Is there any reason to use the OAI default values? Presumably these values are tuned for their models, which have nothing to do with the models llama.cpp can use.

Good point. In fact, I believe that from the very first version of the server, we just took the OAI defaults because we wanted it to be compatible with the OAI specs. We have kept the same code since then.

But it makes sense to get rid of this. IMO, I don't expect the temperature to default to 1.0; it's too random for small models. I guess it makes sense for OAI because their models are large.

@mofosyne added the Review Complexity : Low label Jul 30, 2024

@JohannesGaessler (Collaborator) left a comment

I can confirm that the tests now succeed (on my local machine). And I agree that it doesn't make sense to use OAI sampling parameters for completely different models in the first place, so the code should just be removed.

@ngxson merged commit 978ba3d into ggerganov:master Aug 4, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Aug 7, 2024
* Don't ignore llama.cpp params

* Add fallback for max_tokens
Labels: examples, Review Complexity : Low, server