
Server: Don't ignore llama.cpp params #8754

Merged
merged 2 commits into ggerganov:master on Aug 4, 2024

Conversation

@ardfork (Contributor) commented Jul 29, 2024

I was wondering why I had high variance using the server API after launching llama-server with --temp 0.2. It turns out the llama.cpp server ignores that flag and even lies in the /props endpoint (showing the temp that I had set).

I found out that this was caused by PR #4668, which makes the API endpoints use OAI default values instead of llama.cpp's. I don't understand why anyone would want that; if there is a problem with llama.cpp's default parameters, they should be changed, not ignored.

@JohannesGaessler (Collaborator)

I don't understand why #4668 is linked. That PR changed the default values used, but it did not change the logic for how and when these values are set, right? The temperature was already being set to a hard-coded value, just a different one.

In any case, I think the correct way to fix this would be to still use the same default parameters as OAI but to override them with CLI arguments if the user sets them. But I'm not sure whether that can be easily implemented with the current structure.

@ardfork (Contributor, Author) commented Jul 29, 2024

> I don't understand why #4668 is linked. That PR changed the default values used, but it did not change the logic for how and when these values are set, right? The temperature was already being set to a hard-coded value, just a different one.

Yes, but it added that comment, which probably kept those OAI default values in place when the way params are handled in the server was later changed.

> In any case, I think the correct way to fix this would be to still use the same default parameters as OAI but to override them with CLI arguments if the user sets them. But I'm not sure whether that can be easily implemented with the current structure.

The default server params are set here, and they are correctly overridden by CLI args:

const params = signal({
n_predict: 400,
temperature: 0.7,
repeat_last_n: 256, // 0 = disable penalty, -1 = context size
repeat_penalty: 1.18, // 1.0 = disabled
penalize_nl: false,
top_k: 40, // <= 0 to use vocab size
top_p: 0.95, // 1.0 = disabled
min_p: 0.05, // 0 = disabled
tfs_z: 1.0, // 1.0 = disabled
typical_p: 1.0, // 1.0 = disabled
presence_penalty: 0.0, // 0.0 = disabled
frequency_penalty: 0.0, // 0.0 = disabled
mirostat: 0, // 0/1/2
mirostat_tau: 5, // target entropy
mirostat_eta: 0.1, // learning rate
grammar: '',
n_probs: 0, // no completion_probabilities,
min_keep: 0, // min probs from each sampler,
image_data: [],
cache_prompt: true,
api_key: ''
})

On an API request, they are used if no value is provided in the request:

slot.params.stream = json_value(data, "stream", false);
slot.params.cache_prompt = json_value(data, "cache_prompt", false);
slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);
slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
slot.sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
slot.sparams.tfs_z = json_value(data, "tfs_z", default_sparams.tfs_z);
slot.sparams.typical_p = json_value(data, "typical_p", default_sparams.typical_p);
slot.sparams.temp = json_value(data, "temperature", default_sparams.temp);
slot.sparams.dynatemp_range = json_value(data, "dynatemp_range", default_sparams.dynatemp_range);
slot.sparams.dynatemp_exponent = json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
slot.sparams.penalty_last_n = json_value(data, "repeat_last_n", default_sparams.penalty_last_n);
slot.sparams.penalty_repeat = json_value(data, "repeat_penalty", default_sparams.penalty_repeat);
slot.sparams.penalty_freq = json_value(data, "frequency_penalty", default_sparams.penalty_freq);
slot.sparams.penalty_present = json_value(data, "presence_penalty", default_sparams.penalty_present);
slot.sparams.mirostat = json_value(data, "mirostat", default_sparams.mirostat);
slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
slot.params.n_discard = json_value(data, "n_discard", default_params.n_discard);
slot.sparams.seed = json_value(data, "seed", default_sparams.seed);
slot.sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
slot.sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);
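
For context, json_value here is essentially a keyed lookup with a fallback: the default (coming from the CLI-configured default_sparams) is only used when the request does not provide the key. Below is a minimal sketch of such a helper, assuming nlohmann::json as the JSON library; the actual helper in the server code may differ in details:

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Return body[key] if it is present and non-null, otherwise default_value.
template <typename T>
static T json_value(const json & body, const std::string & key, const T & default_value) {
    if (body.contains(key) && !body.at(key).is_null()) {
        try {
            return body.at(key).get<T>();
        } catch (const json::exception &) {
            // The request provided a value of the wrong type: fall back to the default.
            return default_value;
        }
    }
    return default_value;
}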

The problem was that the lines I removed were putting default values into the JSON data; I don't understand why.
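
To illustrate the pattern being described (a hypothetical reconstruction, not the exact removed lines; the function name is made up for the example): the OAI-compat handler was writing hard-coded defaults into the request JSON before it reached the code above, so the json_value calls always found a value and never fell back to the CLI-configured defaults:

// Hypothetical illustration of the kind of code this PR removes: the
// OAI-compat layer pre-fills the body with hard-coded OAI defaults, so the
// per-slot parsing above never sees a missing key and default_sparams
// (which reflects --temp and friends) is effectively ignored.
static json oaicompat_to_llama_params(const json & body) {
    json llama_params;
    llama_params["temperature"] = json_value(body, "temperature", 1.0); // OAI default, not --temp
    llama_params["top_p"]       = json_value(body, "top_p",       1.0); // OAI default
    // ... other sampling keys handled the same way ...
    return llama_params;
}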

@JohannesGaessler (Collaborator)

How about this: create two copies of gpt_params, change the default values in one of the copies to OAI values, then call gpt_params_parse for both copies, and choose which params to use dynamically based on the endpoint.
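
For illustration only, the idea might look roughly like this. This is a sketch under the assumption that gpt_params, gpt_params_parse, and a sparams member with temp and top_p fields are available as in the common library of that period; it is not code from this PR:

#include "common.h" // gpt_params, gpt_params_parse (assumed common-library API)

// Sketch: parse the CLI twice, once on top of llama.cpp defaults and once on
// top of OAI-style defaults, so user-supplied flags override both copies.
static bool init_server_defaults(int argc, char ** argv,
                                 gpt_params & params_native,
                                 gpt_params & params_oai) {
    // Copy 1: llama.cpp's own defaults, overridden by CLI arguments.
    if (!gpt_params_parse(argc, argv, params_native)) {
        return false;
    }
    // Copy 2: seed with OAI-style defaults first, then apply the same CLI arguments.
    params_oai.sparams.temp  = 1.0f; // OAI default (assumption)
    params_oai.sparams.top_p = 1.0f; // OAI default (assumption)
    if (!gpt_params_parse(argc, argv, params_oai)) {
        return false;
    }
    return true;
}

// Per request, the server would then pick the defaults based on the endpoint:
//   const gpt_params & defaults = is_oai_endpoint ? params_oai : params_native;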

@slaren (Collaborator) commented Jul 29, 2024

Is there any reason to use the OAI default values? Presumably these values are tuned for their models, which have nothing to do with the models llama.cpp can use.

@JohannesGaessler (Collaborator)

Good point. Also I'm noticing that the current values on master aren't actually the ones described in the linked documentation. The documentation says that the default temperature value is 0, but here it is 1.

@JohannesGaessler (Collaborator)

And regardless of what the documentation says, stream=false and n_predict=-1 kind of seem like bad default values to me, since if for some reason the server never generates an EOS token it will just hang, right?

@slaren (Collaborator) commented Jul 29, 2024

I think that without passing a value for n_predict it should stop at n_ctx_train since #6638, but I am not sure what the current state of the server is.

@ardfork (Contributor, Author) commented Jul 29, 2024

The multi-user OAI completions compatibility test is failing: Assertion Failed: invalid number of tokens predicted: 75 <> 128.

It seems to work correctly on my end; n_tokens is respected (on the v1/chat/completions endpoint).

@ardfork (Contributor, Author) commented Jul 29, 2024

So, before, it was using max_tokens for n_predict (and then checking and overriding with n_params later), but with my change it was only checking n_params for it.

I added a fallback for max_tokens to imitate the previous behavior.
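
A rough sketch of what such a fallback can look like, reusing the json_value pattern quoted earlier; the exact field names and the merged code may differ:

// Hypothetical shape of the fallback: prefer the OAI-style "max_tokens" key
// if the request provides it, otherwise "n_predict", otherwise the
// CLI-configured default.
slot.params.n_predict = json_value(data, "max_tokens",
                        json_value(data, "n_predict", default_params.n_predict));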

@ngxson (Collaborator) commented Jul 30, 2024

> Is there any reason to use the OAI default values? Presumably these values are tuned for their models, which have nothing to do with the models llama.cpp can use.

Good point. In fact, I believe that from the very first version of the server, we just took the OAI defaults because we wanted it to be compatible with the OAI specs. We have kept the same code since then.

But it makes sense to get rid of this. IMO, I don't expect the temperature to default to 1.0; it's too random for small models. I guess it makes sense for OAI because their models are large.

@mofosyne added the Review Complexity : Low label Jul 30, 2024

@JohannesGaessler (Collaborator) left a comment

I can confirm that the tests now succeed (on my local machine). And I agree that it doesn't make sense to use OAI sampling parameters for completely different models in the first place, so the code should just be removed.

@ngxson merged commit 978ba3d into ggerganov:master Aug 4, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Aug 7, 2024
* Don't ignore llama.cpp params

* Add fallback for max_tokens
Labels: examples, Review Complexity : Low, server