
Server: Fix system_prompt handling #7153

Merged — 1 commit merged into ggerganov:master on May 11, 2024

Conversation

@ngxson (Collaborator) commented May 8, 2024

Resolves #7152 #7089

  • assistant_name and user_name were assigned but never used ==> they are now removed. Users can use "stop" to specify the antiprompt instead.
  • system_prompt is now simply a string (see the example below).
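
For illustration, a request body under the new scheme might look like this (the values are made up; before this PR, system_prompt was an object with named fields such as prompt and anti_prompt, approximately):

```json
{
  "prompt": "Hello, how are you?",
  "system_prompt": "Transcript of a dialog between a user and an assistant.",
  "stop": ["User:"]
}
```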

@ngxson ngxson requested a review from ggerganov May 8, 2024 21:51
@ngxson (Collaborator, Author) commented May 8, 2024

@ggerganov @phymbert In addition to this, I found a problem: any user can change the system prompt, which affects other users of the same server. This poses a small security risk. Do you think we need to introduce some kind of "system prompt lock" in the future? (Maybe some kind of --can-modify-system-prompt param, so the system prompt is read-only by default — see the sketch below.)
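
A minimal sketch of the proposed behavior (hypothetical — the flag name is taken from the comment above and does not exist in llama.cpp; Python is used here for brevity, while the real server is C++):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag: without it, the system prompt stays read-only.
    parser.add_argument("--can-modify-system-prompt", action="store_true",
                        help="allow clients to change the shared system prompt")
    return parser

def handle_request(body: dict, state: dict, allow_modify: bool) -> dict:
    """Reject attempts to change the shared system prompt unless allowed."""
    new_prompt = body.get("system_prompt")
    if new_prompt is not None and new_prompt != state["system_prompt"]:
        if not allow_modify:
            return {"error": "system_prompt is read-only on this server"}
        state["system_prompt"] = new_prompt
    return {"ok": True}

if __name__ == "__main__":
    args = build_parser().parse_args()
    state = {"system_prompt": ""}
    print(handle_request({"system_prompt": "new prompt"}, state,
                         args.can_modify_system_prompt))
```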

@ngxson ngxson requested a review from phymbert May 8, 2024 21:54
@phymbert (Collaborator) left a comment


Thanks. I've never used the system prompt, so I have no strong position.

github-actions bot (Contributor) commented May 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 531 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8897.18ms p(95)=21521.99ms fails=, finish reason: stop=469 truncated=62
  • Prompt processing (pp): avg=111.02tk/s p(95)=552.82tk/s
  • Token generation (tg): avg=30.92tk/s p(95)=46.57tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=xsn/fix_sys_prompt commit=9217f5ef361b1720aaf27107721bb97640144034

[chart] prompt_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations
[chart] predicted_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations
[chart] kv_cache_usage_ratio — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations
[chart] requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations

@ggerganov (Owner) left a comment


Yeah, the system prompt can be changed by anyone. Maybe it's better to have it locked by default, and be able to change it only if a specific argument is added when starting the server.

@mofosyne added labels on May 9, 2024: bugfix (fixes an issue or bug); Review Complexity: Low (trivial changes to code that most beginner devs can tackle, e.g. a UI fix); help wanted (extra attention is needed)
@scottstirling commented May 11, 2024

System message == system prompt?

Llama 3 and OpenAI's ChatML specify a system role and system messages in their APIs, sometimes also known as a system prompt.

The ability to include role="system" from the application side is essential to many application features: they often achieve their outcomes by inserting a detailed system message at the head of the chat, and the system message can change from one request to the next even within the same chat.

https://wegrok.ai uses system messages, and I am sure OpenAI uses them for their "Custom instructions" and meta.ai uses them for various chat starters.

I would argue that the ability to lock out system messages in the APIs is undesirable. System messages do not impact the model or other users unless they are applied at the server level, but that is not the usual case AFAICT.
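
For reference, the kind of per-request system message described above looks like this in an OpenAI-style chat request (illustrative values; llama.cpp's server exposes a compatible /v1/chat/completions endpoint):

```json
{
  "messages": [
    {"role": "system", "content": "You are a terse SQL tutor."},
    {"role": "user", "content": "Explain LEFT JOIN in one sentence."}
  ]
}
```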

@ngxson (Collaborator, Author) commented May 11, 2024

@scottstirling The system_prompt mentioned in this PR is a prefix that can be prepended to all sequences processed by the server. The reason it exists was to demonstrate usage with multiple sequences: since the system prompt is fixed, it is decoded only once even when you have multiple sequences (better performance).

Having a system message as the system prompt can be useful, but only once we can correctly format it. For the moment that would be quite messy to implement, so I'd rather not have this feature and instead leave it to the application layer to handle (i.e. you can have a "proxy" between the llama.cpp server and the frontend).
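
A minimal sketch of such a proxy (hypothetical; it assumes a llama.cpp server listening on localhost:8080 with the OpenAI-compatible chat endpoint, and uses only the Python standard library):

```python
# Hypothetical proxy: prepends an application-chosen system message to each
# chat request before forwarding it to the llama.cpp server.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080/v1/chat/completions"  # assumed address
SYSTEM_MESSAGE = {"role": "system", "content": "You are a helpful assistant."}

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        # Inject the system message at the head of the chat.
        body["messages"] = [SYSTEM_MESSAGE] + body.get("messages", [])
        req = urllib.request.Request(
            UPSTREAM, data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("localhost", 8081), ProxyHandler).serve_forever()
```

This keeps the shared system_prompt in llama.cpp untouched while still letting the application vary the system message per request.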

Also, just to be clear, this idea is not the same as what you're describing. https://wegrok.ai (and other APIs) evaluate the system message for each input sequence, not for all sequences, so each user can set their own system message. If they had used the same idea as system_prompt in llama.cpp, all users would have been forced to share the same system message.

@phymbert Because you mentioned a while ago that you're using the server in a prod-like environment, I thought it best to also ask you about the risk. If it's not a big problem, you can ignore it. Thanks anyway for the confirmation.

@ngxson ngxson merged commit 72c177c into ggerganov:master May 11, 2024
64 checks passed
Development

Successfully merging this pull request may close these issues.

Abort in example server (/completions route) given string-type system_prompt