grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) #6609

Merged (6 commits) on Apr 11, 2024

@ochafik ochafik commented Apr 11, 2024

See discussion in #4218 (comment)

Here's simple code to reproduce a 1.6x inference speedup (on Metal) with a nested, repetition-heavy grammar from #6555 (for the JSON schema {"items": {"type": "number"}, "maxItems": 100}):

git clone https://github.com/ochafik/llama.cpp --branch grammar-speedup3 llama.cpp-grammar && \
    cd llama.cpp-grammar && \
    git pull && \
    mkdir -p models/7B

echo '
    decimal-part ::= [0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
    integral-part ::= [0-9] | [1-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
    number ::= ("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space
    root ::= "[" space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number "," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number ("," space number)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)? "]" space
    space ::= " "?
' > json_numbers.grammar

hyperfine \
    --warmup 1 --runs 5 \
    -L branch grammar-speedup3,master \
    --setup 'git checkout {branch} && make clean && make -j LLAMA_CURL=1 main' \
    './main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344'
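
The grammar above is large because bounded repetition is spelled out as nested optional groups. A minimal sketch of how such a rule body could be generated follows; `nested_repetition` is a hypothetical helper written for illustration, not the converter code from #6555.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the PR): emits `item` min_items times,
// then (max_items - min_items) further copies as nested optional groups,
// e.g. "[0-9] ([0-9] ([0-9])?)?" for min_items = 1, max_items = 3.
std::string nested_repetition(const std::string & item, int min_items, int max_items) {
    assert(min_items >= 1 && max_items >= min_items);

    // Required occurrences, separated by spaces.
    std::string out = item;
    for (int i = 1; i < min_items; i++) {
        out += " " + item;
    }

    // Optional occurrences: open one group per extra item...
    const int optional = max_items - min_items;
    for (int i = 0; i < optional; i++) {
        out += " (" + item;
    }
    // ...then close every group with the optional marker.
    for (int i = 0; i < optional; i++) {
        out += ")?";
    }
    return out;
}
```

Each extra allowed item adds one more level of nesting, which is the shape that makes grammar evaluation on this input expensive.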
Benchmark 1: ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):      8.405 s ±  0.234 s    [User: 6.806 s, System: 0.412 s]
  Range (min … max):    8.179 s …  8.750 s    5 runs
 
Benchmark 2: ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = master)
  Time (mean ± σ):     13.386 s ±  0.109 s    [User: 9.596 s, System: 2.552 s]
  Range (min … max):   13.253 s … 13.520 s    5 runs
 
Summary
  ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = grammar-speedup3) ran
    1.59 ± 0.05 times faster than ./main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file json_numbers.grammar -p "List of 20 integers starting from 0" --seed 12344 (branch = master)
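
The "1.59 ± 0.05" figure can be checked by hand from the two means and standard deviations. The sketch below assumes the uncertainty is obtained by standard first-order error propagation for a ratio; the type and function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Mean and standard deviation of one benchmark, in seconds.
struct Timing { double mean; double stddev; };

// Speedup factor and its propagated uncertainty.
struct Ratio { double value; double stddev; };

// How many times faster `fast` is than `slow`.
// Assumption: the relative errors of the two means add in quadrature.
Ratio relative_speed(const Timing & fast, const Timing & slow) {
    assert(fast.mean > 0.0 && slow.mean > 0.0);
    const double ratio = slow.mean / fast.mean;
    const double rel_fast = fast.stddev / fast.mean;
    const double rel_slow = slow.stddev / slow.mean;
    const double rel = std::sqrt(rel_fast * rel_fast + rel_slow * rel_slow);
    return { ratio, ratio * rel };
}
```

With the numbers above, `relative_speed({8.405, 0.234}, {13.386, 0.109})` gives a ratio of about 1.59 with an uncertainty of about 0.05, which agrees with the reported summary.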

cc/ @HanClinto

github-actions bot commented Apr 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 458 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10260.57ms p(95)=26344.27ms fails=, finish reason: stop=408 truncated=50
  • Prompt processing (pp): avg=111.91tk/s p(95)=474.61tk/s
  • Token generation (tg): avg=24.24tk/s p(95)=38.01tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=grammar-speedup3 commit=1e0f466920dbd6747852db864118266e6f256700

[Charts omitted: time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 458 iterations".]

@HanClinto
Collaborator

Similar to the integration tests, examples/gbnf-validator will eventually need to be updated to use the new API as well. That's lower priority though, and I can do that after we get through all of this.

Reading through this PR, I'm amazed that such a simple change provides such a dramatic speedup. I still just have a hard time believing it's as effective as it is. :)

@ochafik
Collaborator Author

ochafik commented Apr 11, 2024

> Similar to the integration tests, examples/gbnf-validator will eventually need to be updated to use the new API as well.

Done, thanks!

> Reading through this PR, I'm amazed that such a simple change provides such a dramatic speedup. I still just have a hard time believing it's as effective as it is. :)

Yeah I tried half a dozen similar rewrites and only these lucky two struck a chord :-D (let's hope for much more dramatic speedups w/ upcoming changes #4218 (comment))
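
For readers skimming the thread, here is a toy sketch of the two kinds of rewrite named in the PR title (reserving vector capacity and reusing a vector across calls). It is not the llama.cpp code: `Stack`, `accept_by_value`, and `accept_into` are simplified stand-ins invented for this example, and the "accept" rule is deliberately trivial.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Stack  = std::vector<int>;     // stand-in for one grammar stack
using Stacks = std::vector<Stack>;   // stand-in for the set of live stacks

// "Before" shape: allocates and returns a fresh vector on every call.
Stacks accept_by_value(const Stacks & stacks, int chr) {
    Stacks new_stacks;
    for (const Stack & s : stacks) {
        // Toy rule: a stack survives if its top equals chr; the top is popped.
        if (!s.empty() && s.back() == chr) {
            new_stacks.emplace_back(s.begin(), s.end() - 1);
        }
    }
    return new_stacks;
}

// "After" shape: the caller owns the output vector and passes it in.
// clear() destroys the elements but keeps the outer buffer's capacity,
// so repeated calls can avoid reallocating that buffer.
void accept_into(const Stacks & stacks, int chr, Stacks & new_stacks) {
    assert(&stacks != &new_stacks);     // input and output must be distinct
    new_stacks.clear();
    new_stacks.reserve(stacks.size());  // at most one output per input stack
    for (const Stack & s : stacks) {
        if (!s.empty() && s.back() == chr) {
            new_stacks.emplace_back(s.begin(), s.end() - 1);
        }
    }
}
```

Both functions compute the same result; the second only changes who owns the output buffer. Whether that pays off depends on how often the function is called per token, which is why the benchmark above matters more than the sketch.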

llama.cpp Outdated
  for (auto it = code_points.begin(), end = code_points.end() - 1; it != end; ++it) {
-     grammar->stacks = llama_grammar_accept(grammar->rules, grammar->stacks, *it);
+     llama_grammar_accept(grammar->rules, grammar->stacks, *it, tmp_new_stacks);
+     tmp_new_stacks.swap(grammar->stacks);
Collaborator

Is this better than saying grammar->stacks = tmp_new_stacks;? Because new_stacks is .clear()'d on 11921, it seems like we don't need to save its value here, and we could save a small step (?).

Mainly though, the recursive nature of the swap here was making my eyes cross when trying to follow exactly what this change was doing and how the contents of grammar->stacks and tmp_new_stacks were ping-ponging back and forth in this loop, so getting rid of the .swap() might make it a bit easier to read as well?

Collaborator

FWIW, I tried making this change (into a local grammar-speedup4 branch), and it didn't significantly improve things, but it wasn't slower, and I think the code is a bit more readable:

Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup4)
  Time (mean ± σ):     12.586 s ±  0.698 s    [User: 8.488 s, System: 1.799 s]
  Range (min … max):   12.012 s … 13.726 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):     12.904 s ±  0.854 s    [User: 8.583 s, System: 1.954 s]
  Range (min … max):   11.846 s … 13.963 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup4) ran
    1.03 ± 0.09 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)

Collaborator Author

Heh, turns out my eyes-crossing swap wasn't even making things faster, removed it / looks simpler thanks.
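
To make the two variants discussed in this thread concrete, here is a small sketch of the loop with `swap()` and with plain assignment. `step`, `run_with_swap`, and `run_with_assign` are hypothetical names, and `step` is a trivial stand-in for the real accept function; the only point is that both loops end in the same state.

```cpp
#include <cassert>
#include <vector>

using Stacks = std::vector<int>;  // toy stand-in for the grammar stacks

// Toy step: writes the next state into `out`, clearing it first
// (mirrors an accept function that takes an output parameter).
void step(const Stacks & in, int chr, Stacks & out) {
    assert(&in != &out);  // input and output must be distinct
    out.clear();
    for (int x : in) {
        out.push_back(x + chr);
    }
}

// Variant 1: ping-pong the two buffers with swap().
Stacks run_with_swap(Stacks stacks, const std::vector<int> & chars) {
    Stacks tmp;
    for (int c : chars) {
        step(stacks, c, tmp);
        tmp.swap(stacks);  // stacks = new state; tmp = old state, cleared by the next step
    }
    return stacks;
}

// Variant 2: plain copy assignment, as suggested in the review.
Stacks run_with_assign(Stacks stacks, const std::vector<int> & chars) {
    Stacks tmp;
    for (int c : chars) {
        step(stacks, c, tmp);
        stacks = tmp;      // copies tmp; tmp keeps its buffer for the next step
    }
    return stacks;
}
```

The swap avoids a copy per iteration while the assignment is easier to follow; the benchmark in this thread found no significant difference between them on this workload.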

@HanClinto
Collaborator

FWIW, I've independently confirmed the (rather dramatic) speedup results of 1.71x on my system (wow!):

Benchmark 1: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3)
  Time (mean ± σ):     12.302 s ±  1.341 s    [User: 8.369 s, System: 1.672 s]
  Range (min … max):   11.405 s … 14.642 s    5 runs

Benchmark 2: ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     20.978 s ±  1.003 s    [User: 11.519 s, System: 6.894 s]
  Range (min … max):   19.908 s … 22.488 s    5 runs

Summary
  ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = grammar-speedup3) ran
    1.71 ± 0.20 times faster than ./main \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file json_numbers.grammar \
        -p "List of 20 integers starting from 0" \
        --seed 12344 (branch = master)

Macbook Pro, Apple M1 Pro, 32 GB of RAM, and about a billion open Firefox tabs.

Really awesome work, @ochafik !

@HanClinto
Collaborator

Other than my minor suggestion re: swap(), I'm really happy with this PR, and can't wait to see it merged in!

@HanClinto HanClinto left a comment

This PR looks good to me!

@ochafik ochafik marked this pull request as ready for review April 11, 2024 18:46
@ochafik ochafik merged commit cbaadc9 into ggerganov:master Apr 11, 2024
47 of 50 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Apr 11, 2024
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…/ reuses) (ggerganov#6609)

* grammars: reserve rejects & next candidates

* grammars: reuse new_stacks

* grammars: fix missing sig change in llama.h

* grammars: fix test (api changed)

* grammars: update gbnf-validator.cpp

* grammars: simpler syntax (no swap)