
ggml : full ALiBi support #7192

Merged (10 commits into master, May 11, 2024)

Conversation

@ggerganov (Owner) commented May 10, 2024

Implementing ALiBi as explained here: https://github.com/ofirpress/attention_with_linear_biases

If I understand correctly, the ALiBi bias can become part of the KQ_mask:

A - ALiBi integer matrix (0.0 when no ALiBi)
m - ALiBi head-specific slope parameter (1.0 when no ALiBi)

KQ_mask = causal_mask + A

soft_max(KQ*scale + KQ_mask*m)

Therefore there is no need to create the KQ_pos tensor as I initially thought. If this is correct, then we can simplify the ggml_soft_max_ext() operator and no longer pass the positions tensor. Extending Flash Attention support should also be possible and simple
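
For reference, a minimal standalone C sketch of the fused operation described above (not the actual ggml kernel; it operates on one row of KQ for a single head, with the causal mask and the ALiBi integer matrix A already folded into the mask row):

#include <math.h>

// soft_max(KQ*scale + KQ_mask*m) for one row of one head.
// row   : n_kv attention scores (KQ) for this query token
// mask  : causal mask (-INFINITY for masked cells) with the ALiBi matrix A folded in
// slope : head-specific ALiBi slope m (1.0f when ALiBi is disabled)
// Assumes at least one unmasked cell per row.
static void soft_max_row(float * row, const float * mask, int n_kv, float scale, float slope) {
    float max_val = -INFINITY;
    for (int i = 0; i < n_kv; ++i) {
        row[i] = row[i]*scale + mask[i]*slope; // masked cells stay at -INFINITY
        if (row[i] > max_val) {
            max_val = row[i];
        }
    }
    float sum = 0.0f;
    for (int i = 0; i < n_kv; ++i) {
        row[i] = expf(row[i] - max_val); // exp(-INFINITY) == 0, so masked cells drop out
        sum += row[i];
    }
    for (int i = 0; i < n_kv; ++i) {
        row[i] /= sum;
    }
}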

This PR is needed to properly support Jina embedding models: #6826

Workflow

  • Remove ggml_alibi()
  • Update ggml_soft_max_ext() to no longer accept the pos tensor (a call-site sketch follows this list):
    • CPU
    • Metal
    • CUDA
    • SYCL
    • Vulkan (requires change similar to d0592d4, cc @0cc4m)
  • Update ggml_flash_attn_ext() to support the new ALiBi KQ_mask:
    • CPU
    • Metal
    • CUDA
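
For illustration, a minimal call-site sketch of the simplified operator (the signature is assumed from the description above, not a verbatim excerpt from this PR; it presumes ggml.h and an initialized ggml_context):

// kq      - attention scores KQ
// kq_mask - causal mask with the ALiBi integer matrix folded in (see the PR description)
// When max_bias == 0.0f the head slopes collapse to 1.0f and this reduces to a plain masked softmax.
struct ggml_tensor * kq_soft = ggml_soft_max_ext(ctx, kq, kq_mask, kq_scale, max_bias);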

Tests

make -j && ./main -m ./models/refact-1b-base/ggml-model-f16.gguf -p "bool is_prime(" -e -n 256 -s 1 --temp 0.0 --verbose-prompt
make -j && ./infill -m models/refact-1b-fim/ggml-model-f16.gguf --in-prefix "def helloworld(): print(\"hel" --in-suffix " print(\"goodbye world\") " -ngl 99 --temp 0 --verbose-prompt

@mofosyne added the Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs), enhancement (New feature or request), and model (Model specific) labels on May 10, 2024
Comment on lines +2820 to +2821
// TODO: is there a better way to handle -INFINITY?
dst_data[i00] = src[0] == -INFINITY ? -MAXHALF : src[0];
@ggerganov (Owner, Author) commented May 10, 2024

Not sure I fully understand the problem, but when we cast the KQ_mask from F32 to F16 in this kernel, the F32 -INFINITY values are converted to some F16 value that, when multiplied by the ALiBi slope, results in garbage. Even if I force the slope to be 1.0h it still produces garbage. I expected the result to still be -INFINITY, but that is not the case. Since there is no way to print these values in Metal, this is the workaround that I found to work, but it feels a bit poor.
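
For illustration, the same guard expressed as plain C (a hypothetical host-side analogue of the Metal line quoted above; 65504 is the largest finite F16 value, which is what -MAXHALF refers to):

#include <math.h>

// Replace -INFINITY with the most negative finite half-precision value before the
// F32 -> F16 cast, so the later multiplication by the ALiBi slope cannot turn the
// mask into garbage.
#define MAXHALF 65504.0f

static float sanitize_mask_value(float v) {
    return v == -INFINITY ? -MAXHALF : v;
}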


for (int i1 = ir0; i1 < ir1; i1++) {
    // ALiBi
    const uint32_t h = (i1/ne01)%ne02; // head
    const float slope = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
Contributor

I do not see how this directly works. How does it implement this logic:

[0, 1, 2, 3], [1, 0, 1, 2]

and the negative slope, i.e. the negativity of this matrix?

@ggerganov (Owner, Author)

We compute the integer matrix here (llama.cpp, lines 10966 to 10987 at f7055d3):

// For causal attention, use only the previous KV cells
// of the correct sequence for each token of the batch.
// It's assumed that if a token in the batch has multiple sequences, they are equivalent.
for (int h = 0; h < 1; ++h) {
    for (int j = 0; j < n_tokens; ++j) {
        const llama_pos    pos    = batch.pos[j];
        const llama_seq_id seq_id = batch.seq_id[j][0];

        for (int i = 0; i < n_kv; ++i) {
            float f;
            if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) {
                f = -INFINITY;
            } else {
                if (hparams.use_alibi) {
                    f = -fabs(lctx.kv_self.cells[i].pos - pos);
                } else {
                    f = 0.0f;
                }
            }

            data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
        }
    }
}

We store it in KQ_mask. The slope is just the head-specific hyperparameter m from the link above.
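
For clarity, here is a standalone sketch of how the head-specific slope is typically derived from max_bias and the head count (names follow the CPU snippet quoted earlier; the geometric sequence matches the ALiBi paper, but treat the exact constants as an assumption rather than a verbatim excerpt):

#include <math.h>
#include <stdint.h>

// n_head_log2 is the largest power of two <= n_head; heads beyond it use the
// interleaved "second" sequence of slopes, as in the ALiBi paper.
static float alibi_slope(uint32_t h, uint32_t n_head, float max_bias) {
    if (max_bias <= 0.0f) {
        return 1.0f; // ALiBi disabled
    }
    const uint32_t n_head_log2 = 1u << (uint32_t) floorf(log2f((float) n_head));
    const float m0 = powf(2.0f, -(max_bias       ) / n_head_log2);
    const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);
    return h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1);
}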

Contributor

ah ok, good thanks!

@ggerganov (Owner, Author) commented:

I think this is ready. @JoanFM, I will now rebase the Jina branch on top of this branch and will adapt to the changes. Will ping you when ready so we can do some tests and verify that this works. If it is good, will mark this PR ready for review and proceed

@ggerganov (Owner, Author) commented May 10, 2024

Updated the branch in #6826 and my embedding tests using Jina worked correctly, so it is ready for further tests. Let me know if you spot something that is not right

@ggerganov marked this pull request as ready for review May 10, 2024 12:32
@ggerganov requested a review from slaren May 10, 2024 12:33
@JoanFM (Contributor) commented May 10, 2024

> Updated the branch in #6826 and my embedding tests using Jina worked correctly, so it is ready for further tests. Let me know if you spot something that is not right

I have done some tests on my end and it seems to work fine.

ggml.c (outdated review thread, resolved)
float scale = 1.0f;
float max_bias = 0.0f;

memcpy(&scale, (float *) dst->op_params + 0, sizeof(float));
Contributor

Just to confirm: this is needed because, to set the params, we also do a memcpy instead of making op_params a float[], for alignment reasons, right?

@ggerganov (Owner, Author)

Yes, correct
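
For illustration, a minimal sketch of the pattern being confirmed (the struct and helper names are made up here, not the actual ggml API): op_params is a raw int32_t array, so float parameters are written and read with memcpy instead of type-punned pointer stores.

#include <string.h>
#include <stdint.h>

// Illustrative only: op_params stored as raw 32-bit words, floats copied bytewise
// to avoid strict-aliasing and alignment issues.
struct fake_tensor {
    int32_t op_params[16];
};

static void set_op_param_f32(struct fake_tensor * t, int idx, float v) {
    memcpy((float *) t->op_params + idx, &v, sizeof(float));
}

static float get_op_param_f32(const struct fake_tensor * t, int idx) {
    float v;
    memcpy(&v, (const float *) t->op_params + idx, sizeof(float));
    return v;
}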

@NeoZhangJianyu (Collaborator) left a comment

I tested the SYCL CI on an Intel GPU.
The quality is OK!
soft_max is the same as on the base branch.

@ggerganov merged commit 9cb317f into master May 11, 2024
59 of 64 checks passed
@JohannesGaessler (Collaborator)

I think the total amount of work needed for conflict resolution would have been lower if #7188 had been merged first, but what's done is done.

Contributor
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 548 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8511.54ms p(95)=20878.8ms fails=, finish reason: stop=484 truncated=64
  • Prompt processing (pp): avg=91.75tk/s p(95)=403.42tk/s
  • Token generation (tg): avg=34.45tk/s p(95)=45.69tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/refactor-alibi-2 commit=03e940cdec1b91b848b9652e61cc3a2f4541d171

[Benchmark charts omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 548 iterations; series: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

Labels: enhancement (New feature or request), model (Model specific), refactoring (Refactoring), Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)
6 participants