
ggml : multi-thread ggml_rope() (~3-4 times faster on M1) #781

Merged: 1 commit into master on Apr 5, 2023

Conversation

@ggerganov (Owner) commented Apr 5, 2023

🤖 Generated by Copilot at 625f212

Summary

🧶🚀🧑‍💻

This pull request improves the performance of the ggml library by parallelizing the rope operation on tensors and by modifying the graph executor to handle parallel tasks. It affects the file ggml.c.

We'll hoist the sail with a rope operation
ggml_compute_forward_rope_f32 and f16
We'll work in parallel, no more hesitation
On the count of three, pull hard, me friends

Walkthrough

  • Parallelize the rope operation for both f32 and f16 data types by dividing the input rows among the available threads (see the sketch after this list)
  • Remove the invalid assertions that the thread index is zero from the ggml_compute_forward_rope_f32 and ggml_compute_forward_rope_f16 functions
  • Set the number of tasks for the rope operation node to the number of threads in ggml.c
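
A minimal sketch of the row split described in the first bullet, following the usual ggml work-split pattern (ith is the thread index, nth the thread count, and nr the total number of rows; MIN as defined in ggml.c; the exact nr computation in ggml_rope is abbreviated here):

    // rows per thread, rounded up so that all nr rows are covered
    const int dr = (nr + nth - 1)/nth;

    // half-open row range [ir0, ir1) owned by thread ith
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    // running row counter; the loop nest skips rows outside [ir0, ir1)
    int ir = 0;

Each thread walks the same loop nest and processes only the rows whose running index falls in its range, which keeps the loop structure unchanged.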

@ggerganov mentioned this pull request Apr 5, 2023
@prusnak (Collaborator) left a comment


Tested on M1.

Before the change (master 5a8c4f6) - second run:

llama_print_timings:        load time =  2294.47 ms
llama_print_timings:      sample time =   116.82 ms /   128 runs   (    0.91 ms per run)
llama_print_timings: prompt eval time =  2090.16 ms /     8 tokens (  261.27 ms per token)
llama_print_timings:        eval time = 23215.24 ms /   127 runs   (  182.80 ms per run)
llama_print_timings:       total time = 25628.04 ms

After the change (commit 625f212) - second run:

llama_print_timings:        load time =  1460.12 ms
llama_print_timings:      sample time =    94.96 ms /   128 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =  1272.30 ms /     8 tokens (  159.04 ms per token)
llama_print_timings:        eval time = 21661.06 ms /   127 runs   (  170.56 ms per run)
llama_print_timings:       total time = 23217.51 ms

for (int64_t i3 = 0; i3 < ne3; i3++) {
    for (int64_t i2 = (mode == 0 ? 0 : n_past); i2 < ne2; i2++) {
        const int p = (mode == 0 ? n_past + i2 : i2);
        for (int64_t i1 = 0; i1 < ne1; i1++) {
            // each thread processes only the rows in its half-open range [ir0, ir1)
            if (ir++ < ir0) continue;
            if (ir > ir1) break;

            for (int i0 = 0; i0 < n_dims; i0 += 2) {
                const float theta = powf(10000.0, ((float)-i0)/n_dims);
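
The excerpt ends at the theta computation. For context, the rest of the inner loop body (elided above) applies a 2D rotation to each consecutive pair of elements; a sketch of the f32 path, with the src/dst pointer arithmetic abbreviated:

                const float cos_theta = cosf(p*theta);
                const float sin_theta = sinf(p*theta);

                // src and dst_data point at the current (i3, i2, i1, i0) element pair
                const float x0 = src[0];
                const float x1 = src[1];

                dst_data[0] = x0*cos_theta - x1*sin_theta;
                dst_data[1] = x0*sin_theta + x1*cos_theta;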

A collaborator commented on the hunk above:
theta can be calculated incrementally as theta *= factor on each iteration of the loop: factor does not depend on i0, so it can be computed outside the loop as factor = powf(10000.0, ((float)-2)/n_dims), with the initial theta = p.
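
A sketch of this suggested strength reduction (theta_scale is an illustrative name; the rotation of each pair is elided):

    // hoisted out of the i0 loop: the constant ratio between successive thetas
    const float theta_scale = powf(10000.0f, -2.0f/n_dims);

    float theta = (float)p; // fold the position p into theta from the start
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float cos_theta = cosf(theta);
        const float sin_theta = sinf(theta);
        // ... rotate the (i0, i0+1) pair using cos_theta/sin_theta ...
        theta *= theta_scale; // equals p*powf(10000.0f, -(float)(i0 + 2)/n_dims)
    }

This trades one powf call per element pair for a single powf per row plus one multiply per pair.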

@ggerganov (Owner, Author) replied:

Open a PR if you observe a performance gain.

Base automatically changed from fix-cpy to master April 5, 2023 19:07
@fatemehkhoramdel commented:
How can I use multi-threaded llama on CPU?
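
For reference, CPU inference in llama.cpp is multi-threaded by default; the thread count is set with the -t (--threads) flag of the main example, e.g. ./main -m models/7B/ggml-model-q4_0.bin -p "hello" -t 8 (model path illustrative).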
