
ggml : multi-thread ggml_rope() (~3-4 times faster on M1) #781

Merged: 1 commit into master on Apr 5, 2023

Conversation

@ggerganov (Owner) commented Apr 5, 2023

🤖 Generated by Copilot at 625f212

Summary

🧶🚀🧑‍💻

This pull request improves the performance of the ggml library by parallelizing the rope operation on tensors and by modifying the graph executor to handle parallel tasks. It affects the file ggml.c.

We'll hoist the sail with a rope operation
ggml_compute_forward_rope_f32 and f16
We'll work in parallel, no more hesitation
On the count of three, pull hard, me friends

Walkthrough

  • Parallelize the rope operation for both f32 and f16 data types by dividing the input rows among the available threads (see the sketch after this list)
  • Remove the invalid assertions that the thread index is zero from the ggml_compute_forward_rope_f32 and ggml_compute_forward_rope_f16 functions
  • Set the number of tasks for the rope operation node to the number of threads in ggml.c
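
A minimal sketch of the row split described in the first bullet, following the usual ggml work-split pattern (ith is the thread index, nth the thread count, and nr the total number of rows; MIN as defined in ggml.c; the exact nr computation in ggml_rope is abbreviated here):

    // rows per thread, rounded up so that all nr rows are covered
    const int dr = (nr + nth - 1)/nth;

    // half-open row range [ir0, ir1) owned by thread ith
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    // running row counter; the loop nest skips rows outside [ir0, ir1)
    int ir = 0;

Each thread walks the same loop nest and processes only the rows whose running index falls in its range, which keeps the loop structure unchanged.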

@ggerganov mentioned this pull request Apr 5, 2023
@prusnak (Collaborator) left a comment


Tested on M1.

Before the change (master 5a8c4f6) - second run:

llama_print_timings:        load time =  2294.47 ms
llama_print_timings:      sample time =   116.82 ms /   128 runs   (    0.91 ms per run)
llama_print_timings: prompt eval time =  2090.16 ms /     8 tokens (  261.27 ms per token)
llama_print_timings:        eval time = 23215.24 ms /   127 runs   (  182.80 ms per run)
llama_print_timings:       total time = 25628.04 ms

After the change (commit 625f212) - second run:

llama_print_timings:        load time =  1460.12 ms
llama_print_timings:      sample time =    94.96 ms /   128 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =  1272.30 ms /     8 tokens (  159.04 ms per token)
llama_print_timings:        eval time = 21661.06 ms /   127 runs   (  170.56 ms per run)
llama_print_timings:       total time = 23217.51 ms

for (int64_t i3 = 0; i3 < ne3; i3++) {
    for (int64_t i2 = (mode == 0 ? 0 : n_past); i2 < ne2; i2++) {
        const int p = (mode == 0 ? n_past + i2 : i2);
        for (int64_t i1 = 0; i1 < ne1; i1++) {
            // each thread processes only the rows in its half-open range [ir0, ir1)
            if (ir++ < ir0) continue;
            if (ir > ir1) break;

            for (int i0 = 0; i0 < n_dims; i0 += 2) {
                const float theta = powf(10000.0, ((float)-i0)/n_dims);
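
The excerpt ends at the theta computation. For context, the rest of the inner loop body (elided above) applies a 2D rotation to each consecutive pair of elements; a sketch of the f32 path, with the src/dst pointer arithmetic abbreviated:

                const float cos_theta = cosf(p*theta);
                const float sin_theta = sinf(p*theta);

                // src and dst_data point at the current (i3, i2, i1, i0) element pair
                const float x0 = src[0];
                const float x1 = src[1];

                dst_data[0] = x0*cos_theta - x1*sin_theta;
                dst_data[1] = x0*sin_theta + x1*cos_theta;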

A collaborator commented on the hunk above:
theta can be calculated incrementally as theta *= factor on each iteration of the loop: factor does not depend on i0, so it can be computed outside the loop as factor = powf(10000.0, ((float)-2)/n_dims), with the initial theta = p.
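
A sketch of this suggested strength reduction (theta_scale is an illustrative name; the rotation of each pair is elided):

    // hoisted out of the i0 loop: the constant ratio between successive thetas
    const float theta_scale = powf(10000.0f, -2.0f/n_dims);

    float theta = (float)p; // fold the position p into theta from the start
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float cos_theta = cosf(theta);
        const float sin_theta = sinf(theta);
        // ... rotate the (i0, i0+1) pair using cos_theta/sin_theta ...
        theta *= theta_scale; // equals p*powf(10000.0f, -(float)(i0 + 2)/n_dims)
    }

This trades one powf call per element pair for a single powf per row plus one multiply per pair.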

@ggerganov (Owner, Author) replied:

Open a PR if you observe a performance gain.

Base automatically changed from fix-cpy to master April 5, 2023 19:07
@fatemehkhoramdel commented:
How can I use multi-threaded llama on CPU?
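
For reference, CPU inference in llama.cpp is multi-threaded by default; the thread count is set with the -t (--threads) flag of the main example, e.g. ./main -m models/7B/ggml-model-q4_0.bin -p "hello" -t 8 (model path illustrative).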
