
~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/conditional #633

Closed
gjmulder opened this issue Mar 30, 2023 Discussed in #616 · 5 comments
Labels
enhancement (New feature or request), performance (Speed related topics), stale

Comments

@gjmulder
Collaborator

Discussed in #616

Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent on the atomic checks of state_shared.has_work in the busy-wait while loops than on the actual matrix-multiply work.
So I changed the busy waits to:

pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);

and the code that sets has_work to:

pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);

This gave a nice ~2x speedup in time per token.

I can't post a patch/pull request because everything I do in my spare time still belongs to my employer, but the change is trivial, as described above. It probably won't provide much benefit (if any) on other platforms, though.
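
For reference, here is a minimal self-contained sketch of the pattern described above. It is not the actual llama.cpp code: the state_shared layout, the stop flag, and the worker function are assumptions made only to illustrate the wait/broadcast idea.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

// Illustrative shared state: a worker sleeps on a condition variable
// instead of spinning on an atomic flag; the main thread wakes it by
// setting has_work under the mutex and broadcasting.
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
} state_shared_t;

static state_shared_t state_shared = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false
};

static void * worker(void * arg) {
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&state_shared.mutex);
        while (!state_shared.has_work && !state_shared.stop) {
            pthread_cond_wait(&state_shared.cond, &state_shared.mutex);
        }
        if (state_shared.has_work) {
            state_shared.has_work = false;
            pthread_mutex_unlock(&state_shared.mutex);
            printf("worker: doing one chunk of work\n"); // stand-in for the matmul
            continue;
        }
        // stop requested and no work pending
        pthread_mutex_unlock(&state_shared.mutex);
        return NULL;
    }
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    // hand one unit of work to the worker
    pthread_mutex_lock(&state_shared.mutex);
    state_shared.has_work = true;
    pthread_cond_broadcast(&state_shared.cond);
    pthread_mutex_unlock(&state_shared.mutex);

    // shut the worker down
    pthread_mutex_lock(&state_shared.mutex);
    state_shared.stop = true;
    pthread_cond_broadcast(&state_shared.cond);
    pthread_mutex_unlock(&state_shared.mutex);

    pthread_join(tid, NULL);
    return 0;
}

One likely reason for the mixed results reported later in the thread is the usual spin-vs-sleep tradeoff: sleeping on a condition variable saves CPU when waits are long, but adds wake-up latency when work arrives almost immediately.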

@gjmulder added the enhancement and performance labels on Mar 30, 2023
@izard

izard commented Mar 31, 2023

I tested more with different model sizes, different prompts, and on Linux. The 2x on the MBP was an outlier; different configs show different speedups or slowdowns. So the change as it stands can't be recommended, though it is a place to look at in further perf analysis.

@prusnak
Collaborator

prusnak commented Mar 31, 2023

Tested more with different model sizes,

Can you create a draft pull request anyway (with a note that you don't want it merged, just to share the code with others), so we can test as well?

@bogdad
Contributor

bogdad commented Mar 31, 2023

^ I gave it a try. It works a bit slower on my machine with 7B and a very small n_predict, but maybe it can be improved; I did a mechanical rewrite without much thought. It's a draft and doesn't work on Windows, so I'm keeping it in my fork. Feel free to use it :)

@bogdad
Contributor

bogdad commented Apr 2, 2023

#710 is an attempt to switch to an existing C thread pool. Eval timings drop slightly, but CPU usage goes from 700% to 400% on 8 threads, which I guess is nice for mobile usage. The PR also includes timings; I hope that's useful.

@github-actions bot added the stale label on Mar 25, 2024
This issue was closed because it has been inactive for 14 days since being marked as stale.
