
~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/conditional #633

Closed
gjmulder opened this issue Mar 30, 2023 Discussed in #616 · 5 comments
Labels
enhancement (New feature or request), performance (Speed related topics), stale

Comments

@gjmulder
Collaborator

Discussed in #616

Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent on the atomic checks of state_shared.has_work in the busy-wait while loops than on the actual matrix-multiply work.
So I changed the busy waits to:

pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);

and the code that sets has_work to:

pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);

This gave a nice ~2x speedup in time per token.

I can't post a patch/pull request because everything I do in my spare time still belongs to my employer, but the change is trivial, as described above. It probably won't provide much benefit (if any) on other platforms, though.
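
For reference, here is a minimal self-contained sketch of the pattern described above. It is not the actual llama.cpp code: the state_shared layout, the stop flag, and the worker function are assumptions made only to illustrate the wait/broadcast idea.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

// Illustrative shared state: a worker sleeps on a condition variable
// instead of spinning on an atomic flag; the main thread wakes it by
// setting has_work under the mutex and broadcasting.
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
} state_shared_t;

static state_shared_t state_shared = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false
};

static void * worker(void * arg) {
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&state_shared.mutex);
        while (!state_shared.has_work && !state_shared.stop) {
            pthread_cond_wait(&state_shared.cond, &state_shared.mutex);
        }
        if (state_shared.has_work) {
            state_shared.has_work = false;
            pthread_mutex_unlock(&state_shared.mutex);
            printf("worker: doing one chunk of work\n"); // stand-in for the matmul
            continue;
        }
        // stop requested and no work pending
        pthread_mutex_unlock(&state_shared.mutex);
        return NULL;
    }
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    // hand one unit of work to the worker
    pthread_mutex_lock(&state_shared.mutex);
    state_shared.has_work = true;
    pthread_cond_broadcast(&state_shared.cond);
    pthread_mutex_unlock(&state_shared.mutex);

    // shut the worker down
    pthread_mutex_lock(&state_shared.mutex);
    state_shared.stop = true;
    pthread_cond_broadcast(&state_shared.cond);
    pthread_mutex_unlock(&state_shared.mutex);

    pthread_join(tid, NULL);
    return 0;
}

One likely reason for the mixed results reported later in the thread is the usual spin-vs-sleep tradeoff: sleeping on a condition variable saves CPU when waits are long, but adds wake-up latency when work arrives almost immediately.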

@gjmulder added the enhancement and performance labels on Mar 30, 2023
@izard

izard commented Mar 31, 2023

I tested more with different model sizes, different prompts, and on Linux. The 2x on the MBP was an outlier; different configs show different speedups or slowdowns. So the change as it stands can't be recommended, though it is a place to look at in further perf analysis.

@prusnak
Collaborator

prusnak commented Mar 31, 2023

Tested more with different model sizes,

Can you create a draft pull request anyway (with a note that you don't want it merged, just to share the code with others), so we can test as well?

@bogdad
Contributor

bogdad commented Mar 31, 2023

^ I gave it a try. It works a bit slower on my machine with 7B and a very small n_predict, but maybe it can be improved; I did a mechanical rewrite without much thought. It's a draft and doesn't work on Windows, so I'm keeping it in my fork. Feel free to use it :)

@bogdad
Contributor

bogdad commented Apr 2, 2023

#710 is an attempt to switch to an existing C thread pool. Eval timings drop slightly, but CPU usage goes from 700% to 400% on 8 threads, which I guess is nice for mobile usage. The PR also includes timings; I hope that's useful.

@github-actions bot added the stale label on Mar 25, 2024
This issue was closed because it has been inactive for 14 days since being marked as stale.
