~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/conditional #616

izard · 2023-03-30T04:01:52Z

izard
Mar 30, 2023

I profiled on a recent MacBook Pro and found that significantly more time is spent in the atomic checks of state_shared.has_work in the while loops than in the actual matrix multiplication work.
So I changed the busy waits to something like:

pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);

and changed the code that sets has_work to:

pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);

Got a nice 2x speedup in time/token.
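
For readers who want to try the pattern in isolation, here is a minimal sketch, not the actual ggml code. Only the has_work flag and the mutex/cond pair come from the description above; struct shared_state, n_done, worker and run_demo are hypothetical names made up for illustration. This sketch waits for the flag to become true; a loop waiting for it to clear, as in the snippet above, would be written the same way with the condition inverted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_WORKERS 16

/* Hypothetical stand-in for the shared state; only has_work is from the post. */
struct shared_state {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work; /* protected by mutex */
    int             n_done;   /* protected by mutex; illustration only */
};

/* Worker: sleep in the kernel until has_work is set, then record the wakeup. */
static void *worker(void *arg) {
    struct shared_state *s = arg;
    pthread_mutex_lock(&s->mutex);
    while (!s->has_work) { /* re-check in a loop: wakeups can be spurious */
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    s->n_done++;
    pthread_mutex_unlock(&s->mutex);
    return NULL;
}

/* Start n_workers threads, publish work once, return how many workers woke up. */
int run_demo(int n_workers) {
    assert(n_workers > 0 && n_workers <= MAX_WORKERS);

    struct shared_state s = { .has_work = false, .n_done = 0 };
    pthread_mutex_init(&s.mutex, NULL);
    pthread_cond_init(&s.cond, NULL);

    pthread_t threads[MAX_WORKERS];
    for (int i = 0; i < n_workers; i++) {
        pthread_create(&threads[i], NULL, worker, &s);
    }

    /* Publisher side: set the flag under the lock and wake every waiter. */
    pthread_mutex_lock(&s.mutex);
    s.has_work = true;
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.mutex);

    for (int i = 0; i < n_workers; i++) {
        pthread_join(threads[i], NULL);
    }

    int done = s.n_done; /* safe to read: all workers have been joined */
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
    return done;
}
```

Calling run_demo(4) from a main function should return 4 once every worker has been woken. Because the flag is both written and checked under the mutex, a worker that starts after the broadcast still sees has_work set and does not block.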

I can't post a patch or pull request because everything I do in my spare time still belongs to my employer, but the change is trivial, as described above. It probably won't provide much benefit (if any) on other platforms, though.

linouxis9 · 2023-03-30T07:16:38Z

linouxis9
Mar 30, 2023

That's great to hear! I think such a change could also help on Intel 12th and 13th gen CPUs, as I theorized here: #572 (comment)
I'll try to play a bit with that today, thanks for the idea :-)

3 replies

ggerganov Mar 30, 2023
Maintainer

Here are some additional efforts in this regard:

Attempt to improve threading in ggml whisper.cpp#343
Here I tried to have the threads created once per context instead of creating and joining them for each eval. Not much success, but it may be worth looking into again.
"Double" the performance whisper.cpp#659
Claims of improved performance, but it seems to still have a race condition / data race which causes a deadlock on some systems. I haven't had the time to look into the changes yet, but it could be the right way to go.

Overall, I am almost sure that the threading in ggml has a lot of room for improvement, but somehow I haven't found a way to achieve it. It's also tricky, because I think on some systems you might observe an improvement with one approach, while it could actually degrade performance on another (e.g. Windows vs Linux). I'm not super confident about that last statement, though.

izard Mar 30, 2023
Author

Just checked on the highest-core-count Mac Studio. Unlike the MacBook Pro with 8 P-cores (where 7 threads give the best performance: 98 ms/token with locks versus 220 ms/token with atomics), on a machine with 16 P-cores performance with 15 threads is slightly worse with locks than with atomics. So I think there must be some other way. Apple Silicon CPUs, especially in the dual-die configuration, have worse atomic scaling performance than x86.

Regarding deadlock potential: this change preserves the correctness of the original approach; it just replaces busy spinning with a return to the OS. I tested with semi-automated scripts on half a dozen Apple Silicon machines with different thread and core counts, and no deadlocks occurred (apart from some occasional strange generated output with 1 thread, but I am not sure whether threading is to blame).
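
For contrast, the busy-spinning form that the change replaces can be sketched roughly as below. This is an illustration using C11 atomics, not the ggml source: struct spin_state, n_done, spin_worker and run_spin_demo are hypothetical names, and only the has_work flag comes from the thread. Each waiting thread keeps a core busy re-reading the flag instead of sleeping in the OS.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SPINNERS 16

/* Hypothetical shared state; only the has_work name comes from the thread. */
struct spin_state {
    atomic_bool has_work;
    atomic_int  n_done; /* illustration only: counts workers that saw the flag */
};

/* Worker: burn CPU re-reading the flag until it becomes true. */
static void *spin_worker(void *arg) {
    struct spin_state *s = arg;
    while (!atomic_load(&s->has_work)) {
        /* busy wait: no syscall, the thread never yields voluntarily */
    }
    atomic_fetch_add(&s->n_done, 1);
    return NULL;
}

/* Start n_workers spinning threads, set the flag, return how many saw it. */
int run_spin_demo(int n_workers) {
    assert(n_workers > 0 && n_workers <= MAX_SPINNERS);

    struct spin_state s;
    atomic_init(&s.has_work, false);
    atomic_init(&s.n_done, 0);

    pthread_t threads[MAX_SPINNERS];
    for (int i = 0; i < n_workers; i++) {
        pthread_create(&threads[i], NULL, spin_worker, &s);
    }

    /* Publisher side: a single atomic store, no lock and no wakeup call. */
    atomic_store(&s.has_work, true);

    for (int i = 0; i < n_workers; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&s.n_done);
}
```

The spinning version reacts to the flag with very low latency but occupies a core per waiting thread, while the mutex/cond version frees the core at the cost of a sleep/wake round trip through the OS. Which one wins appears to depend on the machine and thread count, which matches the mixed measurements reported here.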

izard Mar 31, 2023
Author

Tested more with different model sizes, different prompts, and on Linux. The 2x on the MBP was an outlier. I now see that different configs show different speedups or slowdowns. So the change as is cannot be recommended, though this is a place to look at in further perf analysis.

gjmulder · 2023-03-30T19:18:53Z

gjmulder
Mar 30, 2023
Collaborator

Created an enhancement issue for this: #633

0 replies
