~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/conditional #616
izard started this conversation in Show and tell
-
That's great to hear! I think such a change could also help on Intel 12th and 13th gen CPUs, as I theorized here: #572 (comment)
-
Created an enhancement issue for this: #633
-
I profiled on the latest MacBook Pro and found that significantly more time is spent in atomic checks for `state_shared.has_work` in while loops than doing actual work in matrix multiply. So I changed the busy waits on `has_work`, and the code that sets `has_work`, to use a mutex and condition variable instead of atomics. Got a nice 2x speedup in time/token.
I can't post a patch/pull request because everything I do in my spare time still belongs to my employer, but the change is trivial as described above. It probably won't provide much benefit (if any) on other platforms though.
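For readers who want to see the general shape of such a change, here is a minimal sketch in C with pthreads. It is not the actual patch, which was never posted. Apart from `has_work`, every name in it (`stop`, `jobs_done`, `set_has_work`, `wait_until_idle`, `run_demo`) is hypothetical, and it assumes a single worker thread, which is simpler than the real thread pool.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared state; only has_work comes from the post. */
struct state_shared {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool has_work;   /* previously an atomic polled in a busy loop */
    bool stop;       /* assumption: asks the worker to exit */
    int  jobs_done;  /* assumption: counter used only by this demo */
};

/* Worker: sleeps on the condition variable instead of spinning on an atomic. */
static void *worker(void *arg) {
    struct state_shared *s = arg;
    pthread_mutex_lock(&s->mutex);
    for (;;) {
        while (!s->has_work && !s->stop) {
            pthread_cond_wait(&s->cond, &s->mutex);  /* releases mutex while asleep */
        }
        if (!s->has_work) {
            break;  /* stop requested and nothing pending */
        }
        pthread_mutex_unlock(&s->mutex);
        /* ... the real computation (e.g. a matrix multiply slice) would run here ... */
        pthread_mutex_lock(&s->mutex);
        s->jobs_done++;
        s->has_work = false;
        pthread_cond_broadcast(&s->cond);  /* tell the main thread we are idle */
    }
    pthread_mutex_unlock(&s->mutex);
    return NULL;
}

/* Setting has_work now takes the mutex and signals the waiter. */
static void set_has_work(struct state_shared *s, bool value) {
    pthread_mutex_lock(&s->mutex);
    s->has_work = value;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}

/* Main thread blocks until the worker has cleared has_work. */
static void wait_until_idle(struct state_shared *s) {
    pthread_mutex_lock(&s->mutex);
    while (s->has_work) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    pthread_mutex_unlock(&s->mutex);
}

/* Runs n_jobs hand-offs through one worker and returns how many completed. */
static int run_demo(int n_jobs) {
    struct state_shared s = { .has_work = false, .stop = false, .jobs_done = 0 };
    pthread_t thread;
    pthread_mutex_init(&s.mutex, NULL);
    pthread_cond_init(&s.cond, NULL);
    pthread_create(&thread, NULL, worker, &s);

    for (int i = 0; i < n_jobs; i++) {
        set_has_work(&s, true);
        wait_until_idle(&s);
    }

    pthread_mutex_lock(&s.mutex);
    s.stop = true;
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.mutex);
    pthread_join(thread, NULL);

    int done = s.jobs_done;
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
    return done;
}
```

Calling `run_demo(5)` hands five jobs to the worker and returns the number completed. On Linux this needs `-pthread` at compile time. Whether sleeping on a condition variable beats spinning depends on how long threads wait relative to the wake-up latency, so the speedup reported above should be measured per platform rather than assumed.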