add llama_matmul_demo2_bf16.c with another parallelization experiment #586

Open
wants to merge 1 commit into base: llama-matmul

Conversation

Djip007 (Contributor) commented Oct 12, 2024

You were not far from good speed.

But if you look at the BLIS paper, bloc_A needs to be kept in the L2 cache. On x86 CPUs (Zen, ...) there is one L2 cache per core, so each block computed on a core needs its own bloc_A.

With this demo I get a 1.77x speedup on my 8-core Zen 4 (AMD Ryzen 9 7940HS).

Note: there is more to do for best performance:

  • B is broadcast, so I think there is no need to transpose it.
  • bloc_B needs to be kept in the L3 cache. That is the case on my 8-core Zen 4, but not on 16+ core Zen 4 and Zen 2 parts (one L3 per 4 cores...); for best results we could have one bloc_B per L3 cache.
  • Next, if N is big enough we can also parallelize the first loop:
for (int j = ith * NC; j < N; j += NT) { 
    [...]
}

With this approach we keep bloc_A in the L2 cache, while bloc_B is shared and kept in the L3/L1 caches (see the sketch below).
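
A rough sketch of that scheme, not the demo's actual code: gemm_blocked, the block_A argument, the ith/nth split and the plain-C inner loop below are illustrative stand-ins for the bf16 packing and SIMD micro-kernels in llama_matmul_demo2_bf16.c.

#include <string.h>

#define MR 16
#define NR 16
#define MC (MR * 16)
#define NC (NR * 64)

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* C[M x N] += A[M x K] * B[K x N], row-major, executed by thread `ith` of `nth`.
 * Threads split the M dimension into interleaved MC-row blocks, so each thread
 * owns its block_A (per-core L2), while all threads read the same B (L3). */
static void gemm_blocked(int M, int N, int K,
                         const float *A, const float *B, float *C,
                         float *block_A,   /* per-thread buffer of MC*K floats */
                         int ith, int nth)
{
    for (int ic = ith * MC; ic < M; ic += nth * MC) {   /* this thread's row blocks */
        int mc = MIN(MC, M - ic);

        /* "pack_A": copy this thread's rows so they stay hot in its core's L2 */
        for (int i = 0; i < mc; i++)
            memcpy(&block_A[(size_t)i * K], &A[(size_t)(ic + i) * K],
                   (size_t)K * sizeof(float));

        for (int jc = 0; jc < N; jc += NC) {             /* column blocks of shared B */
            int nc = MIN(NC, N - jc);

            /* A real implementation calls an MR x NR SIMD micro-kernel here. */
            for (int i = 0; i < mc; i++)
                for (int j = 0; j < nc; j++) {
                    float acc = C[(size_t)(ic + i) * N + (jc + j)];
                    for (int k = 0; k < K; k++)
                        acc += block_A[(size_t)i * K + k] * B[(size_t)k * N + (jc + j)];
                    C[(size_t)(ic + i) * N + (jc + j)] = acc;
                }
        }
    }
}

Each worker would call gemm_blocked with its own ith and its own block_A buffer, so the packed rows stay resident in that core's private L2 while every worker streams the same B, which ideally stays in the shared L3.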
Djip007 (Contributor, Author) commented Oct 12, 2024

I finally found some time to make some "comments" on this branch.

I did not change the existing code, I just added a demo to show what we can get.
It may not be perfectly clean, but I hope it is useful for this experiment.

Note: I use these 5 loops in my fp8 branch, but for best performance I repack A at weight-load time, so there is no longer any need to do it in the compute part. That requires creating a completely new backend to have control over the backend_buffer...

Djip007 (Contributor, Author) commented Oct 13, 2024

One more advantage: with bloc_A/bloc_B we can (if we need to) do the dequantization inside the pack_ step, which makes the blocks directly usable by the CPU GEMM compute.

That way it may be easy to accept more input types... (a sketch of the idea is below).
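
A minimal sketch of what "dequantize inside pack_" could look like; the helper names and the bf16-to-fp32 choice are mine, not the PR's code.

#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;

/* bf16 is the high 16 bits of an IEEE fp32, so conversion is a shift. */
static inline float bf16_to_f32(bf16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Pack (and dequantize) an mc x kc tile of bf16 A into a dense fp32 block,
 * so the micro-kernel only ever sees one element format.
 * lda is the row stride of A in elements. */
static void pack_A_bf16(const bf16_t *A, int lda, int mc, int kc, float *block_A) {
    for (int i = 0; i < mc; i++)
        for (int k = 0; k < kc; k++)
            block_A[(size_t)i * kc + k] = bf16_to_f32(A[(size_t)i * lda + k]);
}

Since A has to be copied into block_A anyway, the type conversion comes almost for free, and supporting another input type only means adding another pack_ variant.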

With some more tests I get the best speed on my CPU (x2.05) with:

#define MR 16
#define NR 16
#define MC MR*16
#define NC NR*64
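
For reference, a back-of-the-envelope check of how these block sizes relate to the cache-residency argument above; the K-slice length kc and the fp32 packing size are assumptions of mine, not values from the PR.

#include <stdio.h>

#define MR 16
#define NR 16
#define MC (MR * 16)   /* 256 rows per packed A block   */
#define NC (NR * 64)   /* 1024 cols per packed B block  */

int main(void) {
    int kc = 512;   /* assumed K-slice length, not from the PR */
    /* Zen 4 has 1 MiB of private L2 per core; the 7940HS has 16 MiB of shared L3. */
    double a_mib = (double)MC * kc * sizeof(float) / (1024.0 * 1024.0);
    double b_mib = (double)NC * kc * sizeof(float) / (1024.0 * 1024.0);
    printf("block_A %.2f MiB (per-core L2), block_B %.2f MiB (shared L3)\n",
           a_mib, b_mib);
    return 0;
}

Under these assumptions block_A is about 0.5 MiB, comfortably inside one core's L2, and block_B is about 2 MiB, well inside the shared L3, which is consistent with the speedups reported above.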
