
Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication #439

Merged Mar 23, 2023 (1 commit)

Conversation

ggerganov (Owner) commented Mar 23, 2023

ref #249 #95

See #407 (comment)

The mul_mat() implementation currently has two main branches:

  • src0 is contiguous in memory (code)
  • src0 is not contiguous in memory (code)

In the first branch we parallelize the computation along the src0 rows. Each thread computes the dot product of a src0 row with a src1 column and writes the result into a cell of dst.

In the second branch we parallelize along the src1 columns. Each thread computes multiply + add (mad) of a src0 column with an element from src1 and writes the result into a per-thread temporary buffer row. At the end of the multiplication, the results from all temporary buffers are accumulated into dst.
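For illustration, here is a minimal single-threaded sketch of the two accumulation strategies in plain row-major C. This is not the actual ggml code: it ignores threading, strides, quantization, and the per-thread buffers, and the function names are made up for this example.

```c
#include <assert.h>

/* Branch 1 style: one dot product per output cell.
 * A is n x k (src0), B is k x m (src1), C is n x m (dst), all row-major. */
static void mul_mat_dot(int n, int k, int m,
                        const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += A[i*k + p] * B[p*m + j];
            }
            C[i*m + j] = sum;
        }
    }
}

/* Branch 2 style: scale a column of A by an element of B and
 * multiply-add (mad) it into the output. In ggml the accumulation
 * goes into per-thread temporary buffers that are summed at the end. */
static void mul_mat_mad(int n, int k, int m,
                        const float *A, const float *B, float *C) {
    for (int i = 0; i < n*m; i++) C[i] = 0.0f;
    for (int j = 0; j < m; j++) {
        for (int p = 0; p < k; p++) {
            const float b = B[p*m + j];
            for (int i = 0; i < n; i++) {
                C[i*m + j] += A[i*k + p] * b; /* mad */
            }
        }
    }
}
```

Both compute the same product; they differ only in the order in which the k partial terms of each dst cell are summed, which is exactly where the threading-dependent behavior described below comes from.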

The second branch produces results that vary with the number of threads, since each dst cell is accumulated from a different number of floating-point partial sums depending on the thread count. It is slightly more efficient, but also uses much more memory due to the temporary buffers.
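The thread-count dependence comes from floating-point addition not being associative: summing the same terms in a different grouping can round differently. A classic minimal example:

```c
#include <assert.h>

/* Floating-point addition is not associative: regrouping the same
 * three terms changes the rounded result. Near 1e8 the float ulp is 8,
 * so -1e8f + 1.0f rounds back to -1e8f. */
static float sum_left(float x, float y, float z)  { return (x + y) + z; }
static float sum_right(float x, float y, float z) { return x + (y + z); }
```

Splitting a dst cell's sum across a different number of threads changes the grouping of the partial sums in exactly this way, which is why the result depends on the thread count.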

I am thinking that, to get more stable results and to significantly simplify the code in ggml.c, we should eliminate this second branch. The solution is to always make sure that src0 is contiguous, which the user can always achieve with a simple ggml_cpy() call.
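The contiguity fix itself is just a copy: the strided (e.g. transposed) view is materialized into a fresh row-major buffer before the multiplication, which is what a ggml_cpy() into a contiguous tensor achieves. A simplified stand-in sketch (not the ggml implementation; strides here are in elements, whereas ggml's nb[] strides are in bytes):

```c
#include <stddef.h>

/* Copy an arbitrarily strided 2-D view into a contiguous row-major
 * buffer. Element (i, j) of the view lives at view[i*s0 + j*s1], so a
 * transposed view is just a particular (s0, s1) choice. */
static void make_contiguous(int rows, int cols,
                            ptrdiff_t s0, ptrdiff_t s1,
                            const float *view, float *out) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            out[(ptrdiff_t)i*cols + j] = view[i*s0 + j*s1];
        }
    }
}
```

After such a copy, src0 always takes the fast contiguous path, at the cost of one extra pass over the data.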

The benefits are substantial:

  • the ggml_vec_mad_xxx() functions no longer need to be maintained and can simply be deleted
  • reproducible results regardless of the number of threads
  • simpler ggml_forward_mul_mat_xxx() implementations
  • lower memory usage

The drawback:

  • a very slight performance degradation

@ggerganov ggerganov changed the title Avoid the transposed X branch in the Z = X * Y matrix multiplication Avoid the "transposed X" branch in the Z = X * Y matrix multiplication Mar 23, 2023
@ggerganov ggerganov marked this pull request as ready for review March 23, 2023 19:33
@ggerganov ggerganov changed the title Avoid the "transposed X" branch in the Z = X * Y matrix multiplication Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication Mar 23, 2023
ggerganov (Owner, Author)

@blackhole89 bringing your attention to this

@IlyaBizyaev
For me, this change increases memory usage so drastically that main crashes with ggml_new_tensor_impl: not enough space in the context's memory pool during the initial prompt (the machine has 64 GB, and before this change just 8 GB was enough).

Ubuntu 22.04.2 LTS, GCC 12

KerfuffleV2 added a commit to KerfuffleV2/llama-rs that referenced this pull request Mar 24, 2023
KerfuffleV2 added a commit to KerfuffleV2/llama-rs that referenced this pull request Mar 26, 2023
philpax pushed a commit to rustformers/llm that referenced this pull request Mar 26, 2023
ocrickard added a commit to ocrickard/llama.cpp that referenced this pull request Apr 4, 2023
ocrickard added a commit to ocrickard/llama.cpp that referenced this pull request Apr 4, 2023
ocrickard added a commit to ocrickard/llama.cpp that referenced this pull request Apr 4, 2023
sw added a commit to sw/llama.cpp that referenced this pull request Apr 4, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024