Apple M1 metal lag #1730
Comments
What device are you running on? Unless it’s an M1 Ultra you should be running 10 or fewer threads. |
It's an M1 Ultra, it runs great with 20 threads to start, but only runs on 4 threads following the first GPU use. |
The pause occurs when the context becomes full. When this happens, we pick roughly the second half of the context and reprocess it in order to free up half the context for new generation. The reprocessing currently does not use Metal, as we haven't implemented efficient matrix x matrix kernels, so we simply fall back to the standard non-GPU implementation. It runs on the CPU, while the heavy matrix multiplications are done with Apple Accelerate's CBLAS, which allegedly utilizes the AMX coprocessor. The CPU is therefore barely occupied during this period, as the AMX does the heavy lifting. AMX utilization cannot be monitored with standard activity monitoring tools, so you won't see it in Activity Monitor. |
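As a rough illustration of the swap described above, here is a minimal Python sketch. It is not the project's actual code: the function name `swap_context` and the `n_keep` parameter (a number of leading tokens to always retain) are assumptions made for the example. It only shows which tokens would need to be reprocessed once the context is full.

```python
def swap_context(tokens, n_ctx, n_keep=0):
    """Return the tokens to re-evaluate once the context is full.

    Hypothetical helper for illustration only: keeps the first n_keep
    tokens plus roughly the second half of everything after them.
    """
    if len(tokens) < n_ctx:
        # Context is not full yet, so nothing is dropped.
        return list(tokens)

    n_left = len(tokens) - n_keep                      # tokens eligible for dropping
    kept_prefix = list(tokens[:n_keep])                # always retained
    recent_half = list(tokens[len(tokens) - n_left // 2:])  # most recent half
    return kept_prefix + recent_half


if __name__ == "__main__":
    full = list(range(8))
    print(swap_context(full, n_ctx=8))            # second half of the context
    print(swap_context(full, n_ctx=8, n_keep=2))  # first two tokens plus recent half
```

The returned tokens are what has to be evaluated again as a batch, which is the step that currently runs without Metal and produces the pause.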
Are there any future plans to address this issue? Or does it even seem fixable? I just bought an M2 Ultra with 128 GB of RAM hoping it would be a great solution for LLM inference, and this issue leaves me semi dead in the water: nearly 10 tokens / second of generation averages out to less than 1 token / second once the pauses are taken into account :( |
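To see how the pauses can drag down the average, here is a back-of-the-envelope sketch. It assumes a simplified model in which each swap cycle generates about half a context of new tokens and then reprocesses about half a context; the function name and the example rates are hypothetical, not measurements.

```python
def effective_tokens_per_second(gen_rate, reprocess_rate):
    """Average generation speed over one swap cycle.

    Assumed model: N/2 tokens are generated at gen_rate, then N/2 tokens
    are reprocessed at reprocess_rate. The context size N cancels out:
    (N/2) / ((N/2)/gen_rate + (N/2)/reprocess_rate).
    """
    if gen_rate <= 0 or reprocess_rate <= 0:
        raise ValueError("rates must be positive")
    return 1.0 / (1.0 / gen_rate + 1.0 / reprocess_rate)


if __name__ == "__main__":
    # Hypothetical rates in tokens per second.
    print(effective_tokens_per_second(10.0, 1.0))
    print(effective_tokens_per_second(10.0, 10.0))
```

Under this model the slower of the two rates dominates, so a reprocessing rate near 1 token / second pulls the average below 1 token / second even when generation itself runs at 10.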
@ggerganov First, thank you for the explanation, and thank you for initiating such a remarkable project here. @dogjamboree The latest builds of oobabooga/text-generation-ui address this performance issue. I recommend installing it along with abetlen/llama-cpp-python to drive it. Make sure to install with this:
When loading the model in text-generation-ui, make sure to set n_gpus > 1 and use all your threads. I'm getting 1421.63 tokens per second on sample time, 1.79 tokens per second for prompt eval time, and 7.35 tokens per second for full eval. The CPU lag between GPU "bursts" is almost gone. Note that the above metrics are for Guanaco 65B on an M1 Ultra. I suspect your chip will show a speed boost of 20% or so over my baseline. |
@leedrake5 @dogjamboree Worst case scenario, when we implement a fast Metal |
Not sure - I was able to generate 2k tokens with no interruption. Video here showing its performance in contrast to the command-line resource pattern shown in the OG post. The exact same 65B Guanaco is used for both instances. I'm a bit mystified: on the command line I get the context swap pauses, but with text-generation-ui + llama_cpp-python they aren't visible in GPU/CPU utilization. So either there is some default that text-generation-ui is passing and I am not using on the command line, or llama-cpp-python has a unique solution. Here are the ggml_metal details in case there is anything useful:
|
The command-line tool uses a context of 512 tokens by default. You can increase this up to 2048 by using the … Still, when the 2048 context becomes full, it will do the swap and cause a pause. |
@ggerganov llama-cpp-python (which text gen ui uses) implements the additional caching: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L865 |
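For readers unfamiliar with the idea, here is a generic sketch of prompt-prefix caching. It is an illustration only and is not taken from the linked llama-cpp-python code: the function names are hypothetical, and the assumption is simply that tokens shared with a previously evaluated prompt do not need to be evaluated again.

```python
def common_prefix_length(cached_tokens, new_tokens):
    """Count how many leading tokens the two sequences share."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_evaluate(cached_tokens, new_tokens):
    """Return only the suffix of new_tokens not covered by the cache."""
    shared = common_prefix_length(cached_tokens, new_tokens)
    return list(new_tokens[shared:])


if __name__ == "__main__":
    cached = [1, 2, 3]
    print(tokens_to_evaluate(cached, [1, 2, 3, 4, 5]))  # only the new tail
    print(tokens_to_evaluate(cached, [1, 9, 3]))        # diverges after one token
```

A cache like this helps when consecutive prompts share a long prefix, as in a chat session. It does not by itself remove the cost of reprocessing after a context swap, since the swapped context no longer starts with the same tokens.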
@ggerganov Yup - that definitely opens up a lot more GPU usage, but no free lunch. Caching takes much longer for the exact same reason. I am curious whether @AlphaAtlas's point about the self.cache property of llama-cpp-python is a workaround to caching in general, though this is far from my area of expertise. It looks like the definition of |
I still get an error on Metal (RX 560, macOS Ventura 13.4): ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x7f9fcd80d2e0 | th_max = 768 | th_width = 64 |
I also still get an error on Metal (Intel CPU, AMD 5500M, macOS Ventura 13.5.1, llama-cpp-python 0.1.83). |
llm_load_print_meta: model size = 13.02 B
Any help would be appreciated.
So for some reason we have the line:
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x0 | th_max = 0 | th_width = 0
Why is it trying to load a nullptr? That would explain why it all fails. The relevant error is from ggml-metal.m, line 209. There is a lot of macro magic. Not sure why GGML_METAL_ADD_KERNEL(cpy_f32_f32); returns null, and why rope, alibi, and f32_f16 are skipped. |
@ZacharyDK probably it is the wrong file format ( |
I've downloaded the Llama 2 model from scratch, then converted and quantized it with:
Then I got exactly the same error on an Apple iMac (Intel):
When I disable GPU usage with "--gpu-layers 0", the exact same model works just fine. So the problem with "There is a call to an undefined label" and the null has something to do with Metal/GPU support. |
Ah yes, this is mentioned here #3129 (comment) as well. One workaround is to disable Metal and enable CLBlast, which not only gives you GPU acceleration (in my case 20x faster loading times on an Intel i5 iMac) but also keeps the ability to offload layers to the GPU. |
I was using the quantized models, loaded with the llama_cpp python library. Unless something was fixed recently, you have to stick with C++. Llama_cpp python will keep trying to use metal even when you specify you don't want metal in the install settings... |
The original issue posted here has been resolved via #3228 |
Thanks for your reply. That's quite impressive and I'm going to try it right now! |
Prefacing that this isn't urgent. When using the recently added M1 GPU support, I see odd behavior in system resource use. When using all threads with -t 20, the first initialization follows the instruction. However, when there is a pause in GPU use, only about 4 threads are used regardless of the tag.
Video showing response (Guanaco 65B) and system resource use: https://youtu.be/ysA7xg6nevY
Apologies for the cringe prompt, but wanted to test accuracy (points for remembering Wayne was a founder, but Apple Watch was released in 2014, not 2015). Some parameters (batch size) are weird, but behavior is the same regardless of this integer.