macos / ARM support for vllm #2244
Conversation
Is there scope to make use of the Metal backend to improve performance?
My original goal for this project was to spend one day getting something I could use for development on macOS. Using CPU vector instructions was sufficient to achieve that goal, so I won't promise to put effort into developing a Metal backend.

One good thing this change does is add a --device command line option: in theory, it might be possible to pass …

The good news is there is now a …

Edit: I want to be clear and temper expectations. Once again, my goal was merely to be able to run vLLM locally and speed up development. This code does not implement GPTQ support, so in terms of CPU execution you would be much better served by llama.cpp, which can achieve 20 TPS of Mistral-7B at 4-bit on a MacBook, rather than this version, which achieves 0.7 TPS as bfloat16.
Np, thanks for explaining.
hi @pathorn, have you looked at https://github.com/ml-explore/mlx ? I think it would give better performance running on Mac without much effort and would utilize Unified Memory on Apple Silicon devices. Do you think it would be a relatively simple task to add a wrapper for mlx and use it as a vllm backend for Mac devices? I'm looking into that, but my limited understanding of ML development makes it look like not an easy task for me. They have many examples here, including running inference on a Mac: https://github.com/ml-explore/mlx-examples. What I can probably do best is use the interface from the wrapper and connect it with the OpenAI API endpoints.
Is this issue resolved and merged? I am trying to install the vllm package on my Mac M1 via pip and I get the following error:
+1 |
Rebased onto latest vllm. Tested with OPT-175B. Still missing some ops, such as …
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: "jiang1.li" <jiang1.li@intel.com>
Rebased and updated with gelu ops for Gemma. google/gemma-2b runs pretty well on my MacBook. Updated the build instructions. Does not currently install via pip due to the need to add the …
mlc-llm/TVM is perhaps more suited for this. Tested on M1 with really good performance.
Built on top of a rebased version of:
Build instructions:
Make sure to install OpenMP from https://mac.r-project.org/openmp/
Not using Docker; tested in a miniconda3 env.
To build:
Just the .so:
Build the full thing:
(NOTE: The .so file will not load on macOS if your current working directory is the repository root. Change directory before running.)
Run the API server with …
Testing
Tested a few prompts on Mistral-7B and verified against the output returned by DeepInfra.
Mistral-7B gets about 0.6 tokens per second on my plain MacBook M3. It's not going to be as fast as a dedicated GPU, and it does not support GPTQ (e.g. 4-bit quantization), so llama.cpp knocks this PR out of the park in terms of performance, getting around 20 TPS on the same MacBook hardware.
I do also hope this PR serves as documentation for some of the new bfloat16 ARM instructions added in recent years, for example `vcvtq_high_bf16_f32(vcvtq_low_bf16_f32(a), b)` to convert between fp32 and bf16, or the implementation of fused multiply-add, which took a few hours of research because the `vbfmlaltq` and `vbfmlalbq` bfloat instructions (the ARM equivalent of `_mm512_dpbf16_ps` ("DPBF16PS") on AVX512) are almost completely ungooglable. I think I was a bit intrigued to be one of the first people to use these ARMv8.6-a+bf16 instructions, or to write about them:
("DPBF16PS") on AVX512 being almost completely ungooglable. I think I was a bit intrigued to be one of the first people to use these ARMv8.6-a+bf16 instructions or write about them:Fixes