macos / ARM support for vllm #2244
Conversation
Is there scope to make use of the Metal backend to improve performance?
My original goal for this project was to spend one day getting something I could use for development on macOS. Using CPU vector instructions was sufficient to achieve that goal, so I won't promise to put effort into developing a Metal backend.

One good thing this change does is add a --device command line option: in theory, it might be possible to pass …

The good news is there is now a …

Edit: I want to be clear and temper expectations. Once again, my goal was merely to be able to run vLLM locally and speed up development. This code does not implement GPTQ support, so in terms of CPU execution you would be much better served by llama.cpp, which can achieve 20 TPS of Mistral-7B at 4-bit on a MacBook, rather than this version, which achieves 0.7 TPS as bfloat16.
Np, thanks for explaining.
hi @pathorn, have you looked at https://github.com/ml-explore/mlx ? I think it would give better performance running on Mac without much effort and would utilize Unified Memory on Apple Silicon devices. Do you think it would be a relatively simple task to add a wrapper for mlx and use it as a vllm backend for Mac devices? I'm looking into that, but my limited understanding of ML development makes it look like not an easy task for me. They have many examples here, including running inference on a Mac: https://github.com/ml-explore/mlx-examples. What I can probably do best is use the interface from the wrapper and connect it with the OpenAI API endpoints.
Is this issue resolved and merged? I am trying to install the vllm package on my Mac M1 via pip and I get the following error:
+1 |
Rebased onto latest vllm. Tested with OPT-175B. Still missing some ops, such as …
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: "jiang1.li" <jiang1.li@intel.com>
Rebased and updated with gelu ops for Gemma. google/gemma-2b runs pretty well on my MacBook. Updated the build instructions. Does not currently install via pip due to the need to add the …
mlc-llm/TVM is perhaps more suited for this. Tested on M1 with really good performance.
Built on top of a rebased version of:
Build instructions:
Make sure to install OpenMP from https://mac.r-project.org/openmp/
Not using Docker; tested in a miniconda3 env.
To build:
Just the .so:
Build the full thing:
(NOTE: The .so file will not load on macOS if your current working directory is the repository root. Change directory before running.)
Run the API server with …
Testing
Tested a few prompts on Mistral-7B and verified against the output returned by DeepInfra.
Mistral-7B gets about 0.6 tokens per second on my plain MacBook M3. It's not going to be as fast as a dedicated GPU, and it does not support GPTQ (e.g. 4-bit quantization), so llama.cpp knocks this PR out of the park in terms of performance, getting around 20 TPS on the same MacBook hardware.
I do also hope this PR serves as documentation for some of the new bfloat16 ARM instructions added in recent years, for example `vcvtq_high_bf16_f32(vcvtq_low_bf16_f32(a), b)` to convert between fp32 and bf16, or the implementation of fused multiply-add, which took a few hours of research because the `vbfmlaltq` and `vbfmlalbq` bfloat instructions (the ARM equivalent of `_mm512_dpbf16_ps` ("DPBF16PS") on AVX512) are almost completely ungooglable. I think I was a bit intrigued to be one of the first people to use these ARMv8.6-a+bf16 instructions, or to write about them:
("DPBF16PS") on AVX512 being almost completely ungooglable. I think I was a bit intrigued to be one of the first people to use these ARMv8.6-a+bf16 instructions or write about them:Fixes