WIP upstream #53

Closed
wants to merge 102 commits
Commits
3827084
[Bugfix] Fix for inconsistent behaviour related to sampling and repet…
tdoublep Jun 18, 2024
c9e728e
[Doc] Added cerebrium as Integration option (#5553)
milo157 Jun 18, 2024
acd3ed8
[Bugfix] Fix CUDA version check for mma warning suppression (#5642)
tlrmchlsmth Jun 18, 2024
4138d6a
[Bugfix] Fix w8a8 benchmarks for int8 case (#5643)
tlrmchlsmth Jun 19, 2024
edd595d
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628)
ShukantPal Jun 19, 2024
6bddc7c
[Bugfix] Added test for sampling repetition penalty bug. (#5659)
tdoublep Jun 19, 2024
70b2a3a
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate…
hongxiayang Jun 19, 2024
59305a2
[misc][distributed] use 127.0.0.1 for single-node (#5619)
youkaichao Jun 19, 2024
974e116
[Model] Add FP8 kv cache for Qwen2 (#5656)
mgoin Jun 19, 2024
f1e5642
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684)
Isotr0py Jun 19, 2024
5da4334
[Misc]Add param max-model-len in benchmark_latency.py (#5629)
DearPlanet Jun 19, 2024
ebd2a34
[CI/Build] Add tqdm to dependencies (#5680)
DarkLight1337 Jun 19, 2024
cac911b
[ci] Add A100 queue into AWS CI template (#5648)
khluu Jun 19, 2024
c171a59
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg…
mgoin Jun 19, 2024
641d67f
[ci][distributed] add tests for custom allreduce (#5689)
youkaichao Jun 19, 2024
2e8f069
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)
zifeitong Jun 19, 2024
36847f7
[Doc] Update docker references (#5614)
rafvasq Jun 19, 2024
86a7ee1
[Misc] Add per channel support for static activation quantization; up…
dsikka Jun 19, 2024
1933ff9
[ci] Limit num gpus if specified for A100 (#5694)
khluu Jun 19, 2024
212ff7b
[Misc] Improve conftest (#5681)
DarkLight1337 Jun 20, 2024
ff63325
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703)
ywang96 Jun 20, 2024
38518ab
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
varun-sundar-rabindranath Jun 20, 2024
492bbcf
[Model] Port over CLIPVisionModel for VLMs (#5591)
ywang96 Jun 20, 2024
1ecb8fa
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
varun-sundar-rabindranath Jun 20, 2024
5ac2be8
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS ke…
tlrmchlsmth Jun 20, 2024
5f45fa3
[Frontend] Add FlexibleArgumentParser to support both underscore and …
mgoin Jun 20, 2024
ecd1d41
[distributed][misc] use fork by default for mp (#5669)
youkaichao Jun 21, 2024
426c751
[Model] MLPSpeculator speculative decoding support (#4947)
JRosenkranz Jun 21, 2024
fe3073f
[Kernel] Add punica dimension for Qwen2 LoRA (#5441)
jinzhen-lin Jun 21, 2024
0b31513
[BugFix] Fix test_phi3v.py (#5725)
CatherineSue Jun 21, 2024
9cbb3b2
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
jeejeelee Jun 21, 2024
c77e009
[Core][Distributed] add shm broadcast (#5399)
youkaichao Jun 21, 2024
7e6f607
[Kernel][CPU] Add Quick `gelu` to CPU (#5717)
ywang96 Jun 21, 2024
84aecf4
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
33ef5ce
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
8059621
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
39cd6f7
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
d21609d
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
eddfce3
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
a8e2142
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
0223d42
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
c86bf89
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
cdcddca
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
6083490
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
0e0e955
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
3e8a070
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
46626dd
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
2b255cd
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
ccaeb0c
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
e0e5db9
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
ca77a04
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
971e0b6
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
9b35cf9
[Speculative Decoding] Support draft model on different tensor-paral…
wooyeonlee0 Jun 25, 2024
b1f3979
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
667f8af
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
f13d9fa
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
f6b5eed
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
88f53e8
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
6873c10
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
a420c75
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improv…
mawong-amd Jun 25, 2024
c329632
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
62c44b6
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
c43d5a6
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
a576089
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
a06a453
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
226da66
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
3732d40
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
3b046b6
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
c2b5da8
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
353584f
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
1d2addf
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
a2b7e0b
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
3b6979e
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
3784c2a
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
1222b26
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
e4bb671
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
41c6083
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
cb6f147
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
5681b46
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
ce99923
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
b5860a4
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
cfa98e5
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
6816d85
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
2bd11ea
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
a13f99d
[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5…
njhill Jun 27, 2024
d7b8ec2
[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849)
njhill Jun 27, 2024
807b752
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)
ywang96 Jun 27, 2024
22b74a7
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
72879ae
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
f1d3956
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
d2de291
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
a9c1105
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
d98dc55
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
bfdb6cb
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
8f1a950
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
b9fc37f
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
f0c65ad
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
f496ed6
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
84b75bf
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
767b2a0
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
ad0a000
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
5f6524e
🚧 push images
prashantgupta24 Jun 28, 2024
1 change: 1 addition & 0 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -17,6 +17,7 @@ steps:
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
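Note: the new priorityClassName only has an effect if the target cluster already defines a PriorityClass with that name; this PR does not add one. A minimal sketch of such an object, with an illustrative priority value, could be applied like this:

  kubectl apply -f - <<'EOF'
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: perf-benchmark        # must match podSpec.priorityClassName above
  value: 1000000                # illustrative priority value, not taken from this PR
  globalDefault: false
  description: "Priority class for vLLM nightly benchmark pods"
  EOF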
14 changes: 14 additions & 0 deletions .buildkite/run-openvino-test.sh
@@ -0,0 +1,14 @@
# This script builds the OpenVINO docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .

# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
40 changes: 35 additions & 5 deletions .buildkite/test-pipeline.yaml
@@ -1,7 +1,10 @@
# In this file, you can add more tests to run either by adding a new step or
# adding a new command to an existing step. See different options here for examples.
# This script will be fed into the Jinja template in `test-template-aws.j2` to generate
# the final pipeline yaml file.

# This script will be fed into the Jinja template in `test-template-aws.j2` at
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# to generate the final pipeline yaml file.


steps:
- label: Regression Test
@@ -24,19 +27,26 @@ steps:

- label: Core Test
mirror_hardwares: [amd]
command: pytest -v -s core
commands:
- pytest -v -s core
- pytest -v -s distributed/test_parallel_state.py

- label: Distributed Comm Ops Test
#mirror_hardwares: [amd]
command: pytest -v -s distributed/test_comm_ops.py
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
- pytest -v -s distributed/test_comm_ops.py
- pytest -v -s distributed/test_shm_broadcast.py

- label: Distributed Tests (2 GPUs)
mirror_hardwares: [amd]
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
# FIXIT: find out which code initializes CUDA before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
@@ -46,7 +56,7 @@ steps:
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py

@@ -55,11 +65,15 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 4
commands:
# FIXIT: find out which code initializes CUDA before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s distributed/test_pynccl.py
# We want to test that models which use 2 GPUs work with 4 GPUs, which is why we duplicate them here.
# See https://github.com/vllm-project/vllm/pull/5473#issuecomment-2166601837 for context.
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

- label: Engine Test
mirror_hardwares: [amd]
@@ -145,6 +159,9 @@ steps:
num_gpus: 4
# This test runs llama 13B, so it is required to run on 4 GPUs.
commands:
# FIXIT: find out which code initializes CUDA before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s -x lora/test_long_context.py

- label: Tensorizer Test
@@ -181,3 +198,16 @@ steps:
commands:
- pip install -r requirements-docs.txt
- SPHINXOPTS=\"-W\" make html

- label: Distributed Tests (A100)
gpu: a100
num_gpus: 4
commands:
# FIXIT: find out which code initializes CUDA before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
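For reference, reproducing one of these distributed-correctness commands outside Buildkite would look roughly like the following sketch, assuming a source checkout with the tests/ directory and at least two GPUs (model choice mirrors the pipeline above):

  cd tests
  export VLLM_WORKER_MULTIPROC_METHOD=spawn   # matches the FIXIT workaround above
  TEST_DIST_MODEL=facebook/opt-125m \
    DISTRIBUTED_EXECUTOR_BACKEND=ray \
    pytest -v -s distributed/test_basic_distributed_correctness.py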
93 changes: 0 additions & 93 deletions .buildkite/test-template-aws.j2

This file was deleted.

2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -5,7 +5,7 @@ on:

push:
branches:
- release
- wip-upstream
paths-ignore:
- "**.md"
- "proto/**"
23 changes: 8 additions & 15 deletions CMakeLists.txt
@@ -2,7 +2,8 @@ cmake_minimum_required(VERSION 3.21)

project(vllm_extensions LANGUAGES CXX)

option(VLLM_TARGET_DEVICE "Target device backend for vLLM" "cuda")
# CUDA by default, can be overridden by using -DVLLM_TARGET_DEVICE=... (used by setup.py)
set(VLLM_TARGET_DEVICE "cuda" CACHE STRING "Target device backend for vLLM")

message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
message(STATUS "Target device: ${VLLM_TARGET_DEVICE}")
@@ -32,8 +33,7 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
# versions are derived from Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.3.0")
set(TORCH_SUPPORTED_VERSION_ROCM_5X "2.0.1")
set(TORCH_SUPPORTED_VERSION_ROCM_6X "2.1.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.4.0")

#
# Try to find python package with an executable that exactly matches
@@ -98,18 +98,11 @@ elseif(HIP_FOUND)
# .hip extension automatically, HIP must be enabled explicitly.
enable_language(HIP)

# ROCm 5.x
if (ROCM_VERSION_DEV_MAJOR EQUAL 5 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM_5X})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM_5X} "
"expected for ROCMm 5.x build, saw ${Torch_VERSION} instead.")
endif()

# ROCm 6.x
if (ROCM_VERSION_DEV_MAJOR EQUAL 6 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM_6X})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM_6X} "
"expected for ROCMm 6.x build, saw ${Torch_VERSION} instead.")
# ROCm 5.X and 6.X
if (ROCM_VERSION_DEV_MAJOR GREATER_EQUAL 5 AND
NOT Torch_VERSION VERSION_EQUAL ${TORCH_SUPPORTED_VERSION_ROCM})
message(WARNING "Pytorch version ${TORCH_SUPPORTED_VERSION_ROCM} "
"expected for ROCm build, saw ${Torch_VERSION} instead.")
endif()
else()
message(FATAL_ERROR "Can't find CUDA or HIP installation.")
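Because VLLM_TARGET_DEVICE is now a CACHE STRING rather than an option(), it can be overridden from outside the project. A hedged sketch of both paths (values are illustrative; the env-var form mirrors the Dockerfile.openvino added later in this PR):

  # let setup.py forward the device choice to CMake
  VLLM_TARGET_DEVICE=openvino python3 -m pip install .

  # or override the cache variable in a manual CMake configure
  cmake -DVLLM_TARGET_DEVICE=cpu -S . -B build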
2 changes: 1 addition & 1 deletion Dockerfile
@@ -172,7 +172,7 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer modelscope
pip install accelerate hf_transfer 'modelscope!=1.15.0'

ENV VLLM_USAGE_SOURCE production-docker-image

26 changes: 26 additions & 0 deletions Dockerfile.openvino
@@ -0,0 +1,26 @@
# The vLLM Dockerfile is used to construct a vLLM image that can be directly used
# to run the OpenAI compatible server.

FROM ubuntu:22.04 AS dev

RUN apt-get update -y && \
apt-get install -y python3-pip git
WORKDIR /workspace

# copy requirements
COPY requirements-build.txt /workspace/vllm/
COPY requirements-common.txt /workspace/vllm/
COPY requirements-openvino.txt /workspace/vllm/

COPY vllm/ /workspace/vllm/vllm
COPY setup.py /workspace/vllm/

# install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
# build vLLM with OpenVINO backend
RUN PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/

COPY examples/ /workspace/vllm/examples
COPY benchmarks/ /workspace/vllm/benchmarks

CMD ["/bin/bash"]
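Building and smoke-testing this image by hand would look roughly like the .buildkite/run-openvino-test.sh script above (the tag name is arbitrary):

  docker build -t openvino-test -f Dockerfile.openvino .
  docker run --rm --network host \
    --env VLLM_OPENVINO_KVCACHE_SPACE=1 \
    openvino-test python3 /workspace/vllm/examples/offline_inference.py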
22 changes: 22 additions & 0 deletions Dockerfile.ppc64le
@@ -0,0 +1,22 @@
FROM mambaorg/micromamba
ARG MAMBA_DOCKERFILE_ACTIVATE=1
USER root

RUN apt-get update -y && apt-get install -y git wget vim numactl gcc-12 g++-12 protobuf-compiler libprotobuf-dev && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

# Some packages in requirements-cpu are installed here
# IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
# Currently these may not be available for venv or pip directly
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 pytorch-cpu=2.1.2 torchvision-cpu=0.16.2 && micromamba clean --all --yes

COPY ./ /workspace/vllm

WORKDIR /workspace/vllm

# These packages will be in rocketce eventually
RUN pip install -v -r requirements-cpu.txt --prefer-binary --extra-index-url https://repo.fury.io/mgiessing

RUN VLLM_TARGET_DEVICE=cpu python3 setup.py install

WORKDIR /vllm-workspace
ENTRYPOINT ["/opt/conda/bin/python3", "-m", "vllm.entrypoints.openai.api_server"]
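A hedged example of building this ppc64le image and starting the OpenAI-compatible server through its entrypoint (model and port are illustrative; any api_server flags can be appended after the image name):

  docker build -t vllm-ppc64le -f Dockerfile.ppc64le .
  docker run --rm -p 8000:8000 vllm-ppc64le \
    --model facebook/opt-125m --port 8000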