♻️ install vllm using wheels (#19)
Update Dockerfile.ubi to build vLLM using wheels! I had to update some
`__init__.py` files so that those packages are picked up when building
the vLLM wheel.
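For context, a quick local sanity check (not part of the PR; the wheel glob and grep target are illustrative) that the packages gaining `__init__.py` files actually land inside the built wheel:

```
# Build the wheel the same way the Dockerfile does, then confirm that the
# newly packaged modules (e.g. vllm/tgis_utils) are included in it.
python3 setup.py bdist_wheel --dist-dir=dist
unzip -l dist/vllm-*.whl | grep tgis_utils
```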

### Integration tests


https://v3.travis.ibm.com/github/ai-foundation/fmaas-inference-server/builds/17962397

Image pushed to quay for testing:
```
quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b
```

<img width="1020" alt="Screenshot 2024-04-23 at 12 18 00"
src="https://github.com/IBM/vllm/assets/9909241/f261bc38-d1f9-4d1a-a5d6-9db14aa362a6">

Useful Travis configuration to run the above integration tests:
```
env:
  global:
    - REMOTE_INTEGRATION_TESTS=true
    - REMOTE_INTEGRATION_TEST_IMAGE=quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b
    - REMOTE_INTEGRATION_TEST_CONFIG=product.vllm
```
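
A quick way to smoke-test the pushed image locally (the entrypoint override and import check are illustrative assumptions, not part of this PR):

```
# Pull the test image and verify that the wheel-installed vllm package imports.
docker pull quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b
docker run --rm --entrypoint python3 \
  quay.io/wxpe/tgis-vllm:release-vllm-wheel.eec7a7b \
  -c "import vllm; print(vllm.__version__)"
```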
---

<details>
<!-- inside this <details> section, markdown rendering does not work, so
we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull
request, please ensure the PR meets the following criteria. This helps
vLLM maintain code quality and improves the efficiency of the review
process.</p>

<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title should be
prefixed appropriately to indicate the type of change. Please use one of
the following:</p>
<ul>
    <li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration
improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing
model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> for changes to the vLLM frontend (e.g., the
OpenAI API server, the <code>LLM</code> class, etc.)</li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other
compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g.,
<code>LLMEngine</code>, <code>AsyncLLMEngine</code>,
<code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes.
Vendor name should appear in the prefix (e.g.,
<code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories.
Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please
include all relevant prefixes.</p>

<h3>Code Quality</h3>

<p>The PR needs to meet the following code quality standards:</p>

<ul>
<li>We adhere to <a
href="https://google.github.io/styleguide/pyguide.html">Google Python
style guide</a> and <a
href="https://google.github.io/styleguide/cppguide.html">Google C++
style guide</a>.</li>
<li>Pass all linter checks. Please use <a
href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a>
to format your code (see the example after this list).</li>
<li>The code needs to be well documented so that future contributors can
easily understand it.</li>
<li>Include sufficient tests to ensure the project stays correct and
robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR
modifies user-facing behavior of vLLM. This helps vLLM users understand
and use the new features or changes.</li>
</ul>
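
<p>For reference, a minimal sketch of running the checks locally before
pushing (assumes you are in the repository root with a working Python
environment; the exact checks are whatever <code>format.sh</code>
configures):</p>

<pre><code># install the linters/formatters used by the script, then run it
pip install -r requirements-dev.txt
bash format.sh
</code></pre>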

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major
architectural changes (>500 LOC excluding kernel/data/config/test), we
would expect a GitHub issue (RFC) discussing the technical design and
justification. Otherwise, we will tag the PR with
<code>rfc-required</code> and might not review it.</p>

<h3>What to Expect for the Reviews</h3>

<p>The goal of the vLLM team is to be a <i>transparent reviewing
machine</i>. We would like to make the review process transparent and
efficient and make sure no contributor feels confused or frustrated.
However, the vLLM team is small, so we need to prioritize some PRs over
others. Here is what you can expect from the review process:</p>

<ul>
<li> After the PR is submitted, it will be assigned to a reviewer. Every
reviewer picks up PRs based on their expertise and availability.</li>
<li> After the PR is assigned, the reviewer will provide a status update
every 2-3 days. If the PR is not reviewed within 7 days, please feel
free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an
<code>action-required</code> label on the PR if changes are required.
The contributor should address the comments and ping the reviewer to
re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a
comment isn't clear or you disagree with a suggestion, feel free to ask
for clarification or discuss it.</li>
</ul>

<h3>Thank You</h3>

<p> Finally, thank you for taking the time to read these guidelines and
for your interest in contributing to vLLM. Your contributions make vLLM
a great tool for everyone! </p>


</details>

---------

Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
prashantgupta24 authored May 3, 2024
1 parent 8c548e4 commit 3bc3053
Showing 8 changed files with 109 additions and 103 deletions.
173 changes: 71 additions & 102 deletions Dockerfile.ubi
@@ -1,3 +1,6 @@
# Please update any changes made here to
# docs/source/dev/dockerfile-ubi/dockerfile-ubi.rst

## Global Args #################################################################
ARG BASE_UBI_IMAGE_TAG=9.3-1612
ARG PYTHON_VERSION=3.11
@@ -39,25 +42,6 @@ RUN curl -fsSL -o ~/miniforge3.sh -O "https://github.com/conda-forge/miniforge/
# use of the /opt/vllm env requires:
# ENV PATH=/opt/vllm/bin/:$PATH


## Python Base #################################################################
FROM base as python-base

COPY --from=python-install --link /opt/vllm /opt/vllm

ENV PATH=/opt/vllm/bin/:$PATH


## Python/Torch Base ###########################################################
FROM python-base as python-torch-base

ARG PYTORCH_INDEX
ARG PYTORCH_VERSION

RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install torch==$PYTORCH_VERSION+cu121 --index-url "${PYTORCH_INDEX}/cu121"


## CUDA Base ###################################################################
FROM base as cuda-base

@@ -129,12 +113,22 @@ ENV LIBRARY_PATH="$CUDA_HOME/lib64/stubs"
# or future versions of triton.
RUN ldconfig /usr/local/cuda-12.2/compat/

## Development #################################################################
FROM cuda-devel AS dev
## Python cuda base #################################################################
FROM cuda-devel as python-cuda-base

COPY --from=python-torch-base --link /opt/vllm /opt/vllm
COPY --from=python-install --link /opt/vllm /opt/vllm
ENV PATH=/opt/vllm/bin/:$PATH

# install cuda and common dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements-common.txt,target=requirements-common.txt \
--mount=type=bind,source=requirements-cuda.txt,target=requirements-cuda.txt \
pip3 install \
-r requirements-cuda.txt

## Development #################################################################
FROM python-cuda-base AS dev

# install build and runtime dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements-common.txt,target=requirements-common.txt \
@@ -144,6 +138,38 @@ RUN --mount=type=cache,target=/root/.cache/pip \
-r requirements-cuda.txt \
-r requirements-dev.txt

## Proto Compilation ###########################################################
FROM python-install AS gen-protos

ENV PATH=/opt/vllm/bin/:$PATH

RUN microdnf install -y \
make \
findutils \
&& microdnf clean all

RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=Makefile,target=Makefile \
--mount=type=bind,source=proto,target=proto \
make gen-protos

## Extension Cache #############################################################
# Instead of compiling artifacts every build just copy from pre-built wheel
# This might not work if the PyTorch and CUDA versions don't match!
FROM base as prebuilt-wheel

RUN microdnf install -y \
unzip \
&& microdnf clean all

ARG PYTHON_VERSION
# 0.4.1 is built for CUDA 12.1 and PyTorch 2.1.2
ARG VLLM_WHEEL_VERSION=0.4.1

RUN curl -Lo vllm.whl https://github.com/vllm-project/vllm/releases/download/v${VLLM_WHEEL_VERSION}/vllm-${VLLM_WHEEL_VERSION}-cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}-manylinux1_x86_64.whl \
&& unzip vllm.whl \
&& rm vllm.whl
# compiled extensions located at /workspace/vllm/*.so

## Builder #####################################################################
FROM dev AS build
@@ -179,26 +205,22 @@ ENV VLLM_INSTALL_PUNICA_KERNELS=1
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

RUN python3 setup.py build_ext --inplace

# Copy the entire directory before building wheel
COPY --link vllm vllm

## Extension Cache #############################################################
# Instead of compiling artifacts every build just copy from pre-built wheel
# This might not work if the PyTorch and CUDA versions don't match!
FROM base as prebuilt-wheel
# Comment if building *.so files from scratch
##################################################
# Copy the prebuilt *.so files
COPY --from=prebuilt-wheel --link /workspace/vllm/*.so /workspace/vllm/
##################################################

RUN microdnf install -y \
unzip \
&& microdnf clean all

ARG PYTHON_VERSION
# 0.4.0.post1 is built for CUDA 12.1 and PyTorch 2.1.2
ARG VLLM_WHEEL_VERSION=0.4.0.post1
# Copy over the generated *.pb2 files
COPY --from=gen-protos --link /workspace/vllm/entrypoints/grpc/pb vllm/entrypoints/grpc/pb

RUN curl -Lo vllm.whl https://github.com/vllm-project/vllm/releases/download/v${VLLM_WHEEL_VERSION}/vllm-${VLLM_WHEEL_VERSION}-cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}-manylinux1_x86_64.whl \
&& unzip vllm.whl \
&& rm vllm.whl
# compiled extensions located at /workspace/vllm/*.so
ENV CCACHE_DIR=/root/.cache/ccache
RUN --mount=type=cache,target=/root/.cache/ccache \
--mount=type=cache,target=/root/.cache/pip \
python3 setup.py bdist_wheel --dist-dir=dist

#################### FLASH_ATTENTION Build IMAGE ####################
FROM dev as flash-attn-builder
@@ -219,58 +241,6 @@ WORKDIR /usr/src/flash-attention-v2
RUN pip --verbose wheel flash-attn==${FLASH_ATTN_VERSION} \
--no-build-isolation --no-deps --no-cache-dir


## Test ########################################################################
FROM dev AS test

WORKDIR /vllm-workspace
# ADD is used to preserve directory structure
# NB: Could leak secrets from local context, the test image should not be pushed
# to a registry
ADD . /vllm-workspace/
# copy pytorch extensions separately to avoid having to rebuild
# when python code changes
COPY --from=build /workspace/vllm/*.so /vllm-workspace/vllm/
# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
pip install /usr/src/flash-attention-v2/*.whl --no-cache-dir
# ignore build dependencies installation because we are using pre-complied extensions
RUN rm pyproject.toml
RUN --mount=type=cache,target=/root/.cache/pip \
VLLM_USE_PRECOMPILED=1 pip install . --verbose


## Proto Compilation ###########################################################
FROM python-base AS gen-protos

RUN microdnf install -y \
make \
findutils \
&& microdnf clean all

RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=Makefile,target=Makefile \
--mount=type=bind,source=proto,target=proto \
make gen-protos

## vLLM Library Files ##########################################################
# Little extra stage to gather files and manage permissions on them without any
# duplication in the release layer due to permission changes
FROM base AS vllm

WORKDIR /vllm-staging
# COPY files from various places into a staging directory
COPY --link vllm vllm
COPY --from=build --link /workspace/vllm/*.so vllm/
COPY --from=gen-protos --link /workspace/vllm/entrypoints/grpc/pb vllm/entrypoints/grpc/pb

# custom COPY command to use umask to control permissions and grant permissions
# to the group
RUN umask 002 \
&& cp --recursive --no-preserve=all /vllm-staging/vllm /workspace/vllm \
# not strictly needed, but .so files typically have executable bits
&& chmod +x /workspace/vllm/*.so

## Release #####################################################################
# Note from the non-UBI Dockerfile:
# We used base cuda image because pytorch installs its own cuda libraries.
@@ -281,28 +251,27 @@ FROM cuda-runtime AS vllm-openai
WORKDIR /workspace

# Create release python environment
COPY --from=python-torch-base --link /opt/vllm /opt/vllm
COPY --from=python-cuda-base --link /opt/vllm /opt/vllm
ENV PATH=/opt/vllm/bin/:$PATH

# install vllm wheel first, so that torch etc will be installed
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/workspace/dist \
--mount=type=cache,target=/root/.cache/pip \
pip install dist/*.whl --verbose

# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
pip install /usr/src/flash-attention-v2/*.whl --no-cache-dir

RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements-common.txt,target=requirements-common.txt \
--mount=type=bind,source=requirements-cuda.txt,target=requirements-cuda.txt \
pip3 install \
-r requirements-cuda.txt \
# additional dependencies for the TGIS gRPC server
grpcio-tools==1.62.1 \
# additional dependencies for openai api_server
accelerate==0.28.0 \
# hf_transfer for faster HF hub downloads
hf_transfer==0.1.6

# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
pip3 install /usr/src/flash-attention-v2/*.whl --no-cache-dir

# vLLM will not be installed in site-packages
COPY --from=vllm --link /workspace/ ./

# Triton needs a CC compiler
RUN microdnf install -y gcc \
&& microdnf clean all
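As an aside, a small shell sketch (values are illustrative) of the parameter expansion the prebuilt-wheel stage above uses to derive the wheel filename from the Python version:

```
# ${PYTHON_VERSION//.} deletes the dots, turning 3.11 into the cp311 ABI tag
PYTHON_VERSION=3.11
VLLM_WHEEL_VERSION=0.4.1
echo "vllm-${VLLM_WHEEL_VERSION}-cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}-manylinux1_x86_64.whl"
# -> vllm-0.4.1-cp311-cp311-manylinux1_x86_64.whl
```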
(One changed file could not be displayed.)
35 changes: 35 additions & 0 deletions docs/source/dev/dockerfile-ubi/dockerfile-ubi.rst
@@ -0,0 +1,35 @@
Dockerfile-ubi
====================

- Visualization of the multi-stage Dockerfile.ubi

.. figure:: ../../assets/dev/dockerfile-ubi-dependency-graph.png
:alt: query
:width: 100%
:align: center

Made using: https://github.com/patrickhoefler/dockerfilegraph

Commands to regenerate it:

.. code:: bash

    dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 -f Dockerfile.ubi

or in case you want to run it directly with the docker image:

.. code:: bash

    docker run \
        --rm \
        --user "$(id -u):$(id -g)" \
        --workdir /workspace \
        --volume "$(pwd)":/workspace \
        ghcr.io/patrickhoefler/dockerfilegraph:alpine \
        --output png \
        --dpi 200 \
        --max-label-length 50 \
        --legend \
        -f Dockerfile.ubi
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -101,6 +101,7 @@ Documentation
dev/sampling_params
dev/engine/engine_index
dev/kernel/paged_attention
dev/dockerfile-ubi/dockerfile-ubi

Indices and tables
==================
Empty file.
3 changes: 2 additions & 1 deletion vllm/entrypoints/grpc/pb/.gitignore
@@ -1,3 +1,4 @@
*.py
*.py-e
*.pyi
*.pyi
!__init__.py
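
For context, a hedged way to verify the negation pattern above (the file names are illustrative): generated stubs stay ignored while `__init__.py` remains trackable.

```
# git check-ignore exits 0 for ignored paths and non-zero otherwise.
git check-ignore -v vllm/entrypoints/grpc/pb/generation_pb2.py     # matched by *.py
git check-ignore -v vllm/entrypoints/grpc/pb/__init__.py \
  || echo "__init__.py is not ignored"
```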
Empty file.
Empty file added vllm/tgis_utils/__init__.py
Empty file.
