Merge branch 'inference' into legion_workflow
DerrickYLJ authored Sep 10, 2023
2 parents 345aeb9 + 4adad7d commit 8daa0f1
Showing 160 changed files with 5,786 additions and 1,079 deletions.
29 changes: 22 additions & 7 deletions .github/README.md
@@ -6,8 +6,9 @@

## News🔥:

* [09/02/2023] Adding AMD GPU support, released Docker images for ROCM 5.3->5.6
* [08/16/2023] Adding StarCoder model support
* [08/14/2023] Released Dockerfile for different CUDA versions
* [08/14/2023] Released Docker images for different CUDA versions

## What is FlexFlow Serve

@@ -42,13 +43,13 @@ pip install flexflow
```

### Try it in Docker
If you run into any issue during the install, or if you would like to use the C++ API without needing to install from source, you can also use our pre-built Docker package for different CUDA versions and the `hip_rocm` backend. To download and run our pre-built Docker container:
If you run into any issue during the install, or if you would like to use the C++ API without needing to install from source, you can also use our pre-built Docker package for different CUDA versions (NVIDIA backend) and multiple ROCM versions (AMD backend). To download and run our pre-built Docker container:

```bash
docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-11.8:latest
docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-12.0:latest
```

To download a Docker container for a backend other than CUDA v11.8, you can replace the `cuda-11.8` suffix with any of the following backends: `cuda-11.1`, `cuda-11.2`, `cuda-11.3`, `cuda-11.5`, `cuda-11.6`, `cuda-11.7`, or `hip_rocm`. More info on the Docker images, with instructions to build a new image from source or run with additional configurations, can be found [here](../docker/README.md).
To download a Docker container for a backend other than CUDA v12.0, you can replace the `cuda-12.0` suffix with any of the following backends: `cuda-11.1`, `cuda-11.2`, `cuda-11.3`, `cuda-11.4`, `cuda-11.5`, `cuda-11.6`, `cuda-11.7`, `cuda-11.8`, `hip_rocm-5.3`, `hip_rocm-5.4`, `hip_rocm-5.5`, or `hip_rocm-5.6`. More info on the Docker images, with instructions to build a new image from source or run with additional configurations, can be found [here](../docker/README.md).
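For example, a hedged pull of the AMD image instead (the `hip_rocm-5.6` tag is taken from the list above; note that `--gpus all` is NVIDIA-specific, so the GPU passthrough flags for AMD devices may differ):

```bash
docker run -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-hip_rocm-5.6:latest
```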

### Build from source

@@ -209,7 +210,7 @@ Below is a list of models that we have explicitly tested and for which a SSM may
| StarCoder-15.5B | bigcode/starcoder | |

### CPU Offloading
FlexFlow Serve also offers offloading-based inference for running large models (e.g., llama-7B) on a single GPU. CPU offloading keeps selected tensors in CPU memory and copies them to the GPU only when they are needed for a computation. Currently, we selectively offload only the largest weight tensors (the weight tensors of the Linear and Attention layers). Since the small model occupies considerably less space and does not pose a bottleneck for GPU memory, while offloading would add extra transfer and computation overhead, we only offload the large model. You can run the offloading example by enabling the `-offload` and `-offload-reserve-space-size` flags.
FlexFlow Serve also offers offloading-based inference for running large models (e.g., llama-7B) on a single GPU. CPU offloading keeps selected tensors in CPU memory and copies them to the GPU only when they are needed for a computation. Currently, we selectively offload only the largest weight tensors (the weight tensors of the Linear and Attention layers). Since the small model occupies considerably less space and does not pose a bottleneck for GPU memory, while offloading would add extra transfer and computation overhead, we only offload the large model. [TODO: update instructions] You can run the offloading example by enabling the `-offload` and `-offload-reserve-space-size` flags.
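A minimal sketch of enabling offloading (the executable path and the reserve-space value below are illustrative assumptions; only the two flags themselves come from this section):

```bash
# Illustrative only: the binary path and the reserve size are assumptions,
# not values documented in this README.
./inference/incr_decoding/incr_decoding \
    -offload -offload-reserve-space-size 8192
```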

### Quantization
FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored on the CPU side. Once copied to the GPU, these tensors undergo decompression and conversion back to their original precision. Please find the compressed weight files in our s3 bucket, or use [this script](../inference/utils/compress_llama_weights.py) from [FlexGen](https://github.com/FMInference/FlexGen) project to do the compression manually.
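For instance, a hypothetical manual compression run (the script path comes from the link above; its actual command-line arguments are not documented here, so consult the script before running):

```bash
# Hypothetical invocation: check inference/utils/compress_llama_weights.py
# for its real arguments before running.
python inference/utils/compress_llama_weights.py
```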
@@ -221,10 +222,24 @@ We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruct

FlexFlow Serve is under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions.

* AMD support. We are actively working on supporting FlexFlow Serve on AMD GPUs and welcome any contributions to this effort.
* AMD benchmarking. We are actively working on benchmarking FlexFlow Serve on AMD GPUs and comparing it with the performance on NVIDIA GPUs.
* Chatbot prompt templates and Multi-round conversations
* Support for FastAPI server
* Integration with LangChain for document question answering

## Acknowledgements
This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting FlexFlow Serve.
This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as:

``` bibtex
@misc{miao2023specinfer,
title={SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification},
author={Xupeng Miao and Gabriele Oliaro and Zhihao Zhang and Xinhao Cheng and Zeyu Wang and Rae Ying Yee Wong and Alan Zhu and Lijie Yang and Xiaoxiang Shi and Chunan Shi and Zhuoming Chen and Daiyaan Arfeen and Reyna Abhyankar and Zhihao Jia},
year={2023},
eprint={2305.09781},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

## License
FlexFlow uses Apache License 2.0.
8 changes: 3 additions & 5 deletions .github/workflows/build.yml
@@ -67,7 +67,7 @@ jobs:
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: flexflow
environment-file: conda/environment.yml
environment-file: conda/flexflow.yml
auto-activate-base: false

- name: Build FlexFlow
@@ -131,15 +131,14 @@ jobs:
cd build
./tests/unit/unit-test
- name: Check availability of Python flexflow.core module
- name: Check availability of flexflow modules in Python
run: |
if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then
export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH"
fi
# Remove build folder to check that the installed version can run independently of the build files
rm -rf build
export CPU_ONLY_TEST=1
python -c "import flexflow.core; exit()"
python -c "import flexflow.core; import flexflow.serve as ff; exit()"
makefile-build:
name: Build FlexFlow with the Makefile
@@ -186,5 +185,4 @@ jobs:
cd python
make -j $n_build_cores
export CPU_ONLY_TEST=1
python -c 'import flexflow.core'
7 changes: 4 additions & 3 deletions .github/workflows/docker-build.yml
@@ -63,6 +63,7 @@ jobs:
cuda_version: ${{ matrix.gpu_backend_version }}
hip_version: ${{ matrix.gpu_backend_version }}
branch_name: ${{ github.head_ref || github.ref_name }}
timeout-minutes: 480
steps:
- name: Checkout Git Repository
uses: actions/checkout@v3
@@ -100,17 +101,17 @@ jobs:
echo "Skipping build to save time"
fi
- name: Check availability of Python flexflow.core module
- name: Check availability of flexflow modules in Python
if: ${{ matrix.gpu_backend == 'cuda' }}
env:
deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' ) && env.branch_name == 'inference' }}
build_needed: ${{ ( matrix.gpu_backend == 'hip_rocm' && matrix.gpu_backend_version == '5.6' ) || ( matrix.gpu_backend == 'cuda' && matrix.gpu_backend_version == '11.8' ) }}
run: |
if [[ $deploy_needed == "true" || $build_needed == "true" ]]; then
if [[ $FF_GPU_BACKEND == "cuda" ]]; then
docker run --env CPU_ONLY_TEST=1 --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH; sudo ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1; python -c 'import flexflow.core; exit()'"
docker run --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH; sudo ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1; python -c 'import flexflow.core; import flexflow.serve as ff; exit()'"
else
docker run --env CPU_ONLY_TEST=1 --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "python -c 'import flexflow.core; exit()'"
docker run --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "python -c 'import flexflow.core; import flexflow.serve as ff; exit()'"
fi
else
echo "Skipping test to save time"
2 changes: 1 addition & 1 deletion .github/workflows/helpers/install_cudnn.sh
@@ -44,7 +44,7 @@ elif [[ "$cuda_version" == "11.7" ]]; then
elif [[ "$cuda_version" == "11.8" ]]; then
CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
CUDNN_TARBALL_NAME=cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
elif [[ "$cuda_version" == "11.8" ]]; then
elif [[ "$cuda_version" == "12.0" ]]; then
echo "CUDNN support for CUDA version 12.0 not yet added"
exit 1
fi
4 changes: 3 additions & 1 deletion .github/workflows/helpers/install_dependencies.sh
@@ -7,7 +7,7 @@ cd "${BASH_SOURCE[0]%/*}"

# General dependencies
echo "Installing apt dependencies..."
sudo apt-get update && sudo apt-get install -y --no-install-recommends wget binutils git zlib1g-dev libhdf5-dev && \
sudo apt-get update && sudo apt-get install -y --no-install-recommends wget binutils git zlib1g-dev libhdf5-dev jq && \
sudo rm -rf /var/lib/apt/lists/*

FF_GPU_BACKEND=${FF_GPU_BACKEND:-"cuda"}
@@ -20,6 +20,8 @@ fi
if [[ "$FF_GPU_BACKEND" == "cuda" || "$FF_GPU_BACKEND" = "hip_cuda" ]]; then
# Install CUDNN
./install_cudnn.sh
# Install NCCL
./install_nccl.sh
fi
# Install HIP dependencies if needed
if [[ "$FF_GPU_BACKEND" == "hip_cuda" || "$FF_GPU_BACKEND" = "hip_rocm" ]]; then
51 changes: 51 additions & 0 deletions .github/workflows/helpers/install_nccl.sh
@@ -0,0 +1,51 @@
#!/bin/bash
set -euo pipefail
set -x

# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"

# Add NCCL key ring
ubuntu_version=$(lsb_release -rs)
ubuntu_version=${ubuntu_version//./}
wget "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.0-1_all.deb"
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update -y
rm -f cuda-keyring_1.0-1_all.deb

# Install NCCL
cuda_version=${1:-11.8.0}
cuda_version=$(echo "${cuda_version}" | cut -f1,2 -d'.')
echo "Installing NCCL for CUDA version: ${cuda_version} ..."

# We need to run a different install command based on the CUDA version, otherwise running `sudo apt install libnccl2 libnccl-dev`
# will automatically upgrade CUDA to the latest version.

if [[ "$cuda_version" == "11.0" ]]; then
sudo apt install libnccl2=2.15.5-1+cuda11.0 libnccl-dev=2.15.5-1+cuda11.0
elif [[ "$cuda_version" == "11.1" ]]; then
sudo apt install libnccl2=2.8.4-1+cuda11.1 libnccl-dev=2.8.4-1+cuda11.1
elif [[ "$cuda_version" == "11.2" ]]; then
sudo apt install libnccl2=2.8.4-1+cuda11.2 libnccl-dev=2.8.4-1+cuda11.2
elif [[ "$cuda_version" == "11.3" ]]; then
sudo apt install libnccl2=2.9.9-1+cuda11.3 libnccl-dev=2.9.9-1+cuda11.3
elif [[ "$cuda_version" == "11.4" ]]; then
sudo apt install libnccl2=2.11.4-1+cuda11.4 libnccl-dev=2.11.4-1+cuda11.4
elif [[ "$cuda_version" == "11.5" ]]; then
sudo apt install libnccl2=2.11.4-1+cuda11.5 libnccl-dev=2.11.4-1+cuda11.5
elif [[ "$cuda_version" == "11.6" ]]; then
sudo apt install libnccl2=2.12.12-1+cuda11.6 libnccl-dev=2.12.12-1+cuda11.6
elif [[ "$cuda_version" == "11.7" ]]; then
sudo apt install libnccl2=2.14.3-1+cuda11.7 libnccl-dev=2.14.3-1+cuda11.7
elif [[ "$cuda_version" == "11.8" ]]; then
sudo apt install libnccl2=2.16.5-1+cuda11.8 libnccl-dev=2.16.5-1+cuda11.8
elif [[ "$cuda_version" == "12.0" ]]; then
sudo apt install libnccl2=2.18.3-1+cuda12.0 libnccl-dev=2.18.3-1+cuda12.0
elif [[ "$cuda_version" == "12.1" ]]; then
sudo apt install libnccl2=2.18.3-1+cuda12.1 libnccl-dev=2.18.3-1+cuda12.1
elif [[ "$cuda_version" == "12.2" ]]; then
sudo apt install libnccl2=2.18.3-1+cuda12.2 libnccl-dev=2.18.3-1+cuda12.2
else
echo "Installing NCCL for CUDA version ${cuda_version} is not supported"
exit 1
fi
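For reference, a hedged usage sketch of the new helper (per the `cuda_version=${1:-11.8.0}` line above, the first argument selects the CUDA version and defaults to 11.8.0):

```bash
# Pin NCCL to the CUDA 12.0 packages; the script cd's into its own directory,
# so it can be invoked from the repository root.
bash .github/workflows/helpers/install_nccl.sh 12.0.0
```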
5 changes: 2 additions & 3 deletions .github/workflows/pip-install.yml
@@ -69,9 +69,8 @@ jobs:
# Remove build folder to check that the installed version can run independently of the build files
rm -rf build
- name: Check availability of Python flexflow.core module
- name: Check availability of flexflow modules in Python
run: |
export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH"
sudo ln -s "$CUDA_PATH/lib64/stubs/libcuda.so" "$CUDA_PATH/lib64/stubs/libcuda.so.1"
export CPU_ONLY_TEST=1
python -c "import flexflow.core; exit()"
python -c 'import flexflow.core; import flexflow.serve as ff; exit()'
77 changes: 61 additions & 16 deletions CMakeLists.txt
@@ -1,6 +1,7 @@
cmake_minimum_required(VERSION 3.10)
project(FlexFlow)


include(ExternalProject)

# Set policy CMP0074 to eliminate cmake warnings
@@ -175,10 +176,6 @@ endif()
# option for nccl
option(FF_USE_NCCL "Run FlexFlow with NCCL" OFF)

if (FF_GPU_BACKEND STREQUAL "hip_rocm" AND FF_USE_NCCL STREQUAL "ON")
message(FATAL_ERROR "NCCL: ON for FF_GPU_BACKEND: hip_rocm. hip_rocm backend must have NCCL disabled.")
endif()

# option for avx2
option(FF_USE_AVX2 "Run FlexFlow with AVX2" OFF)

@@ -226,9 +223,6 @@ if(FF_GPU_BACKEND STREQUAL "hip_cuda" OR FF_GPU_BACKEND STREQUAL "hip_rocm")
set(ROCM_PATH "/opt/rocm" CACHE STRING "Default ROCM installation directory.")
endif()

# ZLIB
include(zlib)

# CUDA
if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda")
include(cuda)
@@ -244,6 +238,18 @@ if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda")
include(cudnn)
endif()


# NCCL
if(FF_USE_NCCL)
if(FF_GPU_BACKEND STREQUAL "hip_cuda" OR FF_GPU_BACKEND STREQUAL "cuda")
include(nccl)
endif()
list(APPEND FF_CC_FLAGS
-DFF_USE_NCCL)
list(APPEND FF_NVCC_FLAGS
-DFF_USE_NCCL)
endif()

# Inference tests
if(INFERENCE_TESTS)
list(APPEND FF_CC_FLAGS
@@ -376,11 +382,26 @@ if(NOT BUILD_LEGION_ONLY)
LIST_DIRECTORIES False
${FLEXFLOW_ROOT}/src/*.cpp)

if(BUILD_SHARED_LIBS)
add_library(flexflow SHARED ${FLEXFLOW_GPU_SRC} ${FLEXFLOW_SRC})
else()
add_library(flexflow STATIC ${FLEXFLOW_GPU_SRC} ${FLEXFLOW_SRC})
target_include_directories(hip_device_nvidia SYSTEM INTERFACE ${HIP_INCLUDE_DIRS} ${ROCM_PATH}/include)
target_include_directories(hip_device_nvidia INTERFACE ${HIP_INCLUDE_DIRS} ${ROCM_PATH}/include)

add_compile_definitions(FF_USE_HIP_CUDA)

# Linking cuda:
# We do not explicitly link cuda. hipcc when targeting nvidia will
# use nvcc under the hood. nvcc when used for linking will handle
# linking cuda dependencies
target_link_libraries(flexflow hip_device_nvidia)
elseif(FF_GPU_BACKEND STREQUAL "hip_rocm")
find_package(hipblas REQUIRED)
find_package(miopen REQUIRED)
if(FF_USE_NCCL)
find_package(rccl REQUIRED)
endif()
# find_package(rocrand REQUIRED)
find_library(HIP_RAND_LIBRARY hiprand REQUIRED)

add_compile_definitions(FF_USE_HIP_ROCM)

list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH}/hip ${ROCM_PATH})

@@ -440,14 +461,38 @@ if(NOT BUILD_LEGION_ONLY)
# https://rocmdocs.amd.com/en/latest/Installation_Guide/Using-CMake-with-AMD-ROCm.html
target_link_libraries(flexflow hip::device roc::hipblas MIOpen ${HIP_RAND_LIBRARY})
endif()
else()
message(FATAL_ERROR "Unsupported FF_GPU_BACKEND for cmake: ${FF_GPU_BACKEND}")
endif()

if(FF_USE_NCCL)
add_dependencies(flexflow ${NCCL_NAME})
set_property(TARGET flexflow PROPERTY HIP_ARCHITECTURES "${HIP_ARCH_LIST}")

message(STATUS "FF_GPU_BACKEND: ${FF_GPU_BACKEND}")
message(STATUS "FF_HIP_ARCH: ${FF_HIP_ARCH}")
message(STATUS "HIP_ARCH_LIST: ${HIP_ARCH_LIST}")
get_property(CHECK_HIP_ARCHS TARGET flexflow PROPERTY HIP_ARCHITECTURES)
message(STATUS "CHECK_HIP_ARCHS: ${CHECK_HIP_ARCHS}")
message(STATUS "HIP_CLANG_PATH: ${HIP_CLANG_PATH}")

# The hip cmake config module defines three targets,
# hip::amdhip64, hip::host, and hip::device.
#
# hip::host and hip::device are interface targets. hip::amdhip64 is an
# imported target for libamdhip.
#
# You do not directly link to hip::amdhip64. hip::host links to hip::amdhip64
# and hip::device links to hip::host. Link to hip::host to just use hip without
# compiling any GPU code. Link to hip::device to compile the GPU device code.
#
# Docs (outdated):
# https://rocmdocs.amd.com/en/latest/Installation_Guide/Using-CMake-with-AMD-ROCm.html
target_link_libraries(flexflow hip::device roc::hipblas MIOpen ${HIP_RAND_LIBRARY})
if(FF_USE_NCCL)
target_link_libraries(flexflow rccl)
endif()
endif()

if(FF_USE_NCCL AND (FF_GPU_BACKEND STREQUAL "hip_cuda" OR FF_GPU_BACKEND STREQUAL "cuda"))
add_dependencies(flexflow ${NCCL_NAME})
endif()

target_include_directories(flexflow PUBLIC ${FLEXFLOW_INCLUDE_DIRS})
# LEGION_URL is defined if we found a precompiled Legion library to download
if(LEGION_URL)
4 changes: 2 additions & 2 deletions INSTALL.md
@@ -30,7 +30,7 @@ If you are planning to build the Python interface, you will need to install seve

The `conda` environment can be created and activated as:
```
conda env create -f conda/environment.yml
conda env create -f conda/flexflow.yml
conda activate flexflow
```

@@ -42,7 +42,7 @@ You can configure a FlexFlow build by running the `config/config.linux` file in
3. `FF_CUDA_ARCH` is used to set the architecture of the targeted GPUs; for example, the value can be 60 if the GPU architecture is Pascal. To build for more than one architecture, pass a list of comma-separated values (e.g. `FF_CUDA_ARCH=70,75`). To compile FlexFlow for all GPU architectures that are detected on the machine, pass `FF_CUDA_ARCH=autodetect` (this is the default value, so you can also leave `FF_CUDA_ARCH` unset). If you want to build for all GPU architectures compatible with FlexFlow, pass `FF_CUDA_ARCH=all`. **If your machine does not have any GPU, you have to set FF_CUDA_ARCH to at least one valid architecture code (or `all`)**, since the compiler won't be able to detect the architecture(s) automatically.
4. `FF_USE_PYTHON` controls whether to build the FlexFlow Python interface.
5. `FF_USE_NCCL` controls whether to build FlexFlow with NCCL support. By default, it is set to ON.
6. `FF_LEGION_NETWORKS` is used to enable distributed runs of FlexFlow. If you want to run FlexFlow on multiple nodes, follow the instructions in [MULTI-NODE.md](MULTI-NODE.md) and set the corresponding parameters as follows:
6. `FF_LEGION_NETWORKS` is used to enable distributed runs of FlexFlow. If you want to run FlexFlow on multiple nodes, follow the instructions in the [Multinode tutorial](https://flexflow.readthedocs.io/en/latest/multinode.html) and set the corresponding parameters as follows:
* To build FlexFlow with GASNet, set `FF_LEGION_NETWORKS=gasnet` and `FF_GASNET_CONDUIT` as a specific conduit (e.g. `ibv`, `mpi`, `udp`, `ucx`) in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
* To build FlexFlow with native UCX, set `FF_LEGION_NETWORKS=ucx` in `config/config.linux` when configuring the FlexFlow build. Set `FF_UCX_URL` when you want to customize the URL to download UCX.
8. `FF_BUILD_EXAMPLES` controls whether to build all C++ example programs.
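A hedged sketch of how these variables might be set when invoking `config/config.linux` (the variable names come from the list above; the values and the environment-variable style of invocation are illustrative assumptions):

```bash
# Example configuration for a CUDA build with NCCL and GASNet networking
# (all values are illustrative only):
FF_CUDA_ARCH=70,75 \
FF_USE_PYTHON=ON \
FF_USE_NCCL=ON \
FF_LEGION_NETWORKS=gasnet \
FF_GASNET_CONDUIT=ibv \
./config/config.linux
```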
