[Inference] - Cleanup/refactor #732

Merged: 29 commits, May 21, 2023
Changes from all commits
21 changes: 14 additions & 7 deletions .github/README.md
@@ -28,17 +28,21 @@ for serving generative LLMs while provably preserving model quality.
<img src="../img/performance.png" alt="Performance comparison" height="320"/>
</p>

## Install SpecInfer
SpecInfer is built on top of FlexFlow. You can install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](INSTALL.md) for installing FlexFlow from source code. If you would like to quickly try SpecInfer, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend, [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCM backend) with all dependencies pre-installed (N.B.: currently, the CUDA pre-built containers are only fully compatible with host machines that have CUDA 11.7 installed), together with [Dockerfiles](./docker) if you wish to build the containers manually.
## Build/Install SpecInfer
SpecInfer is built on top of FlexFlow. You can build/install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](../INSTALL.md) for building/installing FlexFlow from source code. If you would like to quickly try SpecInfer, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend, [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCM backend) with all dependencies pre-installed (N.B.: currently, the CUDA pre-built containers are only fully compatible with host machines that have CUDA 11.7 installed), together with [Dockerfiles](./docker) if you wish to build the containers manually.
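For example, a minimal sketch of pulling the pre-built CUDA container, assuming the image is published under the `ghcr.io/flexflow` namespace implied by the package link above (verify the exact image name and tag on the registry page):

```bash
# Assumed image name and tag based on the package link above; confirm them on
# the GitHub Container Registry page before pulling.
docker pull ghcr.io/flexflow/flexflow-cuda:latest
```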

## Run SpecInfer
The source code of the SpecInfer pipeline is available at [this folder](../inference/spec_infer/). The SpecInfer executable will be available at `/build_dir/inference/spec_infer/spec_infer` after compilation. You can use the following command-line arguments to run SpecInfer:

* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. SpecInfer keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
* `-llm-model`: the LLM model type as a case-insensitive string (e.g. "opt" or "llama")
* `-llm-weight`: path to the folder that stores the LLM weights
* `-ssm-weight`: path to the folder that stores the small speculative models' weights. You can use multiple `-ssm-weight`s in the command line to launch multiple SSMs.
* `-llm-config`: path to the json file that stores the LLM model configs
* `-ssm-model`: the SSM model type as a case-insensitive string (e.g. "opt" or "llama"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-ssm-weight`: path to the folder that stores the small speculative models' weights. The number of `-ssm-weight`s must match the number of `-ssm-model`s and `-ssm-config`s.
* `-ssm-config`: path to the json file that stores the SSM model configs. The number of `-ssm-config`s must match the number of `-ssm-model`s and `-ssm-weight`s.
* `-tokenizer`: path to the tokenizer file (see [Tokenizers](#tokenizers) for preparing a tokenizer for SpecInfer).
* `-prompt`: (optional) path to the prompt file. SpecInfer expects a json format file for prompts, all of which will be served by SpecInfer. In addition, users can also use the following API for registering requests:

@@ -47,10 +51,10 @@ class RequestManager {
RequestGuid register_new_request(std::string const &prompt, int max_sequence_length);
}
```
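As a usage sketch of the API above (how a `RequestManager` instance is obtained, and which SpecInfer headers declare it, is left to the surrounding driver code and is an assumption here):

```cpp
// Sketch only: register a prompt programmatically instead of passing -prompt.
// RequestManager and RequestGuid are the SpecInfer types shown above; the
// manager instance is assumed to be provided by the caller.
#include <string>

RequestGuid enqueue_prompt(RequestManager &rm) {
  std::string const prompt = "What is speculative inference?";
  int const max_sequence_length = 128;
  // The returned guid identifies this request in later bookkeeping.
  return rm.register_new_request(prompt, max_sequence_length);
}
```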
For example, you can use the following command line to serve a LLaMA-6B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-190M models for speculative inference.
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-190M models for speculative inference.

```bash
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-weight /path/to/llm/weights -ssm-weight /path/to/ssm1/weights -smm-weight /path/to/ssm2/weights -tokenizer /path/to/tokenizer.model -prompt /path/to/prompt.json
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model llama -llm-weight /path/to/llm/weights -llm-config /path/to/llm/config.json -ssm-model llama -ssm-weight /path/to/ssm1/weights -ssm-config /path/to/ssm1/config.json -ssm-model llama -ssm-weight /path/to/ssm2/weights -ssm-config /path/to/ssm2/config.json -tokenizer /path/to/tokenizer.model -prompt /path/to/prompt.json
```
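The file passed to `-prompt` above is not part of the repository. A minimal sketch of creating one, assuming the expected format is a flat JSON list of prompt strings (see the prompt datasets below for authoritative examples):

```bash
# Assumed layout: a JSON array of prompt strings. Treat this as a sketch and
# compare against the published prompt datasets before relying on it.
cat > /path/to/prompt.json <<'EOF'
[
  "What is speculative inference?",
  "List three applications of large language models."
]
EOF
```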

### Tokenizers
@@ -60,7 +64,7 @@ SpecInfer supports two tokenizers:
* The GPT2 tokenizer is used to support the Open Pre-trained Transformer model family (e.g., OPT-13B and OPT-125M). To use it, download the [vocab](https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json) and [merges](https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt) files and pass the folder containing them as a parameter.
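For example, a short sketch of preparing the OPT tokenizer folder using the URLs listed above (the folder name is arbitrary):

```bash
# Download the GPT-2 vocab and merges files into one folder, then pass that
# folder to the -tokenizer argument.
mkdir -p ./opt_tokenizer
wget -P ./opt_tokenizer https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json
wget -P ./opt_tokenizer https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt
```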

### LLM Weights
The weight files using in our demo is extracted from HuggingFace, and stored in our AWS S3 bucket.
The weight files used in our demo are extracted from HuggingFace, and stored in our AWS S3 bucket.

| Model | Model id on Hugging Face | Storage Location |
| :---- | :---- | :---- |
@@ -69,11 +73,14 @@ The weight files using in our demo is extracted from HuggingFace, and stored in
| OPT-6.7B | facebook/opt-6.7b | s3://specinfer/weights/opt_6B_weights.tar.gz |
| OPT-125M | facebook/opt-125m | s3://specinfer/weights/opt_125m_native.tar.gz |

You can use [this script](../inference/spec_infer/MODEL_WEIGHTS.md) to convert the weights of a HuggingFace LLM to the SpecInfer weight format.
You can use [this script](../inference/utils/download_llama_weights.py) to automatically download and convert the weights of a HuggingFace LLAMA LLM and a LLAMA SSM to the SpecInfer weight format. The script also downloads the LLAMA tokenizer. If you would like to try the OPT model instead, use [this script](../inference/utils/download_opt_weights.py) to download (and convert) the OPT weights and tokenizer.
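A hedged sketch of invoking these scripts from the FlexFlow repository root (their exact command-line options and output locations are assumptions, so check each script's source or `--help` first):

```bash
# Hypothetical invocations; arguments and defaults are not documented here,
# so inspect the scripts before running them.
python3 inference/utils/download_llama_weights.py
python3 inference/utils/download_opt_weights.py
```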

### Prompt Datasets
We have evaluated SpecInfer on the following prompts datasets: [Chatbot instruction prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json), [ChatGPT Prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json), [WebQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/webqa.json), [Alpaca](https://specinfer.s3.us-east-2.amazonaws.com/prompts/alpaca.json), and [PIQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/piqa.json).

### Script to run the demo
You can take a look at [this script](../tests/inference_tests.sh), which is run in CI for each new commit, for an example of how to run the demo.
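For instance, from the repository root (the CI job invokes it the same way):

```bash
# Runs the end-to-end inference demo tests; requires a built FlexFlow tree and GPUs.
./tests/inference_tests.sh
```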

## Difference between SpecInfer and HuggingFace Assistant Model

There are two major differences between the two systems.
1 change: 1 addition & 0 deletions .github/workflows/build-skip.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths-ignore:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
2 changes: 2 additions & 0 deletions .github/workflows/build.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
@@ -14,6 +15,7 @@ on:
- "master"
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
1 change: 1 addition & 0 deletions .github/workflows/clang-format-check.yml
@@ -10,6 +10,7 @@ jobs:
- check: "src"
exclude: '\.proto$'
- check: "include"
- check: "inference"
- check: "nmt"
- check: "python"
- check: "scripts"
60 changes: 57 additions & 3 deletions .github/workflows/gpu-ci.yml
@@ -7,6 +7,7 @@ on:
- "python/**"
- "setup.py"
- "include/**"
- "inference/**"
- "src/**"
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
@@ -21,6 +22,7 @@ on:
- "python/**"
- "setup.py"
- "include/**"
- "inference/**"
- "src/**"
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
@@ -122,10 +124,64 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
./tests/align/test_all_operators.sh

inference-tests:
name: Inference Tests
runs-on: self-hosted
defaults:
run:
shell: bash -l {0} # required to use an activated conda environment
env:
CONDA: "3"
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
run: sudo add-apt-repository ppa:git-core/ppa -y && sudo apt update -y && sudo apt install -y --no-install-recommends git

- name: Checkout Git Repository
uses: actions/checkout@v3
with:
submodules: recursive

- name: Install conda and FlexFlow dependencies
uses: conda-incubator/setup-miniconda@v2
with:
miniconda-version: "latest"
activate-environment: flexflow
environment-file: conda/flexflow-cpu.yml
auto-activate-base: false

- name: Build FlexFlow
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export FF_HOME=$(pwd)
export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion
export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON
mkdir build
cd build
../config/config.linux
make -j

- name: Run inference tests
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export FF_HOME=$(pwd)
export CUDNN_DIR=/usr/local/cuda
export CUDA_DIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

# GPT tokenizer test
./tests/gpt_tokenizer_test.sh

# Inference tests
./tests/inference_tests.sh

gpu-ci-flexflow:
name: Single Machine, Multiple GPUs Tests
runs-on: self-hosted
needs: gpu-ci-concierge
needs: inference-tests
container:
image: ghcr.io/flexflow/flexflow-environment-cuda:latest
options: --gpus all --shm-size=8192m
@@ -162,8 +218,6 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/lib
# C++ tests
./tests/cpp_gpu_tests.sh 4
# GPT tokenizer test
./tests/gpt_tokenizer_test.sh
# Python tests
./tests/multi_gpu_tests.sh 4

1 change: 1 addition & 0 deletions .gitignore
@@ -181,3 +181,4 @@ train-labels-idx1-ubyte

# Logs
logs/
gpt_tokenizer
13 changes: 1 addition & 12 deletions CMakeLists.txt
@@ -550,21 +550,10 @@ if(FF_BUILD_MOE OR FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/LLAMA)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/opt)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/llama_spec_pipeline)
add_subdirectory(inference/spec_infer)
add_subdirectory(inference/incr_decoding)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/opt_spec_pipeline)
endif()

# installation
set(INCLUDE_DEST "include")
1 change: 1 addition & 0 deletions conda/flexflow-cpu.yml
@@ -17,3 +17,4 @@ dependencies:
- torch --index-url https://download.pytorch.org/whl/cpu
- torchaudio --index-url https://download.pytorch.org/whl/cpu
- torchvision --index-url https://download.pytorch.org/whl/cpu
- regex
39 changes: 0 additions & 39 deletions examples/cpp/inference/LLAMA/Makefile

This file was deleted.
