[Inference] - Cleanup/refactor #732

Merged: 29 commits, May 21, 2023
Changes from all commits
21 changes: 14 additions & 7 deletions .github/README.md
@@ -28,17 +28,21 @@ for serving generative LLMs while provably preserving model quality.
<img src="../img/performance.png" alt="Performance comparison" height="320"/>
</p>

## Install SpecInfer
SpecInfer is built on top of FlexFlow. You can install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](INSTALL.md) for installing FlexFlow from source code. If you would like to quickly try SpecInfer, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend, [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCM backend) with all dependencies pre-installed (N.B.: currently, the CUDA pre-built containers are only fully compatible with host machines that have CUDA 11.7 installed), together with [Dockerfiles](./docker) if you wish to build the containers manually.
## Build/Install SpecInfer
SpecInfer is built on top of FlexFlow. You can build/install SpecInfer by building the inference branch of FlexFlow. Please read the [instructions](../INSTALL.md) for building/installing FlexFlow from source code. If you would like to quickly try SpecInfer, we also provide pre-built Docker packages ([flexflow-cuda](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-cuda) with a CUDA backend, [flexflow-hip_rocm](https://github.com/flexflow/FlexFlow/pkgs/container/flexflow-hip_rocm) with a HIP-ROCM backend) with all dependencies pre-installed (N.B.: currently, the CUDA pre-built containers are only fully compatible with host machines that have CUDA 11.7 installed), together with [Dockerfiles](./docker) if you wish to build the containers manually.
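For example, a minimal sketch of pulling the pre-built CUDA container, assuming the image is published under the `ghcr.io/flexflow` namespace implied by the package link above (verify the exact image name and tag on the registry page):

```bash
# Assumed image name and tag based on the package link above; confirm them on
# the GitHub Container Registry page before pulling.
docker pull ghcr.io/flexflow/flexflow-cuda:latest
```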

## Run SpecInfer
The source code of the SpecInfer pipeline is available at [this folder](../inference/spec_infer/). The SpecInfer executable will be available at `/build_dir/inference/spec_infer/spec_infer` after compilation. You can use the following command-line arguments to run SpecInfer:

* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. SpecInfer keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
* `-llm-model`: the LLM model type as a case-insensitive string (e.g. "opt" or "llama")
* `-llm-weight`: path to the folder that stores the LLM weights
* `-ssm-weight`: path to the folder that stores the small speculative models' weights. You can use multiple `-ssm-weight`s in the command line to launch multiple SSMs.
* `-llm-config`: path to the json file that stores the LLM model configs
* `-ssm-model`: the SSM model type as a case-insensitive string (e.g. "opt" or "llama"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-ssm-weight`: path to the folder that stores the small speculative models' weights. The number of `-ssm-weight`s must match the number of `-ssm-model`s and `-ssm-config`s.
* `-ssm-config`: path to the json file that stores the SSM model configs. The number of `-ssm-config`s must match the number of `-ssm-model`s and `-ssm-weight`s.
* `-tokenizer`: path to the tokenizer file (see [Tokenizers](#tokenizers) for preparing a tokenizer for SpecInfer).
* `-prompt`: (optional) path to the prompt file. SpecInfer expects a json format file for prompts, all of which will be served by SpecInfer. In addition, users can also use the following API for registering requests:

@@ -47,10 +51,10 @@ class RequestManager {
RequestGuid register_new_request(std::string const &prompt, int max_sequence_length);
}
```
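As a usage sketch of the API above (how a `RequestManager` instance is obtained, and which SpecInfer headers declare it, is left to the surrounding driver code and is an assumption here):

```cpp
// Sketch only: register a prompt programmatically instead of passing -prompt.
// RequestManager and RequestGuid are the SpecInfer types shown above; the
// manager instance is assumed to be provided by the caller.
#include <string>

RequestGuid enqueue_prompt(RequestManager &rm) {
  std::string const prompt = "What is speculative inference?";
  int const max_sequence_length = 128;
  // The returned guid identifies this request in later bookkeeping.
  return rm.register_new_request(prompt, max_sequence_length);
}
```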
For example, you can use the following command line to serve a LLaMA-6B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-190M models for speculative inference.
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-190M models for speculative inference.

```bash
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-weight /path/to/llm/weights -ssm-weight /path/to/ssm1/weights -smm-weight /path/to/ssm2/weights -tokenizer /path/to/tokenizer.model -prompt /path/to/prompt.json
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model llama -llm-weight /path/to/llm/weights -llm-config /path/to/llm/config.json -ssm-model llama -ssm-weight /path/to/ssm1/weights -ssm-config /path/to/ssm1/config.json -ssm-model llama -ssm-weight /path/to/ssm2/weights -ssm-config /path/to/ssm2/config.json -tokenizer /path/to/tokenizer.model -prompt /path/to/prompt.json
```
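The file passed to `-prompt` above is not part of the repository. A minimal sketch of creating one, assuming the expected format is a flat JSON list of prompt strings (see the prompt datasets below for authoritative examples):

```bash
# Assumed layout: a JSON array of prompt strings. Treat this as a sketch and
# compare against the published prompt datasets before relying on it.
cat > /path/to/prompt.json <<'EOF'
[
  "What is speculative inference?",
  "List three applications of large language models."
]
EOF
```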

### Tokenizers
@@ -60,7 +64,7 @@ SpecInfer supports two tokenizers:
* The GPT2 tokenizer is used to support the Open Pre-trained Transformer model family (e.g., OPT-13B and OPT-125M). To use it, download the [vocab](https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json) and [merges](https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt) files and pass the folder containing them as a parameter.
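For example, a short sketch of preparing the OPT tokenizer folder using the URLs listed above (the folder name is arbitrary):

```bash
# Download the GPT-2 vocab and merges files into one folder, then pass that
# folder to the -tokenizer argument.
mkdir -p ./opt_tokenizer
wget -P ./opt_tokenizer https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json
wget -P ./opt_tokenizer https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt
```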

### LLM Weights
The weight files using in our demo is extracted from HuggingFace, and stored in our AWS S3 bucket.
The weight files used in our demo are extracted from HuggingFace, and stored in our AWS S3 bucket.

| Model | Model id on Hugging Face | Storage Location |
| :---- | :---- | :---- |
@@ -69,11 +73,14 @@ The weight files using in our demo is extracted from HuggingFace, and stored in
| OPT-6.7B | facebook/opt-6.7b | s3://specinfer/weights/opt_6B_weights.tar.gz |
| OPT-125M | facebook/opt-125m | s3://specinfer/weights/opt_125m_native.tar.gz |

You can use [this script](../inference/spec_infer/MODEL_WEIGHTS.md) to convert the weights of a HuggingFace LLM to the SpecInfer weight format.
You can use [this script](../inference/utils/download_llama_weights.py) to automatically download and convert the weights of a HuggingFace LLAMA LLM and a LLAMA SSM to the SpecInfer weight format. The script also downloads the LLAMA tokenizer. If you would like to try the OPT model instead, use [this script](../inference/utils/download_opt_weights.py) to download (and convert) the OPT weights and tokenizer.
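A hedged sketch of invoking these scripts from the FlexFlow repository root (their exact command-line options and output locations are assumptions, so check each script's source or `--help` first):

```bash
# Hypothetical invocations; arguments and defaults are not documented here,
# so inspect the scripts before running them.
python3 inference/utils/download_llama_weights.py
python3 inference/utils/download_opt_weights.py
```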

### Prompt Datasets
We have evaluated SpecInfer on the following prompts datasets: [Chatbot instruction prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json), [ChatGPT Prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json), [WebQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/webqa.json), [Alpaca](https://specinfer.s3.us-east-2.amazonaws.com/prompts/alpaca.json), and [PIQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/piqa.json).

### Script to run the demo
You can take a look at [this script](../tests/inference_tests.sh), which is run in CI for each new commit, for an example of how to run the demo.
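For instance, from the repository root (the CI job invokes it the same way):

```bash
# Runs the end-to-end inference demo tests; requires a built FlexFlow tree and GPUs.
./tests/inference_tests.sh
```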

## Difference between SpecInfer and HuggingFace Assistant Model

There are two major differences between the two systems.
1 change: 1 addition & 0 deletions .github/workflows/build-skip.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths-ignore:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
2 changes: 2 additions & 0 deletions .github/workflows/build.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
@@ -14,6 +15,7 @@ on:
- "master"
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "python/**"
1 change: 1 addition & 0 deletions .github/workflows/clang-format-check.yml
@@ -10,6 +10,7 @@ jobs:
- check: "src"
exclude: '\.proto$'
- check: "include"
- check: "inference"
- check: "nmt"
- check: "python"
- check: "scripts"
60 changes: 57 additions & 3 deletions .github/workflows/gpu-ci.yml
@@ -7,6 +7,7 @@ on:
- "python/**"
- "setup.py"
- "include/**"
- "inference/**"
- "src/**"
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
@@ -21,6 +22,7 @@ on:
- "python/**"
- "setup.py"
- "include/**"
- "inference/**"
- "src/**"
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
@@ -122,10 +124,64 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
./tests/align/test_all_operators.sh

inference-tests:
name: Inference Tests
runs-on: self-hosted
defaults:
run:
shell: bash -l {0} # required to use an activated conda environment
env:
CONDA: "3"
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
run: sudo add-apt-repository ppa:git-core/ppa -y && sudo apt update -y && sudo apt install -y --no-install-recommends git

- name: Checkout Git Repository
uses: actions/checkout@v3
with:
submodules: recursive

- name: Install conda and FlexFlow dependencies
uses: conda-incubator/setup-miniconda@v2
with:
miniconda-version: "latest"
activate-environment: flexflow
environment-file: conda/flexflow-cpu.yml
auto-activate-base: false

- name: Build FlexFlow
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export FF_HOME=$(pwd)
export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion
export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON
mkdir build
cd build
../config/config.linux
make -j

- name: Run inference tests
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export FF_HOME=$(pwd)
export CUDNN_DIR=/usr/local/cuda
export CUDA_DIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

# GPT tokenizer test
./tests/gpt_tokenizer_test.sh

# Inference tests
./tests/inference_tests.sh

gpu-ci-flexflow:
name: Single Machine, Multiple GPUs Tests
runs-on: self-hosted
needs: gpu-ci-concierge
needs: inference-tests
container:
image: ghcr.io/flexflow/flexflow-environment-cuda:latest
options: --gpus all --shm-size=8192m
@@ -162,8 +218,6 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/lib
# C++ tests
./tests/cpp_gpu_tests.sh 4
# GPT tokenizer test
./tests/gpt_tokenizer_test.sh
# Python tests
./tests/multi_gpu_tests.sh 4

1 change: 1 addition & 0 deletions .gitignore
@@ -181,3 +181,4 @@ train-labels-idx1-ubyte

# Logs
logs/
gpt_tokenizer
13 changes: 1 addition & 12 deletions CMakeLists.txt
@@ -550,21 +550,10 @@ if(FF_BUILD_MOE OR FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/LLAMA)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/opt)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/llama_spec_pipeline)
add_subdirectory(inference/spec_infer)
add_subdirectory(inference/incr_decoding)
endif()

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(examples/cpp/inference/opt_spec_pipeline)
endif()

# installation
set(INCLUDE_DEST "include")
1 change: 1 addition & 0 deletions conda/flexflow-cpu.yml
@@ -17,3 +17,4 @@ dependencies:
- torch --index-url https://download.pytorch.org/whl/cpu
- torchaudio --index-url https://download.pytorch.org/whl/cpu
- torchvision --index-url https://download.pytorch.org/whl/cpu
- regex
39 changes: 0 additions & 39 deletions examples/cpp/inference/LLAMA/Makefile

This file was deleted.
