Commit
* chore: skip the driver worker
* chore: bump lmfe version to 0.10.3
* chore: some more marlin cleanups
* chore: deprecation warning for beam search
* feat: support FP8 for DeepSeekV2 MoE
* feat: add fuyu vision model and persimmon language model support
* fix: turn off cutlass scaled_mm for ada lovelace cards
* chore: allow quantizing all layers of deepseek-v2
* fix: build with the Python limited API in the Dockerfile
* OpenAI API Refactor (#591)
* feat: massive api server refactoring
* fix: tokenizer endpoint issues
* fix: BatchResponseData body should be optional
* chore: simplify pipeline parallel code in llama
* fix: convert image to RGB by default
* fix: allow getting the chat template from a URL
* chore: avoid loading the unused layers and init the VLM up to the required feature space
* chore: enable bias w/ FP8 layers in CUTLASS kernels
* chore: upgrade flashinfer to 0.0.9
* feat: add custom triton cache manager
* chore: add CustomOp interface to UnquantizedFusedMoEMethod
* chore: handle aborted requests for jamba
* fix: minor fix for prompt adapter config
* feat: chat completions tokenization endpoint (#592)
* feat: optimize throughput by 1.4x using numpy for token padding
* feat: MoE support with Pallas GMM kernel for TPUs
* chore: log spec decoding metrics
* chore: separate kv_scale into k_scale and v_scale
* feat: Asymmetric Tensor Parallel (#594)
* add utils for getting the partition offset and size for current tp rank
* disable asymmetric TP for quants and lora, handle GQA allocation
* the actual splitting work in the linear layers
* padding size for the vocab/lm_head should be optional
* cache engine and spec decode model runner (kwargs only)
* pass the tp_rank to model runners
* llama support
* update dockerfile
* let's not build these for now
* Revert "update dockerfile". This reverts commit 6dd6408.
* fix: install wheel and packaging in docker
* fix: admin key arg
* let's try this again
* fix: mamba-ssm installation stuff
* chore: shutdown method for multiproc executor
* chore: log the message queue comms handle
* Port mamba kernels to Aphrodite (#595)
* kernels
* fix interface
* clean up dockerfile
* chore: set seed for dummy weights init
* fix: only create embeddings and lm_head when necessary for PP
* cleanup rocm dockerfile
* fix: some minor typing issues in spec decode
* fix: 4-node crash with PP
* chore: remove multimodal stuff from TPU
* fix: type annotation in worker
* refactor _prepare_model_input_tensor and attn metadata builder for most backends
* move prepare_inputs to the GPU (#596)
* add kernels Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
* sampler changes
* refactor the spec decode model runner and worker
---------
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
* update all benchmarks (#597)
* feat: add fp8 dynamic per-token quant kernel
* feat: pipeline parallel support for mixtral
* feat: add fp8 channel-wise weight quantization support
* feat: add asymmetric TP support for Qwen2
* feat: add CPU offloading support (#598)
* fix: avoid secondary error in ShmRingBuffer destructor
* feat: add SPMD worker execution using Ray accelerated DAG
* `enable_gpu_advance_step` -> `allow_gpu_advance_step`
* chore: use the LoRA tokenizer in OpenAI API (#599)
* fix: use paged attention for block swapping/copying in flashinfer
* chore: refactor TPU model runner and worker
* chore: improve min_capability checking for `compressed-tensors`
* chore: implement fallback for fp8 channelwise using torch._scaled_mm
* fix: allow using mp executor for pipeline parallel
* fix: make speculative decoding work with per-request seed
* feat: non-uniform quantization via `compressed-tensors` for llama
* fix: the metrics endpoint was not mounted
* fix: raise an error for no draft token case when draft_tp>1
* chore: pass bias to quant_method.apply
* some small performance improvements
* update time since last collection for AsyncMetricsCollector
* fix shared memory bug w/ multi-node
* chore: enable dynamic per-token `fp8`
* docker: install libibverbs by default
* add scale_ub inputs to fp8 dynamic per-token quant
* chore: allow specifying custom Executor
* fix: request abort crashing pipeline parallel
* feat: fbgemm quantization support (#601)
* feat: fbgemm support
* missed this one
* register the quant
* chore: minor AMD fixes
* feat: support fbgemm_fp8 quant on ampere
* fix: input_scale for w8a8 is optional
* fix: channel-wise fp8 marlin
* move `aphrodite.endpoints.openai.chat_utils` -> `aphrodite.endpoints.chat_utils`
* feat: disable logprob serialization to CPU for spec decode
* chore: refactor and decouple phi3v image embedding
* fix: asymmetric TP changes breaking the gptq and awq quants (#602)
* feat: AWQ marlin kernels (#603)
* refactor gptq_marlin kernels to add awq support
* integrate
* fix: short commit hash import error
* feat: initial text-to-text support for Chameleon model
* clean up requirements
* chore: add a wrapper for torch.inference_mode decorator
* fix: `vocab_size` field access in llava
* fix: f-string fixes
* docs: add doc site with example content
* docs: add installation guides
* chore: add contribution guidelines + Code of Conduct (#507)
* add utils; proper project structuring
* add contributing guidelines and CoC
* remove docker
* chore: add contribution guidelines + Code of Conduct (#507)
* add utils; proper project structuring
* add contributing guidelines and CoC
* Refactor prompt processing (#605)
* wip
* finish up the refactor
* fix: use int64_t for indices in fp8 kernels
* feat: support loading lora adapters directly from HF
* chore: modularize prepare input and attn metadata builder
* fix: fbgemm_fp8 when modules_to_not_convert=None
* chore: automatically enable chunked prefill if model has large seqlen
* chore: move some verbose logs to debug
* chore: further improve logging
* feat: support FP8 KV Cache scales from compressed-tensors
* feat: script for multi-node cluster setup
* chore: manage all http connections in one place
* feat: allow image inputs with chameleon models
* fix: support ignore patterns in model loader
* fix: beta value in gelu_tanh kernel being divided by 0.5
* fix: some naming issues
* chore: add ignored layers for fp8 quant
* chore: add usage data in each chunk for serving_chat
* bump transformers
* fix: cache spec decode metrics when they get collected
* feat: support loading pre-quantized bnb checkpoints
* fix: flashinfer cuda graph capture with pipeline parallel
* fix: token padding for chameleon
* fix: `ignore_patterns` -> `ignore_file_pattern` for modelscope
* fix: miscalculated latency leading to TTFT inaccuracy
* chore: split `run_server` into `build_server` and `run_server`
* chore: add fp8 support to `reshape_and_cache_flash`
* chore: tweaks to model/runner builder developer APIs
* chore: bump transformers
* fix: zmq hangs with large requests
* chore: represent tokens with identifiable strings
* feat: add support for MiniCPM-V
* fix: decode tokens w/ CUDA graphs with flashinfer
* fix: allow passing -q {gptq,awq}_marlin as the arg
* fix: encoding format for embedding example
* fix: add image placeholder for openai server for minicpmv
* feat: `fp8-marlin` channel-wise quant via `compressed-tensors`
* fix: `kv_cache_dtype=fp8` without scales for fp8 checkpoints
* fix: prevent possible data race by adding sync
* fix: nullptr channelwise scales when loading wNa16 models
* fix: define self.forward_dag before init_workers_ray
* fix: ReplicatedLinear weight loading
* fix: pass signal from the main thread
* chore: use array to speed up padding
* feat: add nemotron HF support (#606)
* fix: promote another index in fp8 kernel to int64_t
* chore: minor simplifications for Dockerfile.rocm
* chore: allow initializing TPU in initialize_ray_cluster
* feat: tensor parallelism for CPU backend
* chore: simplify squared relu
* fix: disable enforce_eager for bnb
* feat: support collective comms in XLA devices, e.g. TPUs
* chore: factor out the code for running uvicorn
* fix: illegal mem access for fp8 Llama 3.1 405B
* fix: torch nightly version for rocm dockerfile
* fix: do not enable chunked prefill and prefix caching for jamba
* feat: add support for head_size of 120
* feat: tensor parallelism for TPU with ray
* fix: torch.set_num_threads() in multiproc_gpu_executor
* chore: consolidate all vision examples into one file
* feat: add blip-2 support
* chore: reduce XLA compile times
* fix: better logging for memory profiling
* chore: perform allreduce in fp32 for marlin, better logging
* fix: add nemotron to PP_SUPPORTED_MODELS
* fix: pass cutlass_fp8_supported correctly for fbgemm_fp8
* feat: add internvl support
* chore: tune fp8 kernels for ada lovelace cards
* fix: reduce unnecessary compute when logprobs=None
* fix: deprecation warnings in squeezellm quant_cuda_kernel
* chore: enable tpu tensor parallel in async engine
* fix: remove timm as a hardcoded requirement
* chore: make triton fully optional
* feat: add allowed_token_ids
* fix: unused variables in awq gemm kernel
* fix: divide-by-zero warnings in marlin kernels
* chore: tune int8 kernels for ada lovelace
* fix: greedy decoding on TPU
* fix: paligemma mmp
* fix: seeded gens with pipeline parallel
* fix: compiler warnings for _C and _moe
* chore: bump openvino toolkit to pre-release
* fix: remove scaled_fp8_quant_kernel padding footgun
* fix: massively improve throughput with high number of prompts
* fix: remove artifact
* chore: add punica sizes for mistral nemo
* fix: conditionally import outlines.caching
* fix: wrap all outlines imports
* chore: sort args (#608)
* chore: sort args
* Update aphrodite/modeling/guided_decoding/outlines_logits_processors.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* Update aphrodite/engine/args_tools.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* Update aphrodite/engine/args_tools.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* fix: Device Options category
* fix: Load Options category with load_format, dtype and ignore_patterns
* fix: API Options -> Inference Options with seed, served_model_name moved to model options
* fix: categorize based on config.py and fix wrong category names
* fix: move `model` arg to the top
* fix: add missing `max_seq_lens_to_capture` arg
* chore: sort the device arg
* chore: move `model` arg to the top of the argparser
* chore: remove old comment
---------
Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: AlpinDale <alpindale@gmail.com>
* fix: formatting
* feat: add yaml config parsing (#610)
* feat: add yaml config parsing
* fix: prompt adapters
* chore: add isort and refactor formatting script and utils
* fix: broadcasting logic for multi_modal_kwargs
* fix: set readonly=True for non-root TPU devices
* fix: logit processor exceeding vocab size
* feat: support for QQQ W4A8 quantization (#612)
* feat: add qqq marlin kernels
* integrate qqq quant
* fix: cleanup minicpm-v and port na_vit model
* fix: feature size calculation for Llava-next
* feat: use FusedMoE for jamba
* feat: allow loading specific layer numbers per device
* fix: fp8 marlin and cpu offloading with fp8 marlin
* chore: enable fp8 cutlass for ada lovelace
* chore: tune cutlass int8 kernels for sm_75
* feat: Triton Kernels for Punica (#613)
* feat: replace CUDA kernels w/ triton for lora Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
* fix: __init__ in ops
* fix: replicated linear layer support for LoRA
* cleanup and relax the conditions for vocab_size and rank
* fix: add consolidated* to ignore_patterns
---------
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
* chore: add pipeline parallel support for Qwen
* fix: don't use torch.generator() for TPU
* fix: skip loading lm_head for tie_word_embeddings models
* chore: optimize PP comm by replacing send with partial send + allgather
* fix: set a default max_tokens for OAI requests
* chore: optimize scheduler and remove policy
* chore: bump torch to 2.4.0
* bump to torch 2.4.0, add aphrodite_flash_attn (#614)
* fix: RMSNorm forward in InternViT attention qk_layernorm
* fix: lower gemma's unloaded_params exception to warning
* feat: support logits soft capping with flash attention backend
* chore: optimize get_seqs
* fix: input shape for flashinfer prefill wrapper
* fix: remove error_on_invalid_device_count_status
* fix: remove unused code in sampler
* build: update torch to 2.4.0 for cpu
* kernels: disambiguate quantized types via a new ScalarType Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
* chore: pipeline parallel with Ray accelerated DAG
* fix: use loopback address for single node again
* chore: add env var to enable torch.compile
* revert: incorrect nightly build
* feat: add RPC server and client via ZMQ (#615)
* feat: add RPC server and client Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Joe Runde <joe@joerun.de> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
* add async engine protocols
* add new methods to sync and async engine
* producer utils
* migrate serving engine, embedding and tokenization to rpc
* migrate text completions
* migrate chat completions
* migrate logits processors api
* migrate api server
* forgot the arg
* minor naming issues
---------
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
* refactor: factor out chat message parsing
* refactor: add has_prefix_cache_hit flag to FlashAttentionMetadataBuilder
* feat: add guided decoding to LLM
* chore: simplify output processing with shortcut for non-parallel sampling and non-beam search use case (#616)
* refactor: minicpmv and port Idefics2VisionTransformer
* refactor: factor out code for running uvicorn again
* feat: port SiglipVisionModel from transformers
* chore: add proper logging for spec decoding verification
* fix: support flashinfer for draft model runner
* fix: use ipv4 localhost form for zmq bind
* fix: use args.trust_remote_code
* chore: update cutlass to 3.5.1
* fix: specify device when loading lora and embedding tensors
* feat: non-blocking transfer in prepare_input
* feat: re-add GGUF (#600)
* refactor gguf kernels
* fix: incorrect filename for vecdotq header
* finish up the re-impl
* add requirements
* minor CI fixes
* ci: a few more ignores
* ci: remove clang-format
* ci: take one of fixing lint issues
* ci: codespell fixes
* ci: remove yapf
* ci: remove yapf from the formatting script
* ci: remove isort
* chore: minor cleanups
* fix: allow loading GGUF model without .gguf extension
* fix: cpu offloading with gptq
* docs: finalize User & Developer Documentation for Release Candidate (#618)
* add getting started page
* add debugging tips
* add openai docs
* add distributed guide
* add production metrics and model support matrix
* add guide for adding new models
* huge update
* add vlm usage docs
* ci: add action for deploying docs
* docs: fix typos
* chore: refactor wheel build script
* bump version to 0.6.0
---------
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Ahmed <mail@ahme.dev>
Co-authored-by: ewof <elwolf6@protonmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
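
One of the more user-visible items in the list above is CPU offloading support (#598). The snippet below is only an illustrative sketch, not code from the repository: it assumes the offline `LLM` entry point exported by the `aphrodite` package and a `cpu_offload_gb` keyword argument mirroring the upstream vLLM convention; the argument name and the model name are assumptions.

```python
# Illustrative sketch only. Assumes `aphrodite` exports LLM/SamplingParams
# and that the CPU offloading feature (#598) is exposed as `cpu_offload_gb`
# (argument name is an assumption, not confirmed by this changelog).
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,  # keep ~4 GiB of weights in CPU memory instead of VRAM
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```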
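Several entries also touch vision support in the OpenAI-compatible server (e.g. the MiniCPM-V image placeholder fix). Below is a hedged client-side sketch using the standard OpenAI chat format with an `image_url` content part; the host, port, API key, served model name, and image URL are placeholders, not values documented in this log.

```python
# Illustrative client sketch against an OpenAI-compatible server.
# All connection details and the model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```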