[0.6.0] Release Candidate (#481)
* chore: skip the driver worker

* chore: bump lmfe version to 0.10.3

* chore: some more marlin cleanups

* chore: deprecation warning for beam search

* feat: support FP8 for DeepSeekV2 MoE

* feat: add fuyu vision model and persimmon language model support

* fix: turn off cutlass scaled_mm for ada lovelace cards

* chore: allow quantizing all layers of deepseek-v2

* fix: build with the py-limited API in the Dockerfile

* OpenAI API Refactor (#591)

* feat: massive api server refactoring

* fix: tokenizer endpoint issues

* fix: BatchResponseData body should be optional

* chore: simplify pipeline parallel code in llama

* fix: convert image to RGB by default

* fix: allow getting the chat template from a url

* chore: avoid loading the unused layers and init the VLM up to the required feature space

* chore: enable bias w/ FP8 layers in CUTLASS kernels

* chore: upgrade flashinfer to 0.0.9

* feat: add custom triton cache manager

* chore: add CustomOp interface to UnquantizedFusedMoEMethod

* chore: handle aborted requests for jamba

* fix: minor fix for prompt adapter config

* feat: chat completions tokenization endpoint (#592)

* feat: optimize throughput to 1.4x by using numpy for token padding
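
An illustrative sketch of the idea (the function and argument names below are made up, not the actual Aphrodite code): build the padded token batch with a single numpy allocation instead of growing Python lists per sequence.

    import numpy as np

    def pad_token_batch(token_id_lists, pad_id, max_len):
        # Allocate the whole padded batch up front, then copy each row in,
        # instead of padding sequence-by-sequence with Python list operations.
        padded = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
        for i, ids in enumerate(token_id_lists):
            padded[i, :len(ids)] = ids
        return padded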

* feat: MoE support with Pallas GMM kernel for TPUs

* chore: log spec decoding metrics

* chore: separate kv_scale into k_scale and v_scale
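
A rough sketch of what the k_scale/v_scale split means (illustrative names, not the attention kernel code): the key and value caches each get their own fp8 dequantization scale instead of sharing a single kv_scale.

    import torch

    def dequant_kv(k_fp8: torch.Tensor, v_fp8: torch.Tensor,
                   k_scale: float, v_scale: float):
        # With separate scales, K and V can each use the dynamic range that
        # best fits its own cache instead of a shared compromise value.
        k = k_fp8.to(torch.float16) * k_scale
        v = v_fp8.to(torch.float16) * v_scale
        return k, v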

* feat: Asymmetric Tensor Parallel (#594)

* add utils for getting the partition offset and size for current tp rank

* disable asymmetric TP for quants and lora, handle GQA allocation

* the actual splitting work in the linear layers

* padding size for the vocab/lm_head should be optional

* cache engine and spec decode model runner (kwargs only)

* pass the tp_rank to model runners

* llama support

* update dockerfile

* let's not build these for now

* Revert "update dockerfile"

This reverts commit 6dd6408.

* fix: install wheel and packaging in docker

* fix: admin key arg

* let's try this again

* fix: mamba-ssm installation stuff

* chore: shutdown method for multiproc executor

* chore: log the message queue comms handle

* Port mamba kernels to Aphrodite (#595)

* kernels

* fix interface

* clean up dockerfile

* chore: set seed for dummy weights init

* fix: only create embeddings and lm_head when necessary for PP

* cleanup rocm dockerfile

* fix: some minor typing issues in spec decode

* fix: 4-node crash with PP

* chore: remove multimodal stuff from TPU

* fix: type annotation in worker

* refactor _prepare_model_input_tensor and attn metadata builder for most backends

* move prepare_inputs to the GPU (#596)

* add kernels

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

* sampler changes

* refactor the spec decode model runner and worker

---------

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

* update all benchmarks (#597)

* feat: add fp8 dynamic per-token quant kernel
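
For reference, dynamic per-token fp8 quantization computes one scale per row (token) at runtime; the kernel does this on the GPU, but the math is roughly the following (illustrative PyTorch, not the kernel itself):

    import torch

    def quant_fp8_per_token(x: torch.Tensor):
        # One scale per token: map each row's absolute max onto the fp8 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
        x_q = (x / scales).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return x_q, scales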

* feat: pipeline parallel support for mixtral

* feat: add fp8 channel-wise weight quantization support

* feat: add asymmetric TP support for Qwen2

* feat: add CPU offloading support (#598)

* fix: avoid secondary error in ShmRingBuffer destructor

* feat: add SPMD worker execution using Ray accelerated DAG

* `enable_gpu_advance_step` -> `allo_gpu_advance_step`

* chore: use the LoRA tokenizer in OpenAI API (#599)

* fix: use paged attention for block swapping/copying in flashinfer

* chore: refactor TPU model runner and worker

* chore: improve min_capability checking for `compressed-tensors`

* chore: implement fallback for fp8 channelwise using torch._scaled_mm

* fix: allow using mp executor for pipeline parallel

* fix: make speculative decoding work with per-request seed

* feat: non-uniform quantization via `compressed-tensors` for llama

* fix: the metrics endpoint was not mounted

* fix: raise an error for no draft token case when draft_tp>1

* chore: pass bias to quant_method.apply

* some small performance improvements

* update time since last collection for AsyncMetricsCollector

* fix shared memory bug w/ multi-node

* chore: enable dynamic per-token `fp8`

* docker: install libibverbs by default

* add scale_ub inputs to fp8 dynamic per-token quant

* chore: allow specifying custom Executor

* fix: request abort crashing pipeline parallel

* feat: fbgemm quantization support (#601)

* feat: fbgemm support

* missed this one

* register the quant

* chore: minor AMD fixes

* feat: support fbgemm_fp8 quant on ampere

* fix: input_scale for w8a8 is optional

* fix: channel-wise fp8 marlin

* move `aphrodite.endpoints.openai.chat_utils` -> `aphrodite.endpoints.chat_utils`

* feat: disable logprob serialization to CPU for spec decode

* chore: refactor and decouple phi3v image embedding

* fix: asymmetric TP changes breaking the gptq and awq quants (#602)

* feat: AWQ marlin kernels (#603)

* refactor gptq_marlin kernels to add awq support

* integrate

* fix: short commit hash import error

* feat: initial text-to-text support for Chameleon model

* clean up requirements

* chore: add a wrapper for torch.inference_mode decorator

* fix: `vocab_size` field access in llava

* fix: f-string fixes

* docs: add doc site with example content

* docs: add installation guides

* chore: add contribution guidelines + Code of Conduct (#507)

* add utils; proper project structuring

* add contributing guidelines and CoC

* remove docker

* chore: add contribution guidelines + Code of Conduct (#507)

* add utils; proper project structuring

* add contributing guidelines and CoC

* Refactor prompt processing (#605)

* wip

* finish up the refactor

* fix: use int64_t for indices in fp8 kernels

* feat: support loading lora adapters directly from HF

* chore: modularize prepare input and attn metadata builder

* fix: fbgemm_fp8 when modules_to_not_convert=None

* chore: automatically enable chunked prefill if model has large seqlen
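
The heuristic is along these lines (threshold and flag names below are placeholders, not the shipped values): only flip the switch when the user has not chosen explicitly and the model's context length is large.

    from typing import Optional

    LARGE_SEQLEN_THRESHOLD = 32 * 1024  # hypothetical cutoff, for illustration only

    def should_enable_chunked_prefill(max_model_len: int,
                                      user_choice: Optional[bool]) -> bool:
        # Respect an explicit user setting; otherwise enable chunked prefill
        # for long-context models where a full prefill would be too large.
        if user_choice is not None:
            return user_choice
        return max_model_len > LARGE_SEQLEN_THRESHOLD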

* chore: move some verbose logs to debug

* chore: further improve logging

* feat: support FP8 KV Cache scales from compressed-tensors

* feat: script for multi-node cluster setup

* chore: manage all http connections in one place

* feat: allow image inputs with chameleon models

* fix: support ignore patterns in model loader

* fix: beta value in gelu_tanh kernel being divided by 0.5

* fix: some naming issues

* chore: add ignored layers for fp8 quant

* chore: add usage data in each chunk for serving_chat

* bump transformers

* fix: cache spec decode metrics when they get collected

* feat: support loading pre-quanted bnb checkpoints

* fix: flashinfer cuda graph capture with pipeline parallel

* fix: token padding for chameleon

* fix: `ignore_patterns` -> `ignore_file_pattern` for modelscope

* fix: miscalculated latency leading to ttft inaccuracy

* chore: split `run_server` into `build_server` and `run_server`
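
Roughly the shape of the split, sketched with uvicorn and hypothetical signatures rather than the actual Aphrodite API: build_server constructs the server object so callers can inspect or reuse it, and run_server just awaits it.

    import uvicorn
    from fastapi import FastAPI

    def build_server(app: FastAPI, host: str, port: int) -> uvicorn.Server:
        # Construct, but do not start, the uvicorn server.
        return uvicorn.Server(uvicorn.Config(app, host=host, port=port))

    async def run_server(app: FastAPI, host: str, port: int) -> None:
        # Thin wrapper that actually serves the app built above.
        await build_server(app, host, port).serve()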

* chore: add fp8 support to `reshape_and_cache_flash`

* chore: tweaks to model/runner builder developer APIs

* chore: bump transformers

* fix: zmq hangs with large requests

* chore: represent tokens with identifiable strings

* feat: add support for MiniCPM-V

* fix: decode tokens w/ CUDA graphs and graphs with flashinfer

* fix: allow passing -q {gptq,awq}_marlin as the arg

* fix: encoding format for embedding example

* fix: add image placeholder for openai server for minicpmv

* feat: `fp8-marlin` channel-wise quant via `compressed-tensors`

* fix: `kv_cache_dtype=fp8` without scales for fp8 checkpoints

* fix: prevent possible data race by adding sync

* fix: nullptr channelwise scales when loading wNa16 models

* fix: define self.forward_dag before init_workers_ray

* fix: replicatedlinear weight loading

* fix: pass signal from the main thread

* chore: use array to speedup padding

* feat: add nemotron HF support (#606)

* fix: promote another index in fp8 kernel to int64_t

* chore: minor simplifications for Dockerfile.rocm

* chore: allow initializing TPU in initialize_ray_cluster

* feat: tensor parallelism for CPU backend

* chore: simplify squared relu

* fix: disable enforce_eager for bnb

* feat: support collective comms in XLA devices, e.g. TPUs

* chore: factor out the code for running uvicorn

* fix: illegal mem access for fp8 l3.1 405b

* fix: torch nightly version for rocm dockerfile

* fix: do not enable chunked prefill and prefix caching for jamba

* feat: add support for head_size of 120

* feat: tensor parallelism for TPU with ray

* fix: torch.set_num_threads() in multiproc_gpu_executor

* chore: consolidate all vision examples into one file

* feat: add blip-2 support

* chore: reduce XLA compile times

* fix: better logging for memory profiling

* chore: perform allreduce in fp32 for marlin, better logging

* fix: add nemotron to PP_SUPPORTED_MODELS

* fix: pass cutlass_fp8_supported correctly for fbgemm_fp8

* feat: add internvl support

* chore: tune fp8 kernels for ada lovelace cards

* fix: reduce unnecessary compute when logprobs=None

* fix: deprecation warnings in squeezellm quant_cuda_kernel

* chore: enable tpu tensor parallel in async engine

* fix: remove timm as a hardcoded requirement

* chore: make triton fully optional

* feat: add allowed_token_ids

* fix: unused variables in awq gemm kernel

* fix: divide-by-zero warnings in marlin kernels

* chore: tune int8 kernels for ada lovelace

* fix: greedy decoding in TPU

* fix: paligemma mmp

* fix: seeded gens with pipeline parallel

* fix: compiler warnings for _C and _moe

* chore: bump openvino toolkit to pre-release

* fix: remove scaled_fp8_quant_kernel padding footgun

* fix: massively improve throughput with high number of prompts

* fix: remove artifact

* chore: add punica sizes for mistral nemo

* fix: conditionally import outlines.caching

* fix: wrap all outlines imports

* chore: sort args (#608)

* chore: sort args

* Update aphrodite/modeling/guided_decoding/outlines_logits_processors.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* Update aphrodite/engine/args_tools.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* Update aphrodite/engine/args_tools.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* fix: Device Options category

* fix: Load Options category with load_format, dtype and ignore_patterns

* fix: API Options -> Inference Options with seed, served_model_name moved to model options

* fix: categorize based off of config.py and wrong category names

* fix: move `model` arg to the top

* fix: add missing `max_seq_lens_to_capture` arg

* chore: sort the device arg

* chore: move `model` arg to the top of the argparser

* chore: remove old comment

---------

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: AlpinDale <alpindale@gmail.com>

* fix: formatting

* feat: add yaml config parsing (#610)

* feat: add yaml config parsing
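
A minimal sketch of the idea (the loader and key handling are assumptions, not the exact implementation): values read from a YAML file become argparse defaults, so anything passed explicitly on the command line still wins.

    import argparse
    import yaml  # pyyaml

    def apply_yaml_config(parser: argparse.ArgumentParser, path: str) -> None:
        # Load the config file and register its values as defaults; explicit
        # CLI flags parsed afterwards override them.
        with open(path) as f:
            overrides = yaml.safe_load(f) or {}
        parser.set_defaults(**{k.replace("-", "_"): v
                               for k, v in overrides.items()})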

* fix: prompt adapters

* chore: add isort and refactor formatting script and utils

* fix: broadcasting logic for multi_modal_kwargs

* fix: set readonly=True for non-root TPU devices

* fix: logit processor exceeding vocab size

* feat: support for QQQ W4A8 quantization (#612)

* feat: add qqq marlin kernels

* integrate qqq quant

* fix: cleanup minicpm-v and port na_vit model

* fix: feature size calculation for Llava-next

* feat: use FusedMoE for jamba

* feat: allow loading specific layer numbers per device

* fix: fp8 marlin and cpu offloading with fp8 marlin

* chore: enable fp8 cutlass for ada lovelace

* chore: tune cutlass int8 kernels for sm_75

* feat: Triton Kernels for Punica (#613)

* feat: replace CUDA kernels w/ triton for lora

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* fix: __init__ in ops

* fix: replicated linear layer support for LoRA

* cleanup and relax the conditions for vocab_size and rank

* fix: add consolidated* to ignore_patterns

---------

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* chore: add pipeline parallel support for Qwen

* fix: don't use torch.generator() for TPU

* fix: skip loading lm_head for tie_word_embeddings models

* chore: optimize PP comm by replacing send with partial send + allgather

* fix: set a default max_tokens for OAI requests
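
The usual form of such a fix, sketched with hypothetical names: when a request omits max_tokens, fall back to whatever room is left in the context window.

    from typing import Optional

    def resolve_max_tokens(requested: Optional[int],
                           max_model_len: int, prompt_len: int) -> int:
        # Default to the remaining context budget when the client sends nothing.
        if requested is not None:
            return requested
        return max(max_model_len - prompt_len, 1)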

* chore: optimize scheduler and remove policy

* chore: bump torch to 2.4.0

* bump to torch 2.4.0, add aphrodite_flash_attn (#614)

* fix: RMSNorm forward in InternViT attention qk_layernorm

* fix: lower gemma's unloaded_params exception to warning

* feat: support logits soft capping with flash attention backend

* chore: optimize get_seqs

* fix: input shape for flashinfer prefill wrapper

* fix: remove error_on_invalid_device_count_status

* fix: remove unused code in sampler

* build: update torch to 2.4.0 for cpu

* kernels: disambiguate quantized types via a new ScalarType

Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

* chore: pipeline parallel with Ray accelerated dag

* fix: use loopback address for single node again

* chore: add env var to enable torch.compile

* revert: incorrect nightly build

* feat: add RPC server and client via ZMQ (#615)

* feat: add RPC server and client

Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* add async engine protocols

* add new methods to sync and async engine

* producer utils

* migrate serving engine, embedding and tokenization to rpc

* migrate text completions

* migrate chat completions

* migrate logits processors api

* migrate api server

* forgot the arg

* minor naming issues

---------

Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* refactor: factor out chat message parsing

* refactor: add has_prefix_cache_hit flag to FlashAttentionMetadataBuilder

* feat: add guided decoding to LLM

* chore: simplify output processing with shortcut for non-parallel sampling and non-beam search use case (#616)

* refactor: minicpmv and port Idefics2VisionTransformer

* refactor: factor out code for running uvicorn again

* feat: port SiglipVisionModel from transformers

* chore: add proper logging for spec decoding verification

* fix: support flashinfer for draft model runner

* fix: use ipv4 localhost form for zmq bind

* fix: use args.trust_remote_code

* chore: update cutlass to 3.5.1

* fix: specify device when loading lora and embedding tensors

* feat: non-blocking transfer in prepare_input

* feat: re-add GGUF (#600)

* refactor gguf kernels

* fix: incorrect filename for vecdotq header

* finish up the re-impl

* add requirements

* minor CI fixes

* ci: a few more ignores

* ci: remove clang-format

* ci: take one of fixing lint issues

* ci: codespell fixes

* ci: remove yapf

* ci: remove yapf from the formatting script

* ci: remove isort

* chore: minor cleanups

* fix: allow loading GGUF model without .gguf extension

* fix: cpu offloading with gptq

* docs: finalize User & Developer Documentation for Release Candidate (#618)

* add getting started page

* add debugging tips

* add openai docs

* add distributed guide

* add production metrics and model support matrix

* add guide for adding new models

* huge update

* add vlm usage docs

* ci: add action for deploying docs

* docs: fix typos

* chore: refactor wheel build script

* bump version to 0.6.0

---------

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Ahmed <mail@ahme.dev>
Co-authored-by: ewof <elwolf6@protonmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
11 people committed Sep 3, 2024
1 parent cbb9f85 commit f1d0b77
Showing 654 changed files with 118,338 additions and 34,355 deletions.
26 changes: 26 additions & 0 deletions .clang-format
@@ -0,0 +1,26 @@
BasedOnStyle: Google
UseTab: Never
IndentWidth: 2
ColumnLimit: 80

# Force pointers to the type for C++.
DerivePointerAlignment: false
PointerAlignment: Left

# Reordering #include statements can (and currently will) introduce errors
SortIncludes: false

# Style choices
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
IndentPPDirectives: BeforeHash

IncludeCategories:
  - Regex: '^<'
    Priority: 4
  - Regex: '^"(llvm|llvm-c|clang|clang-c|mlir|mlir-c)/'
    Priority: 3
  - Regex: '^"(qoda|\.\.)/'
    Priority: 2
  - Regex: '.*'
    Priority: 1
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
conda/
build/
venv/
File renamed without changes.
64 changes: 64 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,64 @@
# Sample workflow for building and deploying a VitePress site to GitHub Pages
#
name: Deploy VitePress site to Pages

on:
  # Runs on pushes targeting the `main` branch. Change this to `master` if you're
  # using the `master` branch as the default branch.
  push:
    branches: [main]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Not needed if lastUpdated is not enabled
      - uses: pnpm/action-setup@v3 # Uncomment this if you're using pnpm
      # - uses: oven-sh/setup-bun@v1 # Uncomment this if you're using Bun
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm # or pnpm / yarn
      - name: Setup Pages
        uses: actions/configure-pages@v4
      - name: Install dependencies
        run: pnpm install # or pnpm install / yarn install / bun install
      - name: Build with VitePress
        run: pnpm docs:build # or pnpm docs:build / yarn docs:build / bun run docs:build
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/.vitepress/dist

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    needs: build
    runs-on: ubuntu-latest
    name: Deploy
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
8 changes: 5 additions & 3 deletions .github/workflows/publish.yml
@@ -48,9 +48,9 @@ jobs:
      fail-fast: false
      matrix:
        os: ['ubuntu-20.04']
        python-version: ['3.8', '3.9', '3.10', '3.11']
        pytorch-version: ['2.3.0'] # Must be the most recent version that meets requirements-cuda.txt.
        cuda-version: ['11.8', '12.1']
        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
        pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt.
        cuda-version: ['12.4', '12.1', '11.8']

    steps:
      - name: Checkout
@@ -76,6 +76,8 @@ jobs:
      - name: Build wheel
        shell: bash
        env:
          CMAKE_BUILD_TYPE: Release
        run: |
          bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
          wheel_name=$(ls dist/*whl | xargs -n 1 basename)
8 changes: 4 additions & 4 deletions .github/workflows/ruff.yml
@@ -15,7 +15,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10"]
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
@@ -25,10 +25,10 @@ jobs:
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install ruff==0.1.5 codespell==2.2.6 tomli==2.0.1
        pip install ruff==0.1.5 codespell==2.3.0 tomli==2.0.1
    - name: Analysing the code with ruff
      run: |
        ruff aphrodite tests
        ruff .
    - name: Spelling check with codespell
      run: |
        codespell --toml pyproject.toml
        codespell --toml pyproject.toml
31 changes: 0 additions & 31 deletions .github/workflows/yapf.yml

This file was deleted.

5 changes: 3 additions & 2 deletions .gitignore
@@ -6,12 +6,12 @@ repos
*.so
.conda
build
dist*
.VSCodeCounter
conda/
umamba.exe
bin/
*.whl
aphrodite/commit_id.py
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -198,4 +198,5 @@ _build/

kv_cache_states/*
quant_params/*
.ruff_cache/
.ruff_cache/
images/