chore: use the LoRA tokenizer in OpenAI API #599

Merged: 1 commit into rc_054 from lora-tokenizer-api on Aug 24, 2024

Conversation

AlpinDale (Member)

No description provided.
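
Since the PR body is empty, a note on what the title refers to: an OpenAI-compatible server that serves LoRA adapters should tokenize each request with the adapter's own tokenizer when one is bundled, rather than always using the base model's. A minimal sketch of that idea, with hypothetical helper names (not the actual Aphrodite code):

```python
from typing import Optional

from transformers import AutoTokenizer, PreTrainedTokenizerBase


def resolve_tokenizer(base_tokenizer: PreTrainedTokenizerBase,
                      lora_path: Optional[str]) -> PreTrainedTokenizerBase:
    """Prefer a tokenizer shipped with the requested LoRA adapter,
    falling back to the base model's tokenizer otherwise."""
    if lora_path is None:
        return base_tokenizer
    try:
        # Many adapters bundle tokenizer files (e.g. added special tokens).
        return AutoTokenizer.from_pretrained(lora_path)
    except (OSError, ValueError):
        # The adapter ships no tokenizer of its own.
        return base_tokenizer


# Per request: pick the tokenizer before encoding / counting prompt tokens.
# tokenizer = resolve_tokenizer(base_tokenizer, request.lora_path)
# prompt_ids = tokenizer.encode(prompt)
```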

AlpinDale merged commit a26f784 into rc_054 on Aug 24, 2024.
AlpinDale deleted the lora-tokenizer-api branch on August 24, 2024 at 22:49.
50h100a pushed a commit to 50h100a/aphrodite-engine that referenced this pull request on Sep 1, 2024.
AlpinDale added a commit that referenced this pull request on Sep 3, 2024, with the following message:
* chore: skip the driver worker

* chore: bump lmfe version to 0.10.3

* chore: some more marlin cleanups

* chore: deprecation warning for beam search

* feat: support FP8 for DeepSeekV2 MoE

* feat: add fuyu vision model and persimmon language model support

* fix: turn off cutlass scaled_mm for ada lovelace cards

* chore: allow quantizing all layers of deepseek-v2

* fix: build with the Python limited API (py_limited_api) in the Dockerfile

* OpenAI API Refactor (#591)

* feat: massive api server refactoring

* fix: tokenizer endpoint issues

* fix: BatchResponseData body should be optional

* chore: simplify pipeline parallel code in llama

* fix: convert image to RGB by default

* fix: allow getting the chat template from a url

* chore: avoid loading the unused layers and init the VLM up to the required feature space

* chore: enable bias w/ FP8 layers in CUTLASS kernels

* chore: upgrade flashinfer to 0.0.9

* feat: add custom triton cache manager

* chore: add CustomAP interface to UnquantizedFusedMoEMethod

* chore: handle aborted requests for jamba

* fix: minor fix for prompt adapter config

* feat: chat completions tokenization endpoint (#592)

* feat: optimize throughput to 1.4x by using numpy for token padding
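
For illustration only (not the commit's code): the win comes from building the padded token matrix with NumPy array operations instead of per-token Python loops, roughly like this:

```python
import numpy as np


def pad_token_ids(seqs: list[list[int]], pad_id: int = 0) -> np.ndarray:
    """Right-pad variable-length token ID lists into one 2D int64 array."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, : len(s)] = s  # one vectorized row copy instead of per-token appends
    return out


print(pad_token_ids([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]
```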

* feat: MoE support with Pallas GMM kernel for TPUs

* chore: log spec decoding metrics

* chore: separate kv_scale into k_scale and v_scale

* feat: Asymmetric Tensor Parallel (#594)

* add utils for getting the partition offset and size for current tp rank

* disable asymmetric TP for quants and lora, handle GQA allocation

* the actual splitting work in the linear layers

* padding size for the vocab/lm_head should be optional

* cache engine and spec decode model runner (kwargs only)

* pass the tp_rank to model runners

* llama support

* update dockerfile

* let's not build these for now

* Revert "update dockerfile"

This reverts commit 6dd6408.

* fix: install wheel and packaging in docker

* fix: admin key arg

* let's try this again

* fix: mamba-ssm installation stuff

* chore: shutdown method for multiproc executor

* chore: log the message queue comms handle

* Port mamba kernels to Aphrodite (#595)

* kernels

* fix interface

* clean up dockerfile

* chore: set seed for dummy weights init

* fix: only create embeddings and lm_head when necessary for PP

* cleanup rocm dockerfile

* fix: some minor typing issues in spec decode

* fix: 4-node crash with PP

* chore: remove multimodal stuff from TPU

* fix: type annotation in worker

* refactor _prepare_model_input_tensor and attn metadata builder for most backends

* move prepare_inputs to the GPU (#596)

* add kernels

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

* sampler changes

* refactor the spec decode model runner and worker

---------

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

* update all benchmarks (#597)

* feat: add fp8 dynamic per-token quant kernel
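
As a rough illustration of what a dynamic per-token FP8 quant kernel computes (a PyTorch sketch of the math, not the CUDA kernel):

```python
import torch


def quant_fp8_per_token(x: torch.Tensor):
    """Dynamic per-token FP8 (e4m3) quantization: one scale per row."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12) / fp8_max
    q = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return q, scale


x = torch.randn(4, 8, dtype=torch.float16)
q, s = quant_fp8_per_token(x)
x_approx = q.to(torch.float16) * s  # dequantize to check the round trip
```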

* feat: pipeline parallel support for mixtral

* feat: add fp8 channel-wise weight quantization support

* feat: add asymmetric TP support for Qwen2

* feat: add CPU offloading support (#598)

* fix: avoid secondary error in ShmRingBuffer destructor

* feat: add SPMD worker execution using Ray accelerated DAG

* `enable_gpu_advance_step` -> `allow_gpu_advance_step`

* chore: use the LoRA tokenizer in OpenAI API (#599)

* fix: use paged attention for block swapping/copying in flashinfer

* chore: refactor TPU model runner and worker

* chore: improve min_capability checking for `compressed-tensors`

* chore: implement fallback for fp8 channelwise using torch._scaled_mm

* fix: allow using mp executor for pipeline parallel

* fix: make speculative decoding work with per-request seed

* feat: non-uniform quantization via `compressed-tensors` for llama

* fix: the metrics endpoint was not mounted

* fix: raise an error for no draft token case when draft_tp>1

* chore: pass bias to quant_method.apply

* some small performance improvements

* update time since last collection for AsyncMetricsCollector

* fix shared memory bug w/ multi-node

* chore: enable dynamic per-token `fp8`

* docker: install libibverbs by default

* add scale_ub inputs to fp8 dynamic per-token quant

* chore: allow specifying custom Executor

* fix: request abort crashing pipeline parallel

* feat: fbgemm quantization support (#601)

* feat: fbgemm support

* missed this one

* register the quant

* chore: minor AMD fixes

* feat: support fbgemm_fp8 quant on ampere

* fix: input_scale for w8a8 is optional

* fix: channel-wise fp8 marlin

* move `aphrodite.endpoints.openai.chat_utils` -> `aphrodite.endpoints.chat_utils`

* feat: disable logprob serialization to CPU for spec decode

* chore: refactor and decouple phi3v image embedding

* fix: asymmetric TP changes breaking the gptq and awq quants (#602)

* feat: AWQ marlin kernels (#603)

* refactor gptq_marlin kernels to add awq support

* integrate

* fix: short commit hash import error

* feat: initial text-to-text support for Chameleon model

* clean up requirements

* chore: add a wrapper for torch.inference_mode decorator

* fix: `vocab_size` field access in llava

* fix: f-string fixes

* docs: add doc site with example content

* docs: add installation guides

* chore: add contribution guidelines + Code of Conduct (#507)

* add utils; proper project structuring

* add contributing guidelines and CoC

* remove docker

* Refactor prompt processing (#605)

* wip

* finish up the refactor

* fix: use int64_t for indices in fp8 kernels

* feat: support loading lora adapters directly from HF

* chore: modularize prepare input and attn metadata builder

* fix: fbgemm_fp8 when modules_to_not_convert=None

* chore: automatically enable chunked prefill if model has large seqlen
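
Roughly, such a heuristic amounts to the following (the threshold and names here are illustrative, not Aphrodite's actual defaults):

```python
from typing import Optional

# Illustrative threshold only; the real default lives in the engine config.
LARGE_SEQLEN_THRESHOLD = 32_768


def resolve_chunked_prefill(max_model_len: int,
                            user_choice: Optional[bool]) -> bool:
    """Enable chunked prefill by default for very long-context models,
    unless the user explicitly set the option either way."""
    if user_choice is not None:
        return user_choice
    return max_model_len > LARGE_SEQLEN_THRESHOLD


assert resolve_chunked_prefill(131_072, None) is True
assert resolve_chunked_prefill(8_192, None) is False
assert resolve_chunked_prefill(131_072, False) is False
```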

* chore: move some verbose logs to debug

* chore: further improve logging

* feat: support FP8 KV Cache scales from compressed-tensors

* feat: script for multi-node cluster setup

* chore: manage all http connections in one place

* feat: allow image inputs with chameleon models

* fix: support ignore patterns in model loader

* fix: beta value in gelu_tanh kernel being divided by 0.5

* fix: some naming issues

* chore: add ignored layers for fp8 quant

* chore: add usage data in each chunk for serving_chat

* bump transformers

* fix: cache spec decode metrics when they get collected

* feat: support loading pre-quanted bnb checkpoints

* fix: flashinfer cuda graph capture with pipeline parallel

* fix: token padding for chameleon

* fix: `ignore_patterns` -> `ignore_file_pattern` for modelscope

* fix: miscalculated latency leading to ttft inaccuracy

* chore: split `run_server` into `build_server` and `run_server`

* chore: add fp8 support to `reshape_and_cache_flash`

* chore: tweaks to model/runner builder developer APIs

* chore: bump transformers

* fix: zmq hangs with large requests

* chore: represent tokens with identifiable strings

* feat: add support for MiniCPM-V

* fix: decode tokens w/ CUDA graphs and CUDA graph capture with flashinfer

* fix: allow passing -q {gptq,awq}_marlin as the arg

* fix: encoding format for embedding example

* fix: add image placeholder for openai server for minicpmv

* feat: `fp8-marlin` channel-wise quant via `compressed-tensors`

* fix: `kv_cache_dtype=fp8` without scales for fp8 checkpoints

* fix: prevent possible data race by adding sync

* fix: nullptr channelwise scales when loading wNa16 models

* fix: define self.forward_dag before init_workers_ray

* fix: ReplicatedLinear weight loading

* fix: pass signal from the main thread

* chore: use array to speedup padding

* feat: add nemotron HF support (#606)

* fix: promote another index in fp8 kernel to int64_t

* chore: minor simplifications for Dockerfile.rocm

* chore: allow initializing TPU in initialize_ray_cluster

* feat: tensor parallelism for CPU backend

* chore: simplify squared relu

* fix: disable enforce_eager for bnb

* feat: support collective comms in XLA devices, e.g. TPUs

* chore: factor out the code for running uvicorn

* fix: illegal mem access for fp8 l3.1 405b

* fix: torch nightly version for rocm dockerfile

* fix: do not enable chunked prefill and prefix caching for jamba

* feat: add support for head_size of 120

* feat: tensor parallelism for TPU with ray

* fix: torch.set_num_threads() in multiproc_gpu_executor

* chore: consolidate all vision examples into one file

* feat: add blip-2 support

* chore: reduce XLA compile times

* fix: better logging for memory profiling

* chore: perform allreduce in fp32 for marlin, better logging

* fix: add nemotron to PP_SUPPORTED_MODELS

* fix: pass cutlass_fp8_supported correctly for fbgemm_fp8

* feat: add internvl support

* chore: tune fp8 kernels for ada lovelace cards

* fix: reduce unnecessary compute when logprobs=None

* fix: deprecation warnings in squeezellm quant_cuda_kernel

* chore: enable tpu tensor parallel in async engine

* fix: remove timm as a hardcoded requirement

* chore: make triton fully optional

* feat: add allowed_token_ids
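
As a sketch of the mechanism behind an `allowed_token_ids` option (not the engine's actual implementation): sampling is restricted by masking every other logit to -inf before the next token is chosen.

```python
import torch


def mask_disallowed(logits: torch.Tensor,
                    allowed_token_ids: list[int]) -> torch.Tensor:
    """Keep only the allowed vocabulary entries; everything else gets -inf."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0
    return logits + mask


logits = torch.randn(1, 32_000)  # pretend vocab of 32k tokens
constrained = mask_disallowed(logits, allowed_token_ids=[1, 2, 3])
next_token = constrained.argmax(dim=-1)  # can only ever be 1, 2, or 3
```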

* fix: unused variables in awq gemm kernel

* fix: divide-by-zero warnings in marlin kernels

* chore: tune int8 kernels for ada lovelace

* fix: greedy decoding in TPU

* fix: paligemma mmp

* fix: seeded gens with pipeline parallel

* fix: compiler warnings for _C and _moe

* chore: bump openvino toolkit to pre-release

* fix: remove scaled_fp8_quant_kernel padding footgun

* fix: massively improve throughput with high number of prompts

* fix: remove artifact

* chore: add punica sizes for mistral nemo

* fix: conditionally import outlines.caching

* fix: wrap all outlines imports
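
The pattern in question, sketched generically: keep `outlines` an optional dependency by guarding its import and only raising when guided decoding is actually requested.

```python
# Guarded import: `outlines` is an optional dependency, so guided decoding
# should degrade gracefully instead of breaking engine import entirely.
try:
    import outlines  # noqa: F401
    HAS_OUTLINES = True
except ImportError:
    HAS_OUTLINES = False


def require_outlines() -> None:
    """Raise a helpful error only when guided decoding is actually used."""
    if not HAS_OUTLINES:
        raise ImportError(
            "Guided decoding requires the optional `outlines` package; "
            "install it with `pip install outlines`.")
```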

* chore: sort args (#608)

* chore: sort args

* Update aphrodite/modeling/guided_decoding/outlines_logits_processors.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* Update aphrodite/engine/args_tools.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* Update aphrodite/engine/args_tools.py

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* fix: Device Options category

* fix: Load Options category with load_format, dtype and ignore_patterns

* fix: API Options -> Inference Options with seed, served_model_name moved to model options

* fix: categorize based off of config.py and wrong category names

* fix: move `model` arg to the top

* fix: add missing `max_seq_lens_to_capture` arg

* chore: sort the device arg

* chore: move `model` arg to the top of the argparser

* chore: remove old comment

---------

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: AlpinDale <alpindale@gmail.com>

* fix: formatting

* feat: add yaml config parsing (#610)

* feat: add yaml config parsing

* fix: prompt adapters
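
A minimal sketch of the idea behind YAML config parsing (hypothetical, using PyYAML; key names and mapping rules are illustrative, not Aphrodite's actual schema): a flat YAML file is expanded into argv-style flags before the engine's argument parser runs.

```python
import yaml  # PyYAML


def yaml_to_cli_args(path: str) -> list[str]:
    """Expand a flat YAML mapping into `--key value` style CLI arguments."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    args: list[str] = []
    for key, value in config.items():
        flag = f"--{str(key).replace('_', '-')}"
        if isinstance(value, bool):
            if value:            # booleans map to bare store_true flags
                args.append(flag)
        else:
            args.append(flag)
            args.append(str(value))
    return args


# Usage idea: splice the expanded options into argv before argparse runs,
# e.g. sys.argv[1:1] = yaml_to_cli_args("server.yaml")
```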

* chore: add isort and refactor formatting script and utils

* fix: broadcasting logic for multi_modal_kwargs

* fix: set readonly=True for non-root TPU devices

* fix: logit processor exceeding vocab size

* feat: support for QQQ W4A8 quantization (#612)

* feat: add qqq marlin kernels

* integrate qqq quant

* fix: cleanup minicpm-v and port na_vit model

* fix: feature size calculation for Llava-next

* feat: use FusedMoE for jamba

* feat: allow loading specific layer numbers per device

* fix: fp8 marlin and cpu offloading with fp8 marlin

* chore: enable fp8 cutlass for ada lovelace

* chore: tune cutlass int8 kernels for sm_75

* feat: Triton Kernels for Punica (#613)

* feat: replace CUDA kernels w/ triton for lora

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* fix: __init__ in ops

* fix: replicated linear layer support for LoRA

* cleanup and relax the conditions for vocab_size and rank

* fix: add consolidated* to ignore_patterns

---------

Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* chore: add pipeline parallel support for Qwen

* fix: don't use torch.Generator() for TPU

* fix: skip loading lm_head for tie_word_embeddings models

* chore: optimize PP comm by replacing send with partial send + allgather

* fix: set a default max_tokens for OAI requests
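
Roughly, such a fix derives the default from the remaining context window when the client omits `max_tokens` (a sketch, not the exact server code):

```python
from typing import Optional


def default_max_tokens(max_model_len: int, prompt_len: int,
                       requested: Optional[int]) -> int:
    """When the client omits max_tokens, allow generation up to whatever is
    left of the context window after the prompt."""
    if requested is not None:
        return requested
    return max(1, max_model_len - prompt_len)


assert default_max_tokens(4096, 1000, None) == 3096
assert default_max_tokens(4096, 1000, 256) == 256
```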

* chore: optimize scheduler and remove policy

* chore: bump torch to 2.4.0

* bump to torch 2.4.0, add aphrodite_flash_attn (#614)

* fix: RMSNorm forward in InternViT attention qk_layernorm

* fix: lower gemma's unloaded_params exception to warning

* feat: support logits soft capping with flash attention backend

* chore: optimize get_seqs

* fix: input shape for flashinfer prefill wrapper

* fix: remove error_on_invalid_device_count_status

* fix: remove unused code in sampler

* build: update torch to 2.4.0 for cpu

* kernels: disambiguate quantized types via a new ScalarType

Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

* chore: pipeline parallel with Ray accelerated dag

* fix: use loopback address for single node again

* chore: add env var to enable torch.compile

* revert: incorrect nightly build

* feat: add RPC server and client via ZMQ (#615)

* feat: add RPC server and client

Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* add async engine protocols

* add new methods to sync and async engine

* producer utils

* migrate serving engine, embedding and tokenization to rpc

* migrate text completions

* migrate chat completions

* migrate logits processors api

* migrate api server

* forgot the arg

* minor naming issues

---------

Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* refactor: factor out chat message parsing

* refactor: add has_prefix_cache_hit flag to FlashAttentionMetadataBuilder

* feat: add guided decoding to LLM

* chore: simplify output processing with shortcut for non-parallel sampling and non-beam search use case (#616)

* refactor: minicpmv and port Idefix2VisionTransformer

* refactor: factor out code for running uvicorn again

* feat: port SiglipVisionModel from transformers

* chore: add proper logging for spec decoding verification

* fix: support flashinfer for draft model runner

* fix: use ipv4 localhost form for zmq bind

* fix: use args.trust_remote_code

* chore: update cutlass to 3.5.1

* fix: specify device when loading lora and embedding tensors

* feat: non-blocking transfer in prepare_input

* feat: re-add GGUF (#600)

* refactor gguf kernels

* fix: incorrect filename for vecdotq header

* finish up the re-impl

* add requirements

* minor CI fixes

* ci: a few more ignores

* ci: remove clang-format

* ci: take one of fixing lint issues

* ci: codespell fixes

* ci: remove yapf

* ci: remove yapf from the formatting script

* ci: remove isort

* chore: minor cleanups

* fix: allow loading GGUF model without .gguf extension

* fix: cpu offloading with gptq

* docs: finalize User & Developer Documentation for Release Candidate (#618)

* add getting started page

* add debugging tips

* add openai docs

* add distributed guide

* add production metrics and model support matrix

* add guide for adding new models

* huge update

* add vlm usage docs

* ci: add action for deploying docs

* docs: fix typos

* chore: refactor wheel build script

* bump version to 0.6.0

---------

Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Ahmed <mail@ahme.dev>
Co-authored-by: ewof <elwolf6@protonmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>