Commit
* chore: skip the driver worker
* chore: bump lmfe version to 0.10.3
* chore: some more marlin cleanups
* chore: deprecation warning for beam search
* feat: support FP8 for DeepSeekV2 MoE
* feat: add fuyu vision model and persimmon language model support
* fix: turn off cutlass scaled_mm for ada lovelace cards
* chore: allow quantizing all layers of deepseek-v2
* fix: build with the Python limited API in the Dockerfile
* OpenAI API Refactor (#591)
* feat: massive api server refactoring
* fix: tokenizer endpoint issues
* fix: BatchResponseData body should be optional
* chore: simplify pipeline parallel code in llama
* fix: convert image to RGB by default
* fix: allow getting the chat template from a URL
* chore: avoid loading the unused layers and init the VLM up to the required feature space
* chore: enable bias w/ FP8 layers in CUTLASS kernels
* chore: upgrade flashinfer to 0.0.9
* feat: add custom triton cache manager
* chore: add CustomOp interface to UnquantizedFusedMoEMethod
* chore: handle aborted requests for jamba
* fix: minor fix for prompt adapter config
* feat: chat completions tokenization endpoint (#592)
* feat: optimize throughput by 1.4x using numpy for token padding
* feat: MoE support with Pallas GMM kernel for TPUs
* chore: log spec decoding metrics
* chore: separate kv_scale into k_scale and v_scale
* feat: Asymmetric Tensor Parallel (#594)
* add utils for getting the partition offset and size for current tp rank
* disable asymmetric TP for quants and lora, handle GQA allocation
* the actual splitting work in the linear layers
* padding size for the vocab/lm_head should be optional
* cache engine and spec decode model runner (kwargs only)
* pass the tp_rank to model runners
* llama support
* update dockerfile
* let's not build these for now
* Revert "update dockerfile". This reverts commit 6dd6408.
* fix: install wheel and packaging in docker
* fix: admin key arg
* let's try this again
* fix: mamba-ssm installation stuff
* chore: shutdown method for multiproc executor
* chore: log the message queue comms handle
* Port mamba kernels to Aphrodite (#595)
* kernels
* fix interface
* clean up dockerfile
* chore: set seed for dummy weights init
* fix: only create embeddings and lm_head when necessary for PP
* cleanup rocm dockerfile
* fix: some minor typing issues in spec decode
* fix: 4-node crash with PP
* chore: remove multimodal stuff from TPU
* fix: type annotation in worker
* refactor _prepare_model_input_tensor and attn metadata builder for most backends
* move prepare_inputs to the GPU (#596)
* add kernels Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
* sampler changes
* refactor the spec decode model runner and worker
---------
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
* update all benchmarks (#597)
* feat: add fp8 dynamic per-token quant kernel
* feat: pipeline parallel support for mixtral
* feat: add fp8 channel-wise weight quantization support
* feat: add asymmetric TP support for Qwen2
* feat: add CPU offloading support (#598)
* fix: avoid secondary error in ShmRingBuffer destructor
* feat: add SPMD worker execution using Ray accelerated DAG
* `enable_gpu_advance_step` -> `allow_gpu_advance_step`
* chore: use the LoRA tokenizer in OpenAI API (#599)
* fix: use paged attention for block swapping/copying in flashinfer
* chore: refactor TPU model runner and worker
* chore: improve min_capability checking for `compressed-tensors`
* chore: implement fallback for fp8 channelwise using torch._scaled_mm
* fix: allow using mp executor for pipeline parallel
* fix: make speculative decoding work with per-request seed
* feat: non-uniform quantization via `compressed-tensors` for llama
* fix: the metrics endpoint was not mounted
* fix: raise an error for no draft token case when draft_tp>1
* chore: pass bias to quant_method.apply
* some small performance improvements
* update time since last collection for AsyncMetricsCollector
* fix shared memory bug w/ multi-node
* chore: enable dynamic per-token `fp8`
* docker: install libibverbs by default
* add scale_ub inputs to fp8 dynamic per-token quant
* chore: allow specifying custom Executor
* fix: request abort crashing pipeline parallel
* feat: fbgemm quantization support (#601)
* feat: fbgemm support
* missed this one
* register the quant
* chore: minor AMD fixes
* feat: support fbgemm_fp8 quant on ampere
* fix: input_scale for w8a8 is optional
* fix: channel-wise fp8 marlin
* move `aphrodite.endpoints.openai.chat_utils` -> `aphrodite.endpoints.chat_utils`
* feat: disable logprob serialization to CPU for spec decode
* chore: refactor and decouple phi3v image embedding
* fix: asymmetric TP changes breaking the gptq and awq quants (#602)
* feat: AWQ marlin kernels (#603)
* refactor gptq_marlin kernels to add awq support
* integrate
* fix: short commit hash import error
* feat: initial text-to-text support for Chameleon model
* clean up requirements
* chore: add a wrapper for torch.inference_mode decorator
* fix: `vocab_size` field access in llava
* fix: f-string fixes
* docs: add doc site with example content
* docs: add installation guides
* chore: add contribution guidelines + Code of Conduct (#507)
* add utils; proper project structuring
* add contributing guidelines and CoC
* remove docker
* chore: add contribution guidelines + Code of Conduct (#507)
* add utils; proper project structuring
* add contributing guidelines and CoC
* Refactor prompt processing (#605)
* wip
* finish up the refactor
* fix: use int64_t for indices in fp8 kernels
* feat: support loading lora adapters directly from HF
* chore: modularize prepare input and attn metadata builder
* fix: fbgemm_fp8 when modules_to_not_convert=None
* chore: automatically enable chunked prefill if model has large seqlen
* chore: move some verbose logs to debug
* chore: further improve logging
* feat: support FP8 KV Cache scales from compressed-tensors
* feat: script for multi-node cluster setup
* chore: manage all http connections in one place
* feat: allow image inputs with chameleon models
* fix: support ignore patterns in model loader
* fix: beta value in gelu_tanh kernel being divided by 0.5
* fix: some naming issues
* chore: add ignored layers for fp8 quant
* chore: add usage data in each chunk for serving_chat
* bump transformers
* fix: cache spec decode metrics when they get collected
* feat: support loading pre-quantized bnb checkpoints
* fix: flashinfer cuda graph capture with pipeline parallel
* fix: token padding for chameleon
* fix: `ignore_patterns` -> `ignore_file_pattern` for modelscope
* fix: miscalculated latency leading to TTFT inaccuracy
* chore: split `run_server` into `build_server` and `run_server`
* chore: add fp8 support to `reshape_and_cache_flash`
* chore: tweaks to model/runner builder developer APIs
* chore: bump transformers
* fix: zmq hangs with large requests
* chore: represent tokens with identifiable strings
* feat: add support for MiniCPM-V
* fix: decode tokens w/ CUDA graphs with flashinfer
* fix: allow passing -q {gptq,awq}_marlin as the arg
* fix: encoding format for embedding example
* fix: add image placeholder for openai server for minicpmv
* feat: `fp8-marlin` channel-wise quant via `compressed-tensors`
* fix: `kv_cache_dtype=fp8` without scales for fp8 checkpoints
* fix: prevent possible data race by adding sync
* fix: nullptr channelwise scales when loading wNa16 models
* fix: define self.forward_dag before init_workers_ray
* fix: ReplicatedLinear weight loading
* fix: pass signal from the main thread
* chore: use array to speed up padding
* feat: add nemotron HF support (#606)
* fix: promote another index in fp8 kernel to int64_t
* chore: minor simplifications for Dockerfile.rocm
* chore: allow initializing TPU in initialize_ray_cluster
* feat: tensor parallelism for CPU backend
* chore: simplify squared relu
* fix: disable enforce_eager for bnb
* feat: support collective comms in XLA devices, e.g. TPUs
* chore: factor out the code for running uvicorn
* fix: illegal mem access for fp8 Llama 3.1 405B
* fix: torch nightly version for rocm dockerfile
* fix: do not enable chunked prefill and prefix caching for jamba
* feat: add support for head_size of 120
* feat: tensor parallelism for TPU with ray
* fix: torch.set_num_threads() in multiproc_gpu_executor
* chore: consolidate all vision examples into one file
* feat: add blip-2 support
* chore: reduce XLA compile times
* fix: better logging for memory profiling
* chore: perform allreduce in fp32 for marlin, better logging
* fix: add nemotron to PP_SUPPORTED_MODELS
* fix: pass cutlass_fp8_supported correctly for fbgemm_fp8
* feat: add internvl support
* chore: tune fp8 kernels for ada lovelace cards
* fix: reduce unnecessary compute when logprobs=None
* fix: deprecation warnings in squeezellm quant_cuda_kernel
* chore: enable tpu tensor parallel in async engine
* fix: remove timm as a hardcoded requirement
* chore: make triton fully optional
* feat: add allowed_token_ids
* fix: unused variables in awq gemm kernel
* fix: divide-by-zero warnings in marlin kernels
* chore: tune int8 kernels for ada lovelace
* fix: greedy decoding on TPU
* fix: paligemma mmp
* fix: seeded gens with pipeline parallel
* fix: compiler warnings for _C and _moe
* chore: bump openvino toolkit to pre-release
* fix: remove scaled_fp8_quant_kernel padding footgun
* fix: massively improve throughput with high number of prompts
* fix: remove artifact
* chore: add punica sizes for mistral nemo
* fix: conditionally import outlines.caching
* fix: wrap all outlines imports
* chore: sort args (#608)
* chore: sort args
* Update aphrodite/modeling/guided_decoding/outlines_logits_processors.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* Update aphrodite/engine/args_tools.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* Update aphrodite/engine/args_tools.py Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
* fix: Device Options category
* fix: Load Options category with load_format, dtype and ignore_patterns
* fix: API Options -> Inference Options with seed, served_model_name moved to model options
* fix: categorize based on config.py and fix wrong category names
* fix: move `model` arg to the top
* fix: add missing `max_seq_lens_to_capture` arg
* chore: sort the device arg
* chore: move `model` arg to the top of the argparser
* chore: remove old comment
---------
Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: AlpinDale <alpindale@gmail.com>
* fix: formatting
* feat: add yaml config parsing (#610)
* feat: add yaml config parsing
* fix: prompt adapters
* chore: add isort and refactor formatting script and utils
* fix: broadcasting logic for multi_modal_kwargs
* fix: set readonly=True for non-root TPU devices
* fix: logit processor exceeding vocab size
* feat: support for QQQ W4A8 quantization (#612)
* feat: add qqq marlin kernels
* integrate qqq quant
* fix: cleanup minicpm-v and port na_vit model
* fix: feature size calculation for Llava-next
* feat: use FusedMoE for jamba
* feat: allow loading specific layer numbers per device
* fix: fp8 marlin and cpu offloading with fp8 marlin
* chore: enable fp8 cutlass for ada lovelace
* chore: tune cutlass int8 kernels for sm_75
* feat: Triton Kernels for Punica (#613)
* feat: replace CUDA kernels w/ triton for lora Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
* fix: __init__ in ops
* fix: replicated linear layer support for LoRA
* cleanup and relax the conditions for vocab_size and rank
* fix: add consolidated* to ignore_patterns
---------
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
* chore: add pipeline parallel support for Qwen
* fix: don't use torch.generator() for TPU
* fix: skip loading lm_head for tie_word_embeddings models
* chore: optimize PP comm by replacing send with partial send + allgather
* fix: set a default max_tokens for OAI requests
* chore: optimize scheduler and remove policy
* chore: bump torch to 2.4.0
* bump to torch 2.4.0, add aphrodite_flash_attn (#614)
* fix: RMSNorm forward in InternViT attention qk_layernorm
* fix: lower gemma's unloaded_params exception to warning
* feat: support logits soft capping with flash attention backend
* chore: optimize get_seqs
* fix: input shape for flashinfer prefill wrapper
* fix: remove error_on_invalid_device_count_status
* fix: remove unused code in sampler
* build: update torch to 2.4.0 for cpu
* kernels: disambiguate quantized types via a new ScalarType Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
* chore: pipeline parallel with Ray accelerated DAG
* fix: use loopback address for single node again
* chore: add env var to enable torch.compile
* revert: incorrect nightly build
* feat: add RPC server and client via ZMQ (#615)
* feat: add RPC server and client Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Joe Runde <joe@joerun.de> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
* add async engine protocols
* add new methods to sync and async engine
* producer utils
* migrate serving engine, embedding and tokenization to rpc
* migrate text completions
* migrate chat completions
* migrate logits processors api
* migrate api server
* forgot the arg
* minor naming issues
---------
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
* refactor: factor out chat message parsing
* refactor: add has_prefix_cache_hit flag to FlashAttentionMetadataBuilder
* feat: add guided decoding to LLM
* chore: simplify output processing with shortcut for non-parallel sampling and non-beam search use case (#616)
* refactor: minicpmv and port Idefics2VisionTransformer
* refactor: factor out code for running uvicorn again
* feat: port SiglipVisionModel from transformers
* chore: add proper logging for spec decoding verification
* fix: support flashinfer for draft model runner
* fix: use ipv4 localhost form for zmq bind
* fix: use args.trust_remote_code
* chore: update cutlass to 3.5.1
* fix: specify device when loading lora and embedding tensors
* feat: non-blocking transfer in prepare_input
* feat: re-add GGUF (#600)
* refactor gguf kernels
* fix: incorrect filename for vecdotq header
* finish up the re-impl
* add requirements
* minor CI fixes
* ci: a few more ignores
* ci: remove clang-format
* ci: take one of fixing lint issues
* ci: codespell fixes
* ci: remove yapf
* ci: remove yapf from the formatting script
* ci: remove isort
* chore: minor cleanups
* fix: allow loading GGUF model without .gguf extension
* fix: cpu offloading with gptq
* docs: finalize User & Developer Documentation for Release Candidate (#618)
* add getting started page
* add debugging tips
* add openai docs
* add distributed guide
* add production metrics and model support matrix
* add guide for adding new models
* huge update
* add vlm usage docs
* ci: add action for deploying docs
* docs: fix typos
* chore: refactor wheel build script
* bump version to 0.6.0
---------
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Ahmed <mail@ahme.dev>
Co-authored-by: ewof <elwolf6@protonmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
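
One of the more user-visible items in the list above is CPU offloading support (#598). The snippet below is only an illustrative sketch, not code from the repository: it assumes the offline `LLM` entry point exported by the `aphrodite` package and a `cpu_offload_gb` keyword argument mirroring the upstream vLLM convention; the argument name and the model name are assumptions.

```python
# Illustrative sketch only. Assumes `aphrodite` exports LLM/SamplingParams
# and that the CPU offloading feature (#598) is exposed as `cpu_offload_gb`
# (argument name is an assumption, not confirmed by this changelog).
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,  # keep ~4 GiB of weights in CPU memory instead of VRAM
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```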
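Several entries also touch vision support in the OpenAI-compatible server (e.g. the MiniCPM-V image placeholder fix). Below is a hedged client-side sketch using the standard OpenAI chat format with an `image_url` content part; the host, port, API key, served model name, and image URL are placeholders, not values documented in this log.

```python
# Illustrative client sketch against an OpenAI-compatible server.
# All connection details and the model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```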