[Misc] GPTQ Activation Ordering #8135

Merged
merged 11 commits into from
Sep 9, 2024

Contributor

@kylesayrs kylesayrs commented Sep 3, 2024

Activation Ordering

[Figure: activation-ordering-diagram]

Changes

  • Update QuantizationArgs to support actorder argument
  • Add weight_g_idx parameter loader which defaults to all -1s
    • If weight_g_idx is loaded with valid values, then the parameter is passed to the kernel
    • If weight_g_idx is not loaded, then no column reordering is performed
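
The loader behavior listed above can be sketched roughly as follows. This is a minimal illustration in plain Python, not vLLM's loader code: `make_default_g_idx` and `g_idx_for_kernel` are hypothetical names, and a plain list stands in for the parameter tensor.

```python
from typing import List, Optional


def make_default_g_idx(in_features: int) -> List[int]:
    """Placeholder group index: all -1s until a checkpoint overwrites it."""
    return [-1] * in_features


def g_idx_for_kernel(g_idx: List[int]) -> Optional[List[int]]:
    """Return g_idx if it was loaded with valid values, else None.

    In this sketch, None means "perform no column reordering".
    """
    if all(value == -1 for value in g_idx):
        return None
    return g_idx


default = make_default_g_idx(8)
loaded = [0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical values read from a checkpoint
```

Using -1 as a sentinel lets a single parameter cover both checkpoints that ship a group index and checkpoints that do not.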

Testing

Inference Script

infer_actorder.py
from vllm import LLM
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-actorder-group")
llm.generate("The future of AI is")

Actorder=Group Evaluation

|Meta-Llama-3.1-8B-Instruct-quantized.w4a16|Ours |Stderr |
|-----------------------------------------:|----:|------:|
|                                     78.57|78.87|± 00.41|
|                                     50.46|49.24|± 01.52|
|                                     76.48|76.00|± 01.20|

These results have discrepancies unrelated to activation ordering; more precise results will be posted at a later date.

Actorder=Weight Evaluation

Accuracy

Accuracy evaluations were performed using compressed Meta-Llama-3-8B-Instruct as a base model. For reference, Meta-Llama-3-8B-Instruct-quantized.w4a16 was compressed using AutoGPTQ desc_act=True and achieved 72.25% on GSM-8K (5-shot, strict-match).

The following models were quantized using llm-compressor with the same quantization configuration, but with 156 calibration samples instead of 256.

Group Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/llama3_actorder_group,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7240|±  |0.0123|
|     |       |strict-match    |     5|exact_match|↑  |0.7263|±  |0.0123|

Weight Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/llama3_actorder_weight,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7331|±  |0.0122|
|     |       |strict-match    |     5|exact_match|↑  |0.7354|±  |0.0122|
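
As a rough way to read the two strict-match results against their reported standard errors, the sketch below compares the accuracy gap to the standard error of the difference. It assumes the two evaluation runs can be treated as independent, which is a simplification since both use the same GSM-8K questions.

```python
import math

# strict-match exact_match values and standard errors from the tables above
group_acc, group_se = 0.7263, 0.0123
weight_acc, weight_se = 0.7354, 0.0122

diff = weight_acc - group_acc
# Standard error of the difference, ASSUMING independent runs
se_diff = math.sqrt(group_se**2 + weight_se**2)

print(f"difference: {diff:.4f}, stderr of difference: {se_diff:.4f}")
```

Under that assumption the gap between the two orderings is smaller than one standard error of the difference, so these numbers alone do not separate them.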

Latency

Group Activation Ordering

Avg latency: 2.0740114107728003 seconds
10% percentile latency: 2.0585451871156693 seconds
25% percentile latency: 2.0614405693486333 seconds
50% percentile latency: 2.0715408828109503 seconds
75% percentile latency: 2.0796915404498577 seconds
90% percentile latency: 2.0825428303331135 seconds
99% percentile latency: 2.153371973782778 seconds

Weight Activation Ordering

Avg latency: 2.0243701239426932 seconds
10% percentile latency: 2.0120500519871714 seconds
25% percentile latency: 2.01621649088338 seconds
50% percentile latency: 2.020678504370153 seconds
75% percentile latency: 2.026012707967311 seconds
90% percentile latency: 2.0294162426143885 seconds
99% percentile latency: 2.1007918303459885 seconds
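
For a quick comparison of the two average latencies, the arithmetic can be done as below. The numbers are copied from the results above; the variable names are only illustrative.

```python
group_avg = 2.0740114107728003   # seconds, group activation ordering
weight_avg = 2.0243701239426932  # seconds, weight activation ordering

abs_diff = group_avg - weight_avg
rel_diff = abs_diff / weight_avg

print(f"group is {abs_diff * 1000:.1f} ms slower ({rel_diff:.2%})")
```

On these runs, group ordering comes out roughly 50 ms slower on average, or about 2.5% relative to weight ordering.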

github-actions bot commented Sep 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@kylesayrs kylesayrs changed the title GPTQ Activation Ordering [Misc] GPTQ Activation Ordering Sep 3, 2024
@kylesayrs kylesayrs marked this pull request as draft September 5, 2024 15:47
Contributor

@dsikka dsikka left a comment

LGTM. Two open questions + can you confirm the act order models worked fine for 2 and 4 gpus?

@kylesayrs kylesayrs marked this pull request as ready for review September 6, 2024 13:56
@robertgshaw2-neuralmagic
Collaborator

/ready

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 6, 2024
@kylesayrs
Contributor Author

Confirmed that group activation ordering works with tp=2,4

@LucasWilkinson
Contributor

I might be mistaken, but actorder="weight" feels like purely a change in the order in which llm-compressor quantizes weights, in which case why can't vLLM be oblivious to it? From vLLM's perspective, actorder="weight" and actorder=None/False appear identical, in the sense that activations are not reordered in either case.

So the naming feels a bit confusing, since actorder="weight" doesn't sound like "activation reordering" at all. Maybe we can simplify this by adding a quantorder key to the checkpoint (for record keeping) and just setting actorder=None/False? That way vLLM can ignore quantorder completely, and actorder is just a boolean indicating whether activations should be reordered (we could leave it as a str/enum for future-proofing if we want, but it would effectively be a boolean at this point in time).

I think we should try to create better separation of concerns, i.e. better separation between how the model was quantized and how the model needs to be run for inference. Otherwise it's just more complexity that kernel authors need to familiarize themselves with, only to realize (in this case) that it has no impact.

@kylesayrs
Contributor Author

I agree that there are benefits to separating "actorder" and "quantorder" as two orthogonal arguments, namely that vllm would only have to check for the "actorder=True" case.

I think one large downside to separating "actorder" from "quant order" is that we're essentially redefining "actorder" as "activation ordering groups". This means that an llm-compressor user might turn on "actorder", expecting it to also do quantization ordering like it does in GPTQ, then get a model that has additional latency but no performance gain. This is confusing for users, not only because they have to redefine the concept of actorder, but because there's a potential pitfall case now added.

There probably exist better names to better explain the ideas of "non-sequential grouping" and "quantization ordering", but folding these cases into the "actorder" argument allows us to leverage users' existing understanding of activation ordering.

@kylesayrs
Contributor Author

In a follow-up PR, we could keep actorder but add an additional argument, contiguous-groups. In the pydantic model we can validate that the contiguous-groups argument is True iff actorder="group". This way a compression user can use the nice actorder argument, but vllm only needs to pay attention to contiguous-groups.

@LucasWilkinson
Contributor

I think the goal here would be to make it so vLLM doesn't have to maintain an enum of actorders, so that users can experiment with orderings in llm-compressor without having to submit PRs to vLLM.

In offline conversations with @kylesayrs, it seems like a good alternative solution would be a non-contiguous-groups flag which, when present and true, indicates that a g_idx is present (since g_idx doesn't actually contain the activation order, just a mapping of in-channels to groups).

This way vLLM can completely ignore actorder, opening room for experimentation with things like the weight order (and potential variations of that), or folding activation permutations into the MLP layers [1, 2], without having to update vLLM. In this case actorder would only exist in the checkpoint as "record keeping" so people can remember what was used to create the checkpoint.

Ideally, though, we wouldn't have compressed-tensor checkpoints in the wild with actorder="group" but no non-contiguous-groups flag, since then we'd need extra logic in vLLM for backwards compatibility (which would be a shame for just a couple of models). @mgoin @kylesayrs do we know if there are any models like this in the wild?
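
To illustrate the point that g_idx is a mapping of in-channels to groups rather than the activation order itself, here is a small sketch. It assumes groups of `group_size` channels are formed along the permuted channel order; `perm` and all function names are hypothetical and not taken from vLLM or llm-compressor.

```python
from typing import List


def contiguous_g_idx(in_features: int, group_size: int) -> List[int]:
    """Without reordering, channel i belongs to group i // group_size."""
    return [i // group_size for i in range(in_features)]


def permuted_g_idx(perm: List[int], group_size: int) -> List[int]:
    """Groups are formed over channels taken in `perm` order.

    perm[k] is the original index of the channel processed k-th, so that
    channel lands in group k // group_size.
    """
    g_idx = [0] * len(perm)
    for k, channel in enumerate(perm):
        g_idx[channel] = k // group_size
    return g_idx


def is_contiguous(g_idx: List[int]) -> bool:
    """True when group ids never decrease along the input channels."""
    return all(a <= b for a, b in zip(g_idx, g_idx[1:]))


perm = [3, 0, 2, 1]  # hypothetical activation-based ordering
print(contiguous_g_idx(4, 2))   # [0, 0, 1, 1]
print(permuted_g_idx(perm, 2))  # [0, 1, 1, 0]
```

In this sketch a different permutation such as `[0, 3, 1, 2]` yields the same g_idx, which is why the mapping alone does not recover the ordering, and why a flag based on `is_contiguous` would not need to know which ordering produced it.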

@kylesayrs
Contributor Author

@LucasWilkinson A decision has been made that we're not going to support the more generalized checkpoint config in this PR, and will instead continue using a list that specifies which activation orderings come with non-contiguous groupings.

I would characterize the problem as a difference in requirements between recipe configs and checkpoint configs, as well as a problem of coupling between llm-compressor and vllm-CT. While it is a problem that will likely come up again, it's considered out of scope for this PR.

There are some options to make the checkpoint config more extensible/decoupled from llm-compressor in the future. The two that I have proposed are (1) separating the recipe/compression config from the checkpoint config or (2) adding an ActivationOrdering.NON_CONTIGUOUS option to the enum, which would help to generalize the config and allow compressors to be more vague while still giving enough information for correct inference.

Option (1) would likely involve some legacy support for backwards compatibility. Option (2) would be backwards compatible. Option (3) is to maintain a coupling between llm-compressor and vllm-CT. Hopefully these options are palatable enough to defer for a separate PR.
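
For illustration, the list-based approach and option (2) might look roughly like the sketch below. Only `ActivationOrdering.GROUP` and the proposed `ActivationOrdering.NON_CONTIGUOUS` are named in this thread; the string values, the `WEIGHT` member, the set name, and `needs_g_idx` are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class ActivationOrdering(str, Enum):
    GROUP = "group"
    WEIGHT = "weight"
    # Proposed in option (2); the string value here is an assumption.
    NON_CONTIGUOUS = "non_contiguous"


# Explicit list of orderings that come with non-contiguous groupings.
NON_CONTIGUOUS_ORDERINGS = {
    ActivationOrdering.GROUP,
    ActivationOrdering.NON_CONTIGUOUS,
}


def needs_g_idx(actorder: Optional[str]) -> bool:
    """Decide whether inference must load and apply a group index."""
    if not actorder:  # covers None and False
        return False
    # Raises ValueError for an ordering this enum does not know about,
    # which is the coupling discussed above.
    return ActivationOrdering(actorder) in NON_CONTIGUOUS_ORDERINGS
```

With a generic member like `NON_CONTIGUOUS`, a compressor could introduce a new ordering and still produce a checkpoint that this check accepts, as long as it writes the generic value.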

@LucasWilkinson
Contributor

LucasWilkinson commented Sep 9, 2024

Cool, yeah, I just wanted to make my voice heard: I think we should be striving for separation of concerns in our software design, especially across repos, to avoid versioning headaches in the future (and having to coordinate PRs across many repos).

I'm not a huge fan of option 2, as it just continues to conflate activation ordering with group assignment (i.e. mixing something that is fairly GPTQ-specific with something that could be viewed as a more general group quantization thing).

Option 1 seems like what we should be striving for, as it will give us more flexibility going forward, even outside of this specific actorder case. I'm not sure what you mean by "checkpoint config", but I assume you mean the fields validated by compressed tensors. I would imagine that for option 1 you would still want to save recipe information for accuracy debugging and general record keeping, but stored in a separate structure or under keys that vLLM and compressed tensors can just ignore / pass through.

@@ -232,7 +232,8 @@ def _get_scheme_from_parts(
             return CompressedTensorsWNA16(
                 num_bits=weight_quant.num_bits,
                 strategy=weight_quant.strategy,
-                group_size=weight_quant.group_size)
+                group_size=weight_quant.group_size,
+                actorder=weight_quant.actorder)
Contributor

could we just add the condition here?
actorder=weight_quant.actorder == ActivationOrdering.GROUP

Contributor Author

We could, but it feels more logical to me to keep argument processing within the CompressedTensorsWNA16.__init__ function. This separates responsibilities and makes clear that the job of _get_scheme_from_parts is to decide which compression scheme applies, not to process the arguments once the scheme is decided.

Contributor Author

We'd also have to rename the actorder argument of CompressedTensorsWNA16.__init__, otherwise it would be a misnomer
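
The split of responsibilities described in this review thread could be sketched as follows. `WNA16SchemeSketch` and `get_scheme` are hypothetical stand-ins, not the real `CompressedTensorsWNA16` or `_get_scheme_from_parts`, and the `has_g_idx` attribute is an assumed name.

```python
from typing import Optional


class WNA16SchemeSketch:
    """Hypothetical stand-in for CompressedTensorsWNA16; not the real class."""

    def __init__(self, num_bits: int, group_size: int,
                 actorder: Optional[str] = None):
        self.num_bits = num_bits
        self.group_size = group_size
        # Argument processing lives in the constructor: the raw actorder
        # value is interpreted here, not by the caller.
        self.has_g_idx = actorder == "group"


def get_scheme(weight_quant: dict) -> WNA16SchemeSketch:
    """Only decides which scheme applies and forwards arguments unchanged."""
    return WNA16SchemeSketch(
        num_bits=weight_quant["num_bits"],
        group_size=weight_quant["group_size"],
        actorder=weight_quant.get("actorder"),
    )
```

Passing `actorder=weight_quant.actorder == ActivationOrdering.GROUP` from the caller instead would move that comparison out of the constructor, which is why the parameter would then need a different name.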

@kylesayrs
Contributor Author

@LucasWilkinson A recipe config means fields that are important to compression algorithms; a checkpoint config means fields relevant to vllm and serving. I agree that checkpoint configs should have some info about how they were compressed, but those fields wouldn't be required.

Collaborator

@mgoin mgoin left a comment

Thanks for the careful work and discussion! This is good to land for now

@mgoin mgoin merged commit c7cb5c3 into vllm-project:main Sep 9, 2024
50 checks passed

commit fdd9daa
Author: Mor Zusman <mor.zusmann@gmail.com>
Date:   Thu Aug 29 01:06:52 2024 +0300

    [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (vllm-project#7651)

commit 8c56e57
Author: Stas Bekman <stas00@users.noreply.github.com>
Date:   Wed Aug 28 13:54:23 2024 -0700

    [Doc] fix 404 link (vllm-project#7966)

commit eeffde1
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date:   Wed Aug 28 13:10:21 2024 -0700

    [TPU] Upgrade PyTorch XLA nightly (vllm-project#7967)

commit e5697d1
Author: rasmith <Randall.Smith@amd.com>
Date:   Wed Aug 28 14:37:47 2024 -0500

    [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (vllm-project#7386)

commit b98cc28
Author: Pavani Majety <pavanimajety@gmail.com>
Date:   Wed Aug 28 10:01:22 2024 -0700

    [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (vllm-project#7798)

    Co-authored-by: Simon Mo <simon.mo@hey.com>

commit ef9baee
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Wed Aug 28 23:11:18 2024 +0800

    [Bugfix][VLM] Fix incompatibility between vllm-project#7902 and vllm-project#7230 (vllm-project#7948)

commit 98c12cf
Author: Stas Bekman <stas00@users.noreply.github.com>
Date:   Wed Aug 28 05:12:32 2024 -0700

    [Doc] fix the autoAWQ example (vllm-project#7937)

commit f52a43a
Author: youkaichao <youkaichao@gmail.com>
Date:   Wed Aug 28 01:27:07 2024 -0700

    [ci][test] fix pp test failure (vllm-project#7945)

commit e358053
Author: Cody Yu <hao.yu.cody@gmail.com>
Date:   Wed Aug 28 00:36:31 2024 -0700

    [Performance] Enable chunked prefill and prefix caching together (vllm-project#7753)
@kylesayrs kylesayrs deleted the kylesayrs/activation-ordering branch September 12, 2024 03:36
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024
Jeffwan pushed a commit to aibrix/vllm that referenced this pull request Sep 19, 2024
MengqingCao pushed a commit to MengqingCao/vllm that referenced this pull request Sep 30, 2024
siddharth9820 pushed a commit to axonn-ai/vllm that referenced this pull request Sep 30, 2024
MengqingCao added a commit to MengqingCao/vllm that referenced this pull request Oct 10, 2024
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (vllm-project#8272)

[Frontend] Add progress reporting to run_batch.py (vllm-project#8060)

Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io>

[Bugfix] Correct adapter usage for cohere and jamba (vllm-project#8292)

[Misc] GPTQ Activation Ordering (vllm-project#8135)

[Misc] Fused MoE Marlin support for GPTQ (vllm-project#8217)

Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (vllm-project#8319)

[Bugfix] Fix missing `post_layernorm` in CLIP (vllm-project#8155)

[CI/Build] enable ccache/scccache for HIP builds (vllm-project#8327)

[Frontend] Clean up type annotations for mistral tokenizer (vllm-project#8314)

[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail (vllm-project#8130)

Fix ppc64le buildkite job (vllm-project#8309)

[Spec Decode] Move ops.advance_step to flash attn advance_step (vllm-project#8224)

[Misc] remove peft as dependency for prompt models (vllm-project#8162)

[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (vllm-project#8342)

[Bugfix] lookahead block table with cuda graph max capture (vllm-project#8340)

[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (vllm-project#8340)

[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (vllm-project#8172)

[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (vllm-project#8043)

[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (vllm-project#8329)

[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (vllm-project#8299)

[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (vllm-project#6112)

[model] Support for Llava-Next-Video model (vllm-project#7559)

Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (vllm-project#8347)

[Model][VLM] Add Qwen2-VL model support (vllm-project#7905)

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (vllm-project#7257)

[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (vllm-project#8373)

[Bugfix] Add missing attributes in mistral tokenizer (vllm-project#8364)

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

Co-authored-by: Sage Moore <sage@neuralmagic.com>

[Misc] Move device options to a single place (vllm-project#8322)

[Speculative Decoding] Test refactor (vllm-project#8317)

Co-authored-by: youkaichao <youkaichao@126.com>

Pixtral (vllm-project#8377)

Co-authored-by: Roger Wang <ywang@roblox.com>

Bump version to v0.6.1 (vllm-project#8379)

[MISC] Dump model runner inputs when crashing (vllm-project#8305)

[misc] remove engine_use_ray (vllm-project#8126)

[TPU] Use Ray for default distributed backend (vllm-project#8389)

Fix the AMD weight loading tests (vllm-project#8390)

[Bugfix]: Fix the logic for deciding if tool parsing is used (vllm-project#8366)

[Gemma2] add bitsandbytes support for Gemma2 (vllm-project#8338)

[Misc] Raise error when using encoder/decoder model with cpu backend (vllm-project#8355)

[Misc] Use RoPE cache for MRoPE (vllm-project#8396)

[torch.compile] hide slicing under custom op for inductor (vllm-project#8384)

[Hotfix][VLM] Fixing max position embeddings for Pixtral (vllm-project#8399)

[Bugfix] Fix InternVL2 inference with various num_patches (vllm-project#8375)

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Model] Support multiple images for qwen-vl (vllm-project#8247)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (vllm-project#8403)

[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (vllm-project#8423)

[Bugfix] Offline mode fix (vllm-project#8376)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[multi-step] add flashinfer backend (vllm-project#7928)

[Core] Add engine option to return only deltas or final output (vllm-project#7381)

[Bugfix] multi-step + flashinfer: ensure cuda graph compatible  (vllm-project#8427)

[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (vllm-project#8425)

[CI/Build] Disable multi-node test for InternVL2 (vllm-project#8428)

[Hotfix][Pixtral] Fix multiple images bugs (vllm-project#8415)

[Bugfix] Fix weight loading issue by rename variable. (vllm-project#8293)

[Misc] Update Pixtral example (vllm-project#8431)

[BugFix] fix group_topk (vllm-project#8430)

[Core] Factor out input preprocessing to a separate class (vllm-project#7329)

[Bugfix] Mapping physical device indices for e2e test utils (vllm-project#8290)

[Bugfix] Bump fastapi and pydantic version (vllm-project#8435)

[CI/Build] Update pixtral tests to use JSON (vllm-project#8436)

[Bugfix] Fix async log stats (vllm-project#8417)

[bugfix] torch profiler bug for single gpu with GPUExecutor (vllm-project#8354)

bump version to v0.6.1.post1 (vllm-project#8440)

[CI/Build] Enable InternVL2 PP test only on single node (vllm-project#8437)

[doc] recommend pip instead of conda (vllm-project#8446)

[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (vllm-project#8442)

[misc][ci] fix quant test (vllm-project#8449)

[Installation] Gate FastAPI version for Python 3.8 (vllm-project#8456)

[plugin][torch.compile] allow to add custom compile backend (vllm-project#8445)

[CI/Build] Reorganize models tests (vllm-project#7820)

[Doc] Add oneDNN installation to CPU backend documentation (vllm-project#8467)

[HotFix] Fix final output truncation with stop string + streaming (vllm-project#8468)

bump version to v0.6.1.post2 (vllm-project#8473)

[Hardware][intel GPU] bump up ipex version to 2.3 (vllm-project#8365)

Co-authored-by: Yan Ma <yan.ma@intel.com>

[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (vllm-project#8310)

[Model] support minicpm3 (vllm-project#8297)

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[torch.compile] fix functionalization (vllm-project#8480)

[torch.compile] add a flag to disable custom op (vllm-project#8488)

[TPU] Implement multi-step scheduling (vllm-project#8489)

[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (vllm-project#8490)

[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (vllm-project#8357)

[Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032)

Co-authored-by: Dipika <dipikasikka1@gmail.com>

[Frontend] Expose revision arg in OpenAI server (vllm-project#8501)

[BugFix] Fix clean shutdown issues (vllm-project#8492)

[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)

[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)

[doc] update doc on testing and debugging (vllm-project#8514)

[Bugfix] Bind api server port before starting engine (vllm-project#8491)

[perf bench] set timeout to debug hanging (vllm-project#8516)

[misc] small qol fixes for release process (vllm-project#8517)

[Bugfix] Fix 3.12 builds on main (vllm-project#8510)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[refactor] remove triton based sampler (vllm-project#8524)

[Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)

[torch.compile] register allreduce operations as custom ops (vllm-project#8526)

[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

[Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)

[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)

[Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)

[Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)

[Bugfix] Fix TP > 1 for new granite (vllm-project#8544)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[doc] improve installation doc (vllm-project#8550)

Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>

[CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)

[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)

[CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)

[Misc] Add argument to disable FastAPI docs (vllm-project#8554)

[CI/Build] Avoid CUDA initialization (vllm-project#8534)

[CI/Build] Update Ruff version (vllm-project#8469)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

[Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)

[Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

[Model] Support Solar Model (vllm-project#8386)

Co-authored-by: Michael Goin <michael@neuralmagic.com>

[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380)

Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

[Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)

[BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)

[Bugfix] add `dead_error` property to engine client (vllm-project#8574)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573)

Co-authored-by: lwilkinson@neuralmagic.com

[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models.  (vllm-project#8545)

Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)

[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)

[MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)

[Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)

[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)

[Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)

[Doc] Add documentation for GGUF quantization (vllm-project#8618)

Create SECURITY.md (vllm-project#8642)

[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)

[Misc] guard against change in cuda library name (vllm-project#8609)

[Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)

[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)

[Core] Support Lora lineage and base model metadata management (vllm-project#6315)

[Model] Add OLMoE (vllm-project#7922)

[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)

[Bugfix] Validate SamplingParam n is an int (vllm-project#8548)

[Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)

[Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)

[Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)

[Doc] neuron documentation update (vllm-project#8671)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

[Hardware][AWS] update neuron to 2.20 (vllm-project#8676)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

[Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)

[Core] Rename `PromptInputs` and `inputs`(vllm-project#8673)

[MISC] add support custom_op check (vllm-project#8557)

Co-authored-by: youkaichao <youkaichao@126.com>

[Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)

[beam search] add output for manually checking the correctness (vllm-project#8684)

[Kernel] Build flash-attn from source (vllm-project#8245)

[VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)

[Doc] Fix typo in AMD installation guide (vllm-project#8689)

[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)

[dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)

[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)

[Bugfix] Refactor composite weight loading logic (vllm-project#8656)

[ci][build] fix vllm-flash-attn (vllm-project#8699)

[Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)

[Misc] Use NamedTuple in Multi-image example (vllm-project#8705)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)

[Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486)

Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)

[build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)

[misc] upgrade mistral-common (vllm-project#8715)

[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)

[Bugfix] Fix CPU CMake build (vllm-project#8723)

Co-authored-by: Yuan <yuan.zhou@intel.com>

[Bugfix] fix docker build for xpu (vllm-project#8652)

[Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[Hardware][CPU] Refactor CPU model runner (vllm-project#8729)

[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)

[Model] Support pp for qwen2-vl (vllm-project#8696)

[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707)

[CI/Build] use setuptools-scm to set __version__ (vllm-project#4738)

Co-authored-by: youkaichao <youkaichao@126.com>

[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701)

Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

[Kernel][LoRA]  Add assertion for punica sgmv kernels (vllm-project#7585)

[Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)

Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)

[Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)

Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)

re-implement beam search on top of vllm core (vllm-project#8726)

Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>

Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)

[MISC] Skip dumping inputs when unpicklable (vllm-project#8744)

[Core][Model] Support loading weights by ID within models (vllm-project#7931)

[Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)

[Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661)

Co-authored-by: mgoin <michael@neuralmagic.com>

[Frontend] Batch inference for llm.chat() API  (vllm-project#8648)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

[Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)

[CI/Build] fix setuptools-scm usage (vllm-project#8771)

[misc] soft drop beam search (vllm-project#8763)

[Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)

[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047)

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

[Core] Adding Priority Scheduling (vllm-project#5958)

[Bugfix] Use heartbeats instead of health checks (vllm-project#8583)

Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)

[Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)

Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)

[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)

[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)

[Bugfix] load fc bias from config for eagle (vllm-project#8790)

[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (vllm-project#8672)

[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (vllm-project#8767)

Signed-off-by: darthhexx <darthhexx@gmail.com>

[Misc] Fix minor typo in scheduler (vllm-project#8765)

[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (vllm-project#8777)

[Kernel] Fullgraph and opcheck tests (vllm-project#8479)

[[Misc]] Add extra deps for openai server image (vllm-project#8792)

[VLM][Bugfix] internvl with num_scheduler_steps > 1 (vllm-project#8614)

rename PromptInputs and inputs with backward compatibility (vllm-project#8760)

[Frontend] MQLLMEngine supports profiling. (vllm-project#8761)

[Misc] Support FP8 MoE for compressed-tensors (vllm-project#8588)

Revert "rename PromptInputs and inputs with backward compatibility (vllm-project#8760) (vllm-project#8810)

[Model] Add support for the multi-modal Llama 3.2 model (vllm-project#8811)

Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

[Doc] Update doc for Transformers 4.45 (vllm-project#8817)

[Misc] Support quantization of MllamaForCausalLM (vllm-project#8822)

[Misc] Update config loading for Qwen2-VL and remove Granite (vllm-project#8837)

[Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814)

[Docs] Add README to the build docker image (vllm-project#8825)

[CI/Build] Fix missing ci dependencies (vllm-project#8834)

[misc][installation] build from source without compilation (vllm-project#8818)

[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872)

Signed-off-by: kevin <kevin@anyscale.com>

[Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861)

[Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820)

[Bugfix] Fix print_warning_once's line info (vllm-project#8867)

fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568)

[Bugfix] Fixup advance_step.cu warning (vllm-project#8815)

[BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829)

[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764)

[Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343)

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

[Core] rename`PromptInputs` and `inputs` (vllm-project#8876)

[misc] fix collect env (vllm-project#8894)

[MISC] Fix invalid escape sequence '\' (vllm-project#8830)

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

[Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892)

[TPU] Update pallas.py to support trillium (vllm-project#8871)

[torch.compile] use empty tensor instead of None for profiling (vllm-project#8875)

[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271)

[Bugfix] fix for deepseek w4a16 (vllm-project#8906)

Co-authored-by: mgoin <michael@neuralmagic.com>

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378)

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

[misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911)

[Core] Priority-based scheduling in async engine (vllm-project#8850)

[misc] fix wheel name (vllm-project#8919)

[Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824)

Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Co-authored-by: youkaichao <youkaichao@126.com>

[Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921)

[Bugfix] Fix code for downloading models from modelscope (vllm-project#8443)

[Bugfix] Fix PP for Multi-Step (vllm-project#8887)

[CI/Build] Update models tests & examples (vllm-project#8874)

Co-authored-by: Roger Wang <ywang@roblox.com>

[Frontend] Make beam search emulator temperature modifiable (vllm-project#8928)

Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr>

[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891)

[doc] organize installation doc and expose per-commit docker (vllm-project#8931)

[Core] Improve choice of Python multiprocessing method (vllm-project#8823)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: youkaichao <youkaichao@126.com>

[Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824)

[Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741)

Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

[CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925)

[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930)

[Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896)

[Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199)

[BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870)

Co-authored-by: Roger Wang <ywang@roblox.com>

[Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944)

[Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942)

[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)
Labels
ready ONLY add when PR is ready to merge/full CI is needed
5 participants