[Bugfix] Mamba cache Cuda Graph padding #6214

tomeras91 · 2024-07-08T13:49:36Z

The current Jamba implementation fails when Cuda Graph is used with a batch size that wasn't captured: a RuntimeError is raised because of incompatible tensor shapes. This happens because the mamba cache isn't padded to the Cuda Graph batch size.

This PR fixes the issue and adds a test that covers it.
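To illustrate the idea, here is a minimal, framework-free sketch of the padding logic. It is not the code from this PR: `CAPTURED_BATCH_SIZES`, `padded_batch_size`, `pad_rows`, and the list-based "cache" are hypothetical stand-ins for the real tensors. The point it shows is that in CG mode the cache batch size is taken from the (already padded) `input_ids`, not from the number of running sequences.

```python
from bisect import bisect_left

# Assumption: a hypothetical list of batch sizes for which graphs were captured.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16]


def padded_batch_size(num_seqs, captured=CAPTURED_BATCH_SIZES):
    """Return the smallest captured batch size that can hold num_seqs."""
    idx = bisect_left(captured, num_seqs)
    if idx == len(captured):
        raise ValueError(f"no captured batch size fits {num_seqs} sequences")
    return captured[idx]


def pad_rows(rows, target_len, pad_row):
    """Append copies of pad_row until rows has target_len entries."""
    missing = target_len - len(rows)
    if missing < 0:
        raise ValueError("more rows than the target batch size")
    return rows + [list(pad_row) for _ in range(missing)]


# Three running sequences, one token each (decode step).
num_seqs = 3
input_ids = [[11], [22], [33]]
mamba_cache = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # one state row per sequence

# In CG mode the inputs are padded to a captured batch size...
input_ids = pad_rows(input_ids, padded_batch_size(num_seqs), pad_row=[0])

# ...so the cache batch size is derived from the input_ids shape,
# not from the number of sequences.
cg_batch_size = len(input_ids)
mamba_cache = pad_rows(mamba_cache, cg_batch_size, pad_row=[0.0, 0.0])
```

With the cache sized by `num_seqs` instead, it would have 3 rows while the replayed graph expects 4, which is the kind of shape mismatch described above.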

tomeras91 added 3 commits July 8, 2024 15:50

bugfix: when working in CG mode, batch size should be by input_ids shape and not by number of sequences. This is so we pad the mamba cache to the captured CG batch sizes

b966736

Add relevant test

612df99

rename batch_size -> cg_batch_size for clarity

e0b49e4

simon-mo approved these changes Jul 8, 2024

simon-mo merged commit ddc369f into vllm-project:main Jul 8, 2024
68 of 70 checks passed

tlrmchlsmth added a commit to neuralmagic/nm-vllm that referenced this pull request Jul 16, 2024

apply fix from vllm-project#6214

fb846ce

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 17, 2024

[Bugfix] Mamba cache Cuda Graph padding (vllm-project#6214)

8e41e34

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024

[Bugfix] Mamba cache Cuda Graph padding (vllm-project#6214)

5053bc7

tomeras91 deleted the mamba-cg-cache-padding branch August 12, 2024 15:00
