
[Model] Jamba support #4115

Merged
merged 117 commits into from
Jul 2, 2024

Conversation

@mzusman (Contributor) commented Apr 16, 2024

Add Jamba support to vLLM.
This PR comprises three parts:

  1. The Jamba modeling file, which encapsulates the Jamba model weights and logic as well as the mamba cache management.
  2. Passing the request ids of the sequence groups, and their sequence ids, into the modeling file so that the cache can be managed.
  3. Passing the finished request ids into the modeling file as well, so that the allocated cache can be cleaned up for finished requests (see the sketch below).

FIX #3690
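
A minimal sketch of the request-id-keyed cache management described in parts 2 and 3, assuming per-request conv and SSM state tensors; the class and attribute names here (`MambaCacheSketch`, `conv_shape`, `ssm_shape`) are illustrative, not the PR's actual implementation:

```python
import torch


class MambaCacheSketch:
    """Illustrative per-request mamba state bookkeeping (not the PR's code)."""

    def __init__(self, conv_shape, ssm_shape, device="cpu"):
        self.conv_shape = conv_shape
        self.ssm_shape = ssm_shape
        self.device = device
        self._cache: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

    def get_or_create(self, request_id: str):
        # Part 2: the modeling file receives request ids so it can look up
        # (or allocate) the state belonging to each sequence group.
        if request_id not in self._cache:
            self._cache[request_id] = (
                torch.zeros(self.conv_shape, device=self.device),
                torch.zeros(self.ssm_shape, device=self.device),
            )
        return self._cache[request_id]

    def release_finished(self, finished_request_ids):
        # Part 3: finished request ids flow in so the allocated state is freed.
        for rid in finished_request_ids:
            self._cache.pop(rid, None)
```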

ErezSC42 and others added 28 commits April 16, 2024 10:13
BA-78554: Jurassic 2.5

* worked on the jurassic2.5 configuration file; updated the jurassic2_5 modeling file to support alternating experts/attn layers

* finished working the forward pass of jurassic3.py

* jurassic_3 modeling file works, using dummy weights initialized by the "dummy" flag. The tokenizer raises issues; for now, copying the mixtral tokenizer

* changed default tokenizer vocab values, loading of custom .pt weight files works.

* removed notebook

* merging master to jurassic-2.5 to reset head

* Merge branch 'master' into jurassic-2.5

* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is now a list of dicts in order to pass "n"

* clean up code from unnecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Take the mamba layer cache into account in the num-of-blocks calculation

* Delete mamba cache after profile

* Remove prints

* Cleaning

* Use - and not _ for requirements

Approved-by: Tomer Asida
* Remove assertion

* adapting jamba vllm to changes after hf release, working on weight loading in modeling file

* splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer

* weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights

* Add mamba from jamba modeling file

* Remove slow forward

* Modifications to mamba_mixer

* Save changes, WIP

* Fix cache placement

* Debugging

* Additions and logging

* Jamba with mamba cache handling

* Clean up

* Another cleanup

* Use vLLM's RMSNorm instead of JambaRMSNorm; their implementation uses a fused kernel

* Clean up and organization of the objects that handle the mamba cache

* Shorten the code for kv cache mem

* Move cache handling inside the Mixer

* Add mamba to the wheel requirements

* Add mamba to the requirements script

* Add mamba_metadata

* Add to __init__ __all__

* Revert 2 commits

ad1a3db 'Add mamba to the requirements script'
75ed2c8 'Add mamba to the wheel requirements'

* Clean up

* Naming

* Apply whitespace suggestions from code review

* pass tie_word_embeddings to PretrainedConfig init

* Replace repeat with expand, as expand doesn't require more memory (see the sketch after this commit block)

* Allocate a really small cache if needed; don't use meta

* Fix for expanded

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
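
On the repeat-vs-expand commit above: `expand` returns a stride-0 view over the broadcast dimension, while `repeat` materializes real copies. A quick generic PyTorch illustration (not the PR's code):

```python
import torch

x = torch.randn(1, 16)

# repeat materializes a copy: new storage for every repeated row.
y_repeat = x.repeat(8, 1)       # shape (8, 16), owns 8 * 16 elements

# expand returns a view with stride 0 on the broadcast dim: no new memory.
y_expand = x.expand(8, 16)      # shape (8, 16), shares x's storage

assert torch.equal(y_repeat, y_expand)
assert y_expand.stride(0) == 0  # the broadcast dimension costs nothing
```
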
* Drop indices when finished

* min 1 attention layer

* CG is working; forward pass is passing

* Remove comments

* cosmetics - rename indecies -> indices, organize some whitespaces

* Add some TODOs

* Adding mamba cache for cg

* Remove useless vars from input_metadata

* Remove unused import

* Set the seqlen offset to boolean

* Return only hidden state

* Return only hidden states

* Add padding to match the forward pass batch size (see the sketch after this commit block)

* Use is_prompt instead of seqlen offset

* Remove mamba cache class (not used)

* Another remove

* Remove

* Use mamba4gc

* Fix mamba forward; run update only on non-prompt

* Use 1 index after the maximal index

* Remove import

* Remove import

* typo

* typo

* place holder

* Padding and empty tokens take their slot from the first empty place

* reformat

* Apply suggestions from code review

Whitespaces

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
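
The CUDA-graph commits above ("Add padding to match the forward pass batch size", "Use 1 index after the maximal index") work around graphs being captured at fixed batch sizes. A minimal sketch of the padding idea, assuming a hypothetical list of captured sizes (illustrative, not vLLM's actual capture logic):

```python
import torch

CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]  # assumed capture sizes


def pad_to_captured_batch(hidden: torch.Tensor) -> torch.Tensor:
    """Pad a (batch, hidden) decode input up to the nearest captured size."""
    batch = hidden.shape[0]
    assert batch <= CAPTURED_BATCH_SIZES[-1], "batch exceeds largest capture"
    target = next(s for s in CAPTURED_BATCH_SIZES if s >= batch)
    if target == batch:
        return hidden
    pad = torch.zeros(target - batch, hidden.shape[1],
                      dtype=hidden.dtype, device=hidden.device)
    return torch.cat([hidden, pad], dim=0)
```
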
* Return support for other models apart from jamba

* Support n>1

* A little cleanup

* Rename

* Apply whitespace suggestions from code review

* Add max batch size to the main func

* Fixed attention kv cache bug

* Log, at debug level, where request ids are deleted from the dict

* Fix typo

* Align with v0.3.3 vllm code

* Remove comments

* Take out model config from CUDAGraph object

* Fix

* Fix typo

* Make the kv cache selection cleaner

* Another typo

* Took the num layers calc outside

* Remove the -1

* Set as num layer / period

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
* Return support for other models apart from jamba

* Support n>1

* Revert 2 commits

d054737 'Support n>1'
b5167cc 'Return support for other models apart from jamba'

* TP on input and output

* Basic TP impl, running, but correctness not yet working (see the sketch after this commit block)

* TP is working

* Roll back the verification that everything in the weights fits into the model

* Cleanup

* Use world size func

* clean up

* Import

* Apply whitespace suggestions from code review

* Organize imports

* Add comment on the unsqueeze in conv1d

* Organize and remove redundant code in forward pass

* Remove print

* Add comments

Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>

* White spaces

* Set as A

* better comment

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
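
The TP commits above partition the mixer's projection weights across ranks. A generic sketch of sharding a weight along its output dimension (column-parallel in Megatron terms), purely illustrative — vLLM's parallel linear layers also handle weight loading and the collective communication:

```python
import torch


def shard_out_features(weight: torch.Tensor, rank: int,
                       world_size: int) -> torch.Tensor:
    """Return this rank's shard of a (out_features, in_features) weight,
    split along out_features."""
    out_features = weight.shape[0]
    assert out_features % world_size == 0, "out_features must divide evenly"
    shard = out_features // world_size
    return weight[rank * shard:(rank + 1) * shard]


# Example: a (4096, 2048) projection weight split across 4 ranks.
w = torch.randn(4096, 2048)
shards = [shard_out_features(w, r, 4) for r in range(4)]
assert all(s.shape == (1024, 2048) for s in shards)
```
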
@robertgshaw2-redhat (Collaborator):

Cool!

@mzusman (Contributor, Author) commented Jul 2, 2024:

Tests failed due to timeouts reaching HF. Ready to be merged.

@zhuohan123 zhuohan123 enabled auto-merge (squash) July 2, 2024 22:22
@zhuohan123 zhuohan123 merged commit 9d6a8da into vllm-project:main Jul 2, 2024
70 checks passed
prashantgupta24 pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 3, 2024
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jul 7, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
```diff
@@ -855,7 +857,7 @@ def step(self) -> List[Union[RequestOutput, EmbeddingRequestOutput]]:
                 blocks_to_copy=scheduler_outputs.blocks_to_copy,
                 num_lookahead_slots=scheduler_outputs.num_lookahead_slots,
                 running_queue_size=scheduler_outputs.running_queue_size,
-            )
+                finished_requests_ids=finished_requests_ids)
```
Contributor:

@mzusman If scheduler_output.is_empty(), it seems the finished request ids would be forgotten (and never actually freed). I guess you may want to call get_and_reset_finished_requests_ids() inside the if statement?

@mzusman (Contributor, Author) replied Jul 9, 2024:

That's right 👍 Nice catch, I'll open a PR to fix it #6266

Contributor:

I have now realized that this code path doesn't seem to be hit, at least in common circumstances, due to https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L563, but it's probably good to fix anyway!
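
A standalone sketch of the bug and the suggested fix; `get_and_reset_finished_requests_ids` comes from the PR, but the Scheduler class and control flow here are simplified and illustrative, not the actual vLLM engine code:

```python
class Scheduler:
    def __init__(self) -> None:
        self._finished_requests_ids: list[str] = []

    def mark_finished(self, request_id: str) -> None:
        self._finished_requests_ids.append(request_id)

    def get_and_reset_finished_requests_ids(self) -> list[str]:
        # Consuming read: the internal list is cleared on every call.
        ids, self._finished_requests_ids = self._finished_requests_ids, []
        return ids


def run_model(finished_requests_ids: list[str]) -> None:
    print("freeing mamba cache for:", finished_requests_ids)


def step_buggy(scheduler: Scheduler, outputs_empty: bool) -> None:
    # Ids are consumed unconditionally...
    finished = scheduler.get_and_reset_finished_requests_ids()
    if not outputs_empty:
        run_model(finished)  # ...but only forwarded on this branch, so on
                             # an empty schedule they are silently dropped.


def step_fixed(scheduler: Scheduler, outputs_empty: bool) -> None:
    if not outputs_empty:
        # Consume the ids only when they will actually reach the model.
        run_model(scheduler.get_and_reset_finished_requests_ids())
```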

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Successfully merging this pull request may close these issues: [New Model]: Jamba (MoE Mamba from AI21) (#3690)