[Model] Bert Embedding Model #5447

laishzh · 2024-06-12T09:56:41Z

Implement Bert Embedding Model

This PR implements the BERT embedding model discussed in #5179.

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code needs to be well-documented so that future contributors can easily understand it.
Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

robertgshaw2-neuralmagic · 2024-06-12T19:04:17Z

vllm/model_executor/models/bert_embedding.py

+    ) -> Optional[PoolerOutput]:
+        return self._pooler(hidden_states, pooling_metadata)
+
+    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):


Can you move this function down to the bottom of the file?

OK, I've left a TODO here. I will move this class to the bottom later.

robertgshaw2-neuralmagic · 2024-06-12T19:09:23Z

vllm/model_executor/models/bert_embedding.py

+        super().__init__()
+        self.size = config.hidden_size
+
+        self.word_embeddings = nn.Embedding(config.vocab_size,


Could you look into using VocabParallelEmbedding from our parallel layers?

Using nn.Embedding will not work with tensor parallelism.

OK, I noticed this feature before. Already updated.
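
To illustrate the concern raised in this thread, here is a minimal pure-Python sketch of the vocabulary-parallel idea: each rank holds a contiguous slice of the vocabulary, returns zeros for token ids outside its slice, and the partial results are summed (a stand-in for an all-reduce). The names `shard_table`, `lookup_on_rank`, and `all_reduce_sum` are hypothetical and are not vLLM's API; the actual `VocabParallelEmbedding` layer may differ in detail (for example, in how it pads the vocabulary).

```python
from typing import List

Vector = List[float]


def shard_table(table: List[Vector], world_size: int) -> List[List[Vector]]:
    """Split the embedding table into contiguous vocab slices, one per rank.

    Assumes len(table) is divisible by world_size (an illustrative simplification).
    """
    per_rank = len(table) // world_size
    return [table[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def lookup_on_rank(shard: List[Vector], rank: int,
                   token_ids: List[int]) -> List[Vector]:
    """Look up tokens on one rank; ids outside this rank's slice yield zeros."""
    per_rank = len(shard)
    dim = len(shard[0])
    start = rank * per_rank
    out = []
    for tok in token_ids:
        if start <= tok < start + per_rank:
            out.append(list(shard[tok - start]))
        else:
            out.append([0.0] * dim)
    return out


def all_reduce_sum(partials: List[List[Vector]]) -> List[Vector]:
    """Element-wise sum across ranks (stand-in for a collective all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Toy table: vocab size 8, hidden size 2, split across 2 ranks.
table = [[float(i), float(i) * 10] for i in range(8)]
shards = shard_table(table, world_size=2)
token_ids = [1, 6, 3]
partials = [lookup_on_rank(shards[r], r, token_ids) for r in range(2)]
embedded = all_reduce_sum(partials)
```

Because exactly one rank owns each token id, the summed result matches a lookup in the unsharded table, while each rank stores only its share of the weights.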

robertgshaw2-neuralmagic · 2024-06-12T19:10:16Z

vllm/model_executor/models/bert_embedding.py

+                                                  config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size,
+                                      eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)


Do we need dropout for inference?

All dropouts have been removed.

robertgshaw2-neuralmagic · 2024-06-12T19:12:17Z

vllm/model_executor/models/bert_embedding.py

+            bias=True,
+            quant_config=quant_config)
+
+        self.attn = Attention(


We need to use bidirectional attention here rather than causal attention

Do you mean implementing a new type of attention in vLLM? I'm not sure whether the current attention implementation can be used with different parameters.

@robertgshaw2-neuralmagic Hi, I'd like to hear your thoughts on this point. The framework of the BERT model is almost complete, but the BertSelfAttention output differs from the transformers implementation. After digging into the Attention implementation, I found that using bidirectional attention would require substantial changes to Attention.
I also found a related discussion (#3117 (comment)). I think this PR depends on #4942. What is your opinion?

@laishzh sorry, I missed this note!

Yes. Pull in the BertSelfAttention from that PR. #4942 should land this week
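
For readers unfamiliar with the distinction discussed in this thread, the sketch below builds both kinds of attention mask in plain Python. It only shows which positions may attend to which; it is not how vLLM's attention backends are implemented, and `causal_mask` and `bidirectional_mask` are illustrative names rather than vLLM functions.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Decoder-style (causal): a token sees only itself and earlier tokens.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len: int) -> List[List[bool]]:
    """Encoder-style (as in BERT): every token may attend to every token."""
    return [[True] * seq_len for _ in range(seq_len)]


causal = causal_mask(3)
full = bidirectional_mask(3)
```

With a causal mask the first token never sees later tokens, so a BERT-style encoder run through causal attention would produce hidden states that differ from the reference implementation.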

robertgshaw2-neuralmagic · 2024-06-12T19:23:38Z

Thanks! You're on the right track here

mgoin · 2024-06-12T20:07:44Z

vllm/model_executor/models/bert_embedding.py

+            for (param_name, weight_name, shard_id) in stacked_params_mapping:
+                if weight_name not in name:
+                    continue
+                name = name.replace(weight_name, param_name)
+
+                # Skip loading extra bias for GPTQ models.
+                if name.endswith(".bias") and name not in params_dict:
+                    continue
+
+                param = params_dict[name]
+                weight_loader = param.weight_loader
+                weight_loader(param, loaded_weight, shard_id)
+                break
+            else:


Is this a for-else loop? Please break this up as I find these pretty confusing

Yes, I was also confused the first time I saw it. I have just rewritten this part, so it should be easier to understand now. Do you have any further suggestions?
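
One way to avoid the for-else construct is to move the search into a helper that returns either a match or `None`, so the fallback becomes an ordinary `if`. The sketch below is an assumption about what such a rewrite could look like, not the code that was actually pushed; the mapping values and the names `find_stacked_target` and `plan_loading` are hypothetical, and the GPTQ bias check from the original snippet is omitted for brevity.

```python
from typing import List, Optional, Tuple

# (param_name, weight_name, shard_id) triples; the values are illustrative.
STACKED_PARAMS_MAPPING: List[Tuple[str, str, str]] = [
    ("qkv_proj", "query", "q"),
    ("qkv_proj", "key", "k"),
    ("qkv_proj", "value", "v"),
]


def find_stacked_target(name: str) -> Optional[Tuple[str, str]]:
    """Return (renamed_param, shard_id) if `name` is a stacked weight, else None."""
    for param_name, weight_name, shard_id in STACKED_PARAMS_MAPPING:
        if weight_name in name:
            return name.replace(weight_name, param_name), shard_id
    return None


def plan_loading(names: List[str]) -> List[Tuple[str, str, Optional[str]]]:
    """Map each checkpoint name to (checkpoint_name, target_param, shard_id)."""
    plan = []
    for name in names:
        target = find_stacked_target(name)
        if target is None:
            # Ordinary weight: load under its own name, no shard id.
            plan.append((name, name, None))
        else:
            renamed, shard_id = target
            plan.append((name, renamed, shard_id))
    return plan


plan = plan_loading([
    "layer.0.attention.self.query.weight",
    "layer.0.output.dense.weight",
])
```

The early `return` in the helper plays the role of `break`, and the `None` check plays the role of the `else` branch, which makes both paths explicit.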

mgoin · 2024-06-12T20:09:09Z

vllm/model_executor/models/bert_embedding.py

+                                                config.hidden_size)
+        self.token_type_embeddings = nn.Embedding(config.type_vocab_size,
+                                                  config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size,


Follow naming standards for layernorm

Suggested change

-        self.LayerNorm = nn.LayerNorm(config.hidden_size,
+        self.layernorm = nn.LayerNorm(config.hidden_size,

Followed the suggestion and added renaming of the LayerNorm parameters when loading weights.
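
Renaming the attribute means checkpoint keys that still contain `LayerNorm` no longer match the module's parameter names, so they have to be remapped at load time. A minimal sketch of such a remapping follows, assuming keys are dot-separated paths; the helper name `rename_layernorm_keys` and the placeholder string values are hypothetical and stand in for real tensors.

```python
from typing import Dict


def rename_layernorm_keys(state: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of `state` with 'LayerNorm' path components lowercased."""
    renamed = {}
    for name, value in state.items():
        parts = ["layernorm" if p == "LayerNorm" else p for p in name.split(".")]
        renamed[".".join(parts)] = value
    return renamed


# Placeholder values; a real checkpoint would hold tensors here.
checkpoint = {
    "embeddings.LayerNorm.weight": "w",
    "embeddings.LayerNorm.bias": "b",
    "embeddings.word_embeddings.weight": "e",
}
remapped = rename_layernorm_keys(checkpoint)
```

Matching whole path components rather than substrings avoids accidentally rewriting unrelated names that merely contain the same letters.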

mgoin · 2024-06-12T20:10:40Z

vllm/model_executor/models/bert_embedding.py

+        self.LayerNorm = nn.LayerNorm(config.hidden_size,
+                                      eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)


ditto on layernorm and dropout

mgoin · 2024-06-12T20:10:57Z

vllm/model_executor/models/bert_embedding.py

+        self.LayerNorm = nn.LayerNorm(config.hidden_size,
+                                      eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)


ditto on layernorm and dropout

laishzh · 2024-06-13T13:18:24Z

Thanks! You're on the right track here

Wow! Really appreciate your guidance.

mgoin · 2024-06-13T17:20:15Z

vllm/model_executor/models/bert_embedding.py

+        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
+                                                config.hidden_size)
+        self.token_type_embeddings = nn.Embedding(config.type_vocab_size,
+                                                  config.hidden_size)


Should these be VocabParallelEmbedding as well?

laishzh · 2024-09-09T15:27:44Z

@robertgshaw2-neuralmagic @mgoin Hi, I have just updated this work.
Because BertEmbeddingModel is an encoder-only architecture, I refactored EmbeddingModelRunner into a child class of EncoderDecoderModelRunner to reuse the encoder-decoder framework.

laishzh · 2024-09-09T15:34:44Z

vllm/worker/embedding_model_runner.py

        # Prepare PoolingMetadata.
-        assert model_input.seq_lens is not None
+        seq_lens = model_input.seq_lens\
+            if not self.model_config.is_encoder_model \


@robertgshaw2-neuralmagic @mgoin I still have a question here: how can we determine precisely whether a model is an encoder model or a decoder model? For example, the config.json of the Mistral model, which was implemented earlier, lacks fields such as is_decoder_model or is_encoder_model. Here is the link: config.json of Mistral.
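
One possible approach, offered only as a sketch and not as something the thread settled on, is to inspect the `architectures` list that Hugging Face config.json files carry and compare it against an allow-list of encoder-only architectures. The allow-list, the function name `is_encoder_only`, and the trimmed config strings below are all hypothetical.

```python
import json

# Hypothetical allow-list; a real project would maintain this per supported model.
ENCODER_ONLY_ARCHITECTURES = {"BertModel"}


def is_encoder_only(config_json: str) -> bool:
    """Guess encoder-only status from the `architectures` list in config.json."""
    config = json.loads(config_json)
    architectures = config.get("architectures") or []
    return any(arch in ENCODER_ONLY_ARCHITECTURES for arch in architectures)


# Trimmed, illustrative config contents.
bert_cfg = '{"architectures": ["BertModel"], "hidden_size": 768}'
mistral_cfg = '{"architectures": ["MistralForCausalLM"], "hidden_size": 4096}'
```

An explicit allow-list is conservative: an unknown architecture is treated as not encoder-only, so existing decoder models keep their current behavior.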

maxdebayser · 2024-09-23T22:37:47Z

Hi @laishzh, I'm working on another PR that is based on yours. In case it helps, I've solved the latest merge conflicts of your branch with main here: https://github.com/maxdebayser/vllm/tree/bert

maxdebayser · 2024-10-02T21:04:09Z

@laishzh @robertgshaw2-neuralmagic, I've resolved the latest merge conflicts here: https://github.com/maxdebayser/vllm/tree/bert

laishzh · 2024-10-07T04:28:24Z

@robertgshaw2-neuralmagic May I draw your attention to this PR? I have just reverted the change to EmbeddingModelBlockManager. One problem remains: how to distinguish the model architecture (encoder-only or decoder-only), which is needed to determine seq_len when pooling (#5447 (comment)). Please let me know if any changes are needed.

robertgshaw2-neuralmagic · 2024-10-11T00:26:03Z

Hey @laishzh - there were a few issues with this PR. Specifically, it uses the Encoder-Decoder pathway and the BertModel is not implemented properly (loading is not canonical, tensor parallelism does not work, and the pooling logic is not correct).

I need to get this landed ASAP, so I finished off the PR here: #9056

I will add you as a co-author.

laishzh · 2024-10-11T12:02:11Z

Hey @laishzh - there were a few issues with this PR. Specifically, it uses the Encoder-Decoder pathway and the BertModel is not implemented properly (loading is not canonical, tensor parallelism does not work, and the pooling logic is not correct).

I need to get this landed ASAP, so I finished off the PR here: #9056

I will add you as a co-author.

I'm OK with that. I'm very glad to see this feature supported in vLLM. I will look into the points you mentioned. Thanks for your work!

DarkLight1337 · 2024-10-23T08:43:02Z

Closing as superseded by #9056

laishzh force-pushed the main branch from c07e15f to b81fb8a June 12, 2024 10:05

robertgshaw2-neuralmagic self-requested a review June 12, 2024 19:03

robertgshaw2-neuralmagic reviewed Jun 12, 2024

mgoin reviewed Jun 12, 2024

mgoin reviewed Jun 19, 2024

afeldman-nm added 12 commits June 21, 2024 20:13

wip

f2dac1c

BART almost passing profile_run()

59caabe

wip bart

b8d5637

Merge branch 'main' into infra_enc_dec_cross_attn_reviews

5ce2dd0

Merge branch 'infra_enc_dec_cross_attn' into infra_enc_dec_model_runner_reviews

8b8c409

Merge branch 'infra_enc_dec_model_runner_bart' into infra_enc_dec_model_runner_reviews

3b95225

BART passes profile run

7d2fcf9

fixed prompt processing bug that was preventing inference from starting

6fd4c02

Merge branch 'main' into infra_enc_dec_model_runner_reviews

d58e8c8

wip bart-cnn summarization example

8f9ee62

fixed a number of bugs related to BART decode-phase; added support for the particular architecture alias used by bart-large-cnn

2d8429e

Merge branch 'main' into infra_enc_dec_model_runner_reviews

b7ff75f

robertgshaw2-neuralmagic mentioned this pull request Jun 25, 2024

[Roadmap] vLLM Roadmap Q3 2024 #5805

Closed

46 tasks

afeldman-nm added 7 commits June 25, 2024 02:13

BART e2e test runs but does not pass

919bf88

Merge branch 'main' into infra_enc_dec_model_runner_reviews

753bab0

Merge branch 'main' into infra_enc_dec_cross_attn_reviews

125e5dc

removed extra line

597526a

changed nested if/else to elif/else in xformers mask computation code

a178b7a

reorganized helper functions that were only being used for testing into tests/kernels/utils.py from vllm/utils.py

06c7f75

removed attention_type

47c9f39

laishzh marked this pull request as ready for review August 19, 2024 07:42

feat: modify test_embedding

612cf1a

maxdebayser mentioned this pull request Aug 28, 2024

Roberta embedding #7969

Closed

laishzh added 5 commits September 8, 2024 23:50

feat: bert embedding implemented, but still have some bugs with mistral,

e351bfd

feat: some changes on test_embedding.py

3ff2d36

Merge branch 'main' of https://github.com/vllm-project/vllm

776dcbd

# Conflicts: # vllm/core/embedding_model_block_manager.py

feat: fix lint

0ea4da1

feat: fix lint

15be7fa

laishzh commented Sep 9, 2024

noooop mentioned this pull request Sep 18, 2024

[RFC]: Support encode only models by Workflow Defined Engine #8453

Closed

1 task

Merge branch 'main' into bert

2c8a5b9

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>

laishzh and others added 2 commits September 26, 2024 23:23

Merge remote-tracking branch 'origin/main'

3fbfdf4

# Conflicts: # vllm/inputs/data.py

Merge branch 'upstream_main' into bert

57bdd60

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

simon-mo mentioned this pull request Oct 1, 2024

[Roadmap] vLLM Roadmap Q4 2024 #9006

Open

40 tasks

Merge branch 'upstream_main' into bert

107d9c2

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

laishzh added 2 commits October 7, 2024 00:45

Merge remote-tracking branch 'maxdebayser/bert'

352d8b2

feat: revert embedding_block_manager

04b0bc6

laishzh requested review from DarkLight1337 and ywang96 as code owners October 7, 2024 02:45

laishzh added 2 commits October 7, 2024 12:01

Merge branch 'origin/main'

6440795

feat: update with origin/main

80c1885

DarkLight1337 mentioned this pull request Oct 7, 2024

Support BERTModel (first encoder-only embedding model) #9056

Merged

DarkLight1337 closed this Oct 23, 2024
