Roberta embedding #7969

Closed
wants to merge 639 commits
639 commits
919bf88
BART e2e test runs but does not pass
afeldman-nm Jun 25, 2024
753bab0
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jun 25, 2024
125e5dc
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 25, 2024
597526a
removed extra line
afeldman-nm Jun 25, 2024
a178b7a
changed nested if/else to elif/else in xformers mask computation code
afeldman-nm Jun 25, 2024
06c7f75
reorganized helper functions that were only being used for testing in…
afeldman-nm Jun 25, 2024
47c9f39
removed attention_type
afeldman-nm Jun 25, 2024
2f0b05b
typing and formatting
afeldman-nm Jun 25, 2024
d23c284
typing and formatting; fixed escape sequences in comments
afeldman-nm Jun 25, 2024
1a6e5a3
moved make_tensor_with_pad() helper function back to vllm.utils
afeldman-nm Jun 25, 2024
e2a46e3
formatting
afeldman-nm Jun 25, 2024
d43141f
merge; a lot of formatting fixes to bart code but not fully passing
afeldman-nm Jun 25, 2024
5169a2a
removed unnecessary positions arguments from BART encoder, decoder fo…
afeldman-nm Jun 25, 2024
4400d77
some reformatting
afeldman-nm Jun 25, 2024
e61385d
fixed bug caused by overzealous refactoring
afeldman-nm Jun 25, 2024
41e31e8
BART with new explanatory comments & passing formatting tests
afeldman-nm Jun 25, 2024
ba4e2c1
Removed unnecessary position arguments from BART routine; formatting
afeldman-nm Jun 25, 2024
4dabe19
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 25, 2024
a5c28fc
Merge branch 'infra_enc_dec_cross_attn' into infra_enc_dec_model_runn…
afeldman-nm Jun 25, 2024
7ca0d7a
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 26, 2024
c24697f
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 27, 2024
75756b9
removed redundant elif
afeldman-nm Jun 27, 2024
bcccc34
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 27, 2024
c8f8d59
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 27, 2024
a501849
reverted unnecessarily vllm/utils.py changes
afeldman-nm Jun 27, 2024
83d474e
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 28, 2024
64981b5
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 28, 2024
8d36458
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 29, 2024
5ff9c76
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jun 30, 2024
2828aa7
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 1, 2024
65e47db
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 3, 2024
44c6270
manually merged BART code in from previous modelrunner attempt, it wo…
afeldman-nm Jul 3, 2024
b085795
Merge branch 'infra_enc_dec_cross_attn' into infra_enc_dec_model_runner2
afeldman-nm Jul 3, 2024
ba09fbc
refactored where a number of constants are stored, primarily constant…
afeldman-nm Jul 3, 2024
2f0eb9b
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 3, 2024
d81662c
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 4, 2024
22d013c
Merge branch 'infra_enc_dec_cross_attn' into infra_enc_dec_model_runner2
afeldman-nm Jul 4, 2024
13f5b50
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 5, 2024
5dbebbc
Update vllm/attention/backends/torch_sdpa.py
afeldman-nm Jul 8, 2024
07df0e1
Update vllm/attention/layer.py
afeldman-nm Jul 8, 2024
7e0bc57
Merge branch 'main' into infra_enc_dec_cross_attn_reviews
afeldman-nm Jul 8, 2024
e837a73
Merge branch 'infra_enc_dec_cross_attn_reviews' into infra_enc_dec_cr…
afeldman-nm Jul 8, 2024
7ce9a51
merged in first pieces of woosuk feedback & latest main; formatting
afeldman-nm Jul 8, 2024
9ae6728
fixed specific point-changes requested by woosuk
afeldman-nm Jul 8, 2024
a1bf652
test_encoder_decoder_attn.py cleanup
afeldman-nm Jul 8, 2024
4f27946
tests/kernels/utils.py cleanup
afeldman-nm Jul 8, 2024
5ee30fe
vllm/attention/backends/abstract.py cleanup
afeldman-nm Jul 8, 2024
45fc9f7
vllm/attention/backends/blocksparse_attn.py cleanup
afeldman-nm Jul 8, 2024
097aff2
vllm/attention/backends/flash_attn.py cleanup
afeldman-nm Jul 8, 2024
d8a692b
cleaning up a number of backends & backends utils.py
afeldman-nm Jul 8, 2024
5df73fc
xformers backend cleanup
afeldman-nm Jul 8, 2024
6cd595c
formatting
afeldman-nm Jul 8, 2024
db49d48
Merge branch 'infra_enc_dec_cross_attn' into infra_enc_dec_model_runner2
afeldman-nm Jul 8, 2024
88e284a
merge from main
afeldman-nm Jul 8, 2024
c90140f
Merge branch 'main' into infra_enc_dec_model_runner2
afeldman-nm Jul 8, 2024
bd14d29
wip scheduler
afeldman-nm Jul 9, 2024
2c80185
formatting
afeldman-nm Jul 9, 2024
4c01f13
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 9, 2024
c95adf5
scheduler supports encoder-/cross-attention & passes existing schedul…
afeldman-nm Jul 9, 2024
d1343aa
scheduler test passes
afeldman-nm Jul 9, 2024
b4a461d
formatting
afeldman-nm Jul 9, 2024
6a71f8f
formatting
afeldman-nm Jul 9, 2024
fe7786c
Merge remote-tracking branch 'bert_deps/afeldman-nm/infra_enc_dec_mod…
laishzh Jul 10, 2024
9a63f51
wip model runner
afeldman-nm Jul 10, 2024
f649944
Merge branch 'main' into infra_enc_dec_model_runner
afeldman-nm Jul 10, 2024
685604c
wip modelrunner
afeldman-nm Jul 12, 2024
9c898f5
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 12, 2024
196f30c
enc/dec decoder test working, sans sampling check
afeldman-nm Jul 12, 2024
c5ceb23
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 13, 2024
9ce2da4
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 13, 2024
447a5c7
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 15, 2024
3d5bb88
EncoderDecoderModelInput correctly handles encoder token/position fields
afeldman-nm Jul 15, 2024
db5539a
format
afeldman-nm Jul 15, 2024
760355b
bart test skipped on CPU version of vllm
afeldman-nm Jul 15, 2024
590a240
Formatting
afeldman-nm Jul 15, 2024
8b8d981
refactored AttentionType and related imports; skip BART test definiti…
afeldman-nm Jul 15, 2024
ff940f7
formatting
afeldman-nm Jul 15, 2024
64d7198
wip
afeldman-nm Jul 15, 2024
0cca164
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 15, 2024
94c083c
Merge branch 'infra_enc_dec_model_runner_reviews' into infra_enc_dec_…
afeldman-nm Jul 15, 2024
83c5c43
prompt type checks
afeldman-nm Jul 15, 2024
10ed714
Format
afeldman-nm Jul 15, 2024
78d3d3c
modified LLM.generate() error message
afeldman-nm Jul 15, 2024
6c95380
wip engine is_encoder_decoder() setting
afeldman-nm Jul 15, 2024
304caed
formatting
afeldman-nm Jul 15, 2024
7b0803b
formatting?
afeldman-nm Jul 15, 2024
5525511
Sequence may be constructed with encoder/decoder LLMInput configurations
afeldman-nm Jul 15, 2024
dd4031c
wip but having wllm.commit_id error
afeldman-nm Jul 15, 2024
8dccaa5
correctly constructing enc/dec sequences
afeldman-nm Jul 15, 2024
336a77d
formatting
afeldman-nm Jul 15, 2024
46397c7
wip
afeldman-nm Jul 15, 2024
f85997b
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 15, 2024
251f899
wip
afeldman-nm Jul 15, 2024
9141347
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 15, 2024
ddaf0ad
wip
afeldman-nm Jul 16, 2024
54ff142
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 16, 2024
92d9f48
conftest: encoder/decoder example prompts
afeldman-nm Jul 16, 2024
c5846ac
Hfrunner greedy logprobs limit
afeldman-nm Jul 16, 2024
374880f
input preparation now includes encoder-oriented input setup:
afeldman-nm Jul 16, 2024
796d7a3
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 16, 2024
42ac66b
VllmRunner encoder/decoder methods
afeldman-nm Jul 16, 2024
850a97e
bart parallel vocab
afeldman-nm Jul 16, 2024
3c7e19d
zip enc/dec prompts; formatting
afeldman-nm Jul 16, 2024
e534ffc
wip
afeldman-nm Jul 16, 2024
97d81f0
encoder/decoder input processing; formatting
afeldman-nm Jul 16, 2024
87ed3b6
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 16, 2024
713d095
incorporated encoder sequence into request-add functionality
afeldman-nm Jul 16, 2024
aea8d34
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
159c7bc
fixed decoder-only bug
afeldman-nm Jul 17, 2024
16c9aa2
bugfix
afeldman-nm Jul 17, 2024
03aea18
wip
afeldman-nm Jul 17, 2024
ef80c85
wip
afeldman-nm Jul 17, 2024
f8dd4a5
fixed scheduler bug
afeldman-nm Jul 17, 2024
c2ff615
format
afeldman-nm Jul 17, 2024
31127fa
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
1c6e06d
bugfix
afeldman-nm Jul 17, 2024
0cc14ab
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
3656dc6
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
aee5f16
fixed sequence bug
afeldman-nm Jul 17, 2024
ef94623
added examples utils w/ context manager for backend override; applied…
afeldman-nm Jul 17, 2024
50ad5ff
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
b277180
formatting
afeldman-nm Jul 17, 2024
cac6283
added encoder/decoder example to examples test
afeldman-nm Jul 17, 2024
f54f276
wip refactoring
afeldman-nm Jul 17, 2024
597a07d
refactor
afeldman-nm Jul 17, 2024
9f5a02c
RequestOutput & SequenceGroup now include encoder prompt in output, a…
afeldman-nm Jul 17, 2024
94c904f
wip parallel bart but encountering GPU count issue
afeldman-nm Jul 17, 2024
9da8fb3
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
1f8c52f
tweaks to enc/dec example
afeldman-nm Jul 17, 2024
1808846
formatting
afeldman-nm Jul 17, 2024
f15eacf
wip
afeldman-nm Jul 17, 2024
6c940f8
modified HF behavior in BART test to be truly greedy
afeldman-nm Jul 17, 2024
949ac02
formatting
afeldman-nm Jul 17, 2024
88c058e
wip parallelizing BART
afeldman-nm Jul 17, 2024
31e335f
wip activation parallelization
afeldman-nm Jul 17, 2024
c092ed4
merged in upstream changes; left some formatting issues which I expec…
afeldman-nm Jul 17, 2024
d7bd617
Merge branch 'infra_enc_dec_model_runner' into infra_enc_dec_model_ru…
afeldman-nm Jul 17, 2024
69f0379
wip:
afeldman-nm Jul 17, 2024
9fdd047
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 17, 2024
584c01e
Merge branch 'infra_enc_dec_model_runner_reviews' into infra_enc_dec_…
afeldman-nm Jul 17, 2024
7ace684
Merge remote-tracking branch 'bert_deps/afeldman-nm/infra_enc_dec_mod…
laishzh Jul 18, 2024
41ccf0c
wip merge
afeldman-nm Jul 20, 2024
ffa99b2
additional merge
afeldman-nm Jul 20, 2024
a22f56c
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 22, 2024
c00e0a8
CommonMetadataBuilder sets block_tables constructor arg of metadata
afeldman-nm Jul 22, 2024
32967c1
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 22, 2024
a33b501
Merge branch 'infra_enc_dec_model_runner' into infra_enc_dec_model_ru…
afeldman-nm Jul 22, 2024
a16cabb
equalized some generation/sampling config settings between enc/dec HF…
afeldman-nm Jul 22, 2024
abbb427
Merge branch 'infra_enc_dec_model_runner' into infra_enc_dec_model_ru…
afeldman-nm Jul 22, 2024
00198a6
BART MLPs parallelized
afeldman-nm Jul 22, 2024
fb3227f
parallelized BART learned positional embedding
afeldman-nm Jul 22, 2024
e5bb9de
all attention layer output linears are parallelized
afeldman-nm Jul 22, 2024
74abe22
encoder attention & decoder self-attention parallelized
afeldman-nm Jul 22, 2024
9bbed43
parallelized LM head
afeldman-nm Jul 22, 2024
fdf71de
parallelized enc/dec cross-attention, using a slight hack
afeldman-nm Jul 22, 2024
3551b6b
fixed bug where underlying Attention was constructed using full head-…
afeldman-nm Jul 22, 2024
b174c7a
bart is parallelized, modulo an unfortunate hack for QKVParallelLinea…
afeldman-nm Jul 22, 2024
c43a6ed
commented out BART TP=4
afeldman-nm Jul 22, 2024
a408289
Merge remote-tracking branch 'bert_deps/afeldman-nm/infra_enc_dec_mod…
laishzh Jul 22, 2024
b90b6b6
upstream merge
afeldman-nm Jul 22, 2024
14831b0
Merge branch 'infra_enc_dec_model_runner_reviews' into infra_enc_dec_…
afeldman-nm Jul 22, 2024
427032a
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 22, 2024
c51a168
fixed bug in how conftest was handling HF encoder/decoder outputs; di…
afeldman-nm Jul 23, 2024
b01937f
set up None/empty str tests which are not passing
afeldman-nm Jul 23, 2024
48a742d
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 23, 2024
b283544
Merge branch 'infra_enc_dec_model_runner_correctness' into infra_enc_…
afeldman-nm Jul 23, 2024
059273f
wip
afeldman-nm Jul 23, 2024
229847b
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 23, 2024
7e7bbd9
deleted unnecessary dependency
afeldman-nm Jul 23, 2024
4a6e39e
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 24, 2024
aa01d71
empty-string decoder input is now handled for encoder/decoder
afeldman-nm Jul 24, 2024
0b29fd2
enc/dec handles empty str and None decoder prompts correctly
afeldman-nm Jul 24, 2024
dd784b5
typing fix
afeldman-nm Jul 24, 2024
61d2ad2
fixed bugs in handling non-text formats for individual prompts
afeldman-nm Jul 24, 2024
f36ffb5
example includes prompt zipper
afeldman-nm Jul 24, 2024
c493d40
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 24, 2024
be58d8a
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 24, 2024
02114bd
_free_seq_group() -> _free_seq_group_cross_attn_blocks()
afeldman-nm Jul 24, 2024
5a270ff
refactoring
afeldman-nm Jul 24, 2024
ed4a56b
formatting
afeldman-nm Jul 24, 2024
4b5b2cf
removed unnecessary argument reordering
afeldman-nm Jul 24, 2024
d82b273
enc/dec example comments'
afeldman-nm Jul 24, 2024
0af58ec
responses to feedback
afeldman-nm Jul 24, 2024
bed9bcd
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 25, 2024
47b4eb2
fixed bug caused by upstream refactoring
afeldman-nm Jul 25, 2024
393515e
formatting
afeldman-nm Jul 25, 2024
fb5a2bc
upstream merge
afeldman-nm Jul 25, 2024
c2cc010
Removed lora from enc/dec model runner
afeldman-nm Jul 25, 2024
175ea95
Merge branch 'main' into infra_enc_dec_model_runner_reviews
afeldman-nm Jul 25, 2024
3327e5b
removed lora & vision & mm code from enc/dec modelrunner
afeldman-nm Jul 25, 2024
47c5548
checked out examples/offline_inference.py from main
afeldman-nm Jul 25, 2024
1bb7ad9
updated RequestOutput docstring
afeldman-nm Jul 25, 2024
035d90d
updated RequestOutput docstring
afeldman-nm Jul 25, 2024
64685ac
Sequence docstring
afeldman-nm Jul 25, 2024
d1751db
removed flashinfer references from enc/dec modelrunner
afeldman-nm Jul 25, 2024
f0abcc2
format
afeldman-nm Jul 25, 2024
4bb7fc4
removed chunked prefill logic/docstring text from enc/dec modelrunner
afeldman-nm Jul 25, 2024
a936faa
removed prefix caching from enc/dec modelrunner
afeldman-nm Jul 25, 2024
59bf8c4
Merge remote-tracking branch 'bert_deps/afeldman-nm/infra_enc_dec_mod…
laishzh Jul 25, 2024
12a9869
Merge remote-tracking branch 'origin/main'
laishzh Aug 13, 2024
53c5148
(WIP)feat: EmbeddingModelRunner support encoder model
laishzh Aug 13, 2024
63fb7a5
WIP: bert embedding
laishzh Aug 13, 2024
37bcba0
feat: full pipeline
laishzh Aug 14, 2024
76b47fb
chore: recover
laishzh Aug 15, 2024
aca786e
feat: default bos_token_id of encoder model
laishzh Aug 15, 2024
682c455
feat: recover sequence
laishzh Aug 15, 2024
872e795
feat: embedding model forward
laishzh Aug 15, 2024
a0ad0df
chore: recover unchanged files
laishzh Aug 16, 2024
f215884
chore: recover
laishzh Aug 16, 2024
7657af3
feat: fix lint
laishzh Aug 16, 2024
91e23d8
feat: fix lint
laishzh Aug 16, 2024
0b3f55c
feat: fix lint
laishzh Aug 16, 2024
275f49d
feat: embedding model prompt
laishzh Aug 16, 2024
ce9a599
feat: bos_token_id
laishzh Aug 16, 2024
7e1196d
fix: fix hint
laishzh Aug 17, 2024
b99d783
feat: remove embedding block space manager
laishzh Aug 17, 2024
b76da51
feat: enc_dec_runner base
laishzh Aug 19, 2024
e15d0cc
Merge branch 'main' into main
laishzh Aug 19, 2024
8b107a2
feat: fix lint
laishzh Aug 19, 2024
bfd7ec9
feat: model input
laishzh Aug 19, 2024
6f006f5
chore: fix lint
laishzh Aug 19, 2024
37f698b
feat: move BertEmbeddingModel to the end of file
laishzh Aug 19, 2024
d098607
feat: remove embedding_model_block_manager.py
laishzh Aug 19, 2024
fc1f2b7
chore: fix lint
laishzh Aug 19, 2024
612cf1a
feat: modify test_embedding
laishzh Aug 27, 2024
7d0ecb9
Add support for Roberta embedding models
maxdebayser Aug 28, 2024
e351bfd
feat: bert embedding implemented, but still have some bugs with mistral,
laishzh Sep 8, 2024
3ff2d36
feat: some changes on test_embedding.py
laishzh Sep 9, 2024
776dcbd
Merge branch 'main' of https://github.com/vllm-project/vllm
laishzh Sep 9, 2024
0ea4da1
feat: fix lint
laishzh Sep 9, 2024
15be7fa
feat: fix lint
laishzh Sep 9, 2024
afd997b
Merge branch '5447' into roberta_embedding
maxdebayser Sep 23, 2024
464a90f
Merge branch 'main' into bert
maxdebayser Sep 23, 2024
30c875e
Merge branch 'bert' into roberta_embedding
maxdebayser Sep 23, 2024
2c8a5b9
Merge branch 'main' into bert
maxdebayser Sep 23, 2024
08f1781
add head size 32
maxdebayser Sep 23, 2024
3fbfdf4
Merge remote-tracking branch 'origin/main'
laishzh Sep 26, 2024
57bdd60
Merge branch 'upstream_main' into bert
maxdebayser Sep 26, 2024
a14b4e3
Merge branch 'bert' into roberta_embedding
maxdebayser Sep 26, 2024
107d9c2
Merge branch 'upstream_main' into bert
maxdebayser Oct 2, 2024
e7044a6
Merge branch 'bert' into roberta_embedding
maxdebayser Oct 2, 2024
352d8b2
Merge remote-tracking branch 'maxdebayser/bert'
laishzh Oct 6, 2024
04b0bc6
feat: revert embedding_block_manager
laishzh Oct 6, 2024
6440795
Merge branch 'origin/main'
laishzh Oct 7, 2024
80c1885
feat: update with origin/main
laishzh Oct 7, 2024
30b0f21
Merge branch 'upstream_main' into bert
maxdebayser Oct 8, 2024
5793373
Merge branch 'bert' into roberta_embedding
maxdebayser Oct 8, 2024
935c58d
add registry of encoder-only models
maxdebayser Oct 11, 2024
ddbae13
Merge branch 'upstream_main' into roberta_embedding
maxdebayser Oct 11, 2024
44a4c04
Merge branch 'upstream_main' into roberta_embedding
maxdebayser Oct 14, 2024
6 changes: 6 additions & 0 deletions csrc/attention/attention_kernels.cu
@@ -739,6 +739,9 @@ void paged_attention_v1_launcher(
// NOTE(woosuk): To reduce the compilation time, we only compile for the
// head sizes that we use in the model. However, we can easily extend this
// to support any head size which is a multiple of 16.
case 32:
LAUNCH_PAGED_ATTENTION_V1(32);
break;
case 64:
LAUNCH_PAGED_ATTENTION_V1(64);
break;
@@ -903,6 +906,9 @@ void paged_attention_v2_launcher(
// NOTE(woosuk): To reduce the compilation time, we only compile for the
// head sizes that we use in the model. However, we can easily extend this
// to support any head size which is a multiple of 16.
case 32:
LAUNCH_PAGED_ATTENTION_V2(32);
break;
case 64:
LAUNCH_PAGED_ATTENTION_V2(64);
break;
6 changes: 6 additions & 0 deletions csrc/cpu/attention.cpp
@@ -375,6 +375,9 @@ void paged_attention_v1_impl_launcher(
int* seq_lens_ptr = seq_lens.data_ptr<int>();

switch (head_size) {
case 32:
LAUNCH_V1_ATTENTION_KERNEL(T, 32, BLOCK_SIZE);
break;
case 64:
LAUNCH_V1_ATTENTION_KERNEL(T, 64, BLOCK_SIZE);
break;
@@ -692,6 +695,9 @@ void paged_attention_v2_impl_launcher(
int* seq_lens_ptr = seq_lens.data_ptr<int>();

switch (head_size) {
case 32:
LAUNCH_V2_ATTENTION_KERNEL(T, 32, BLOCK_SIZE);
break;
case 64:
LAUNCH_V2_ATTENTION_KERNEL(T, 64, BLOCK_SIZE);
break;
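These case 32 branches are needed because compact encoder checkpoints fall below the head size of 64 that the launchers previously compiled for. A minimal sketch of the arithmetic (the checkpoint named here is an illustrative assumption, not something this PR adds):

from transformers import AutoConfig

# Hypothetical example: a small BERT-family encoder such as
# BAAI/bge-small-en-v1.5 reports hidden_size=384 with 12 attention heads,
# giving a per-head dimension of 384 // 12 = 32, which the switches above
# must now be able to dispatch.
config = AutoConfig.from_pretrained("BAAI/bge-small-en-v1.5")
head_size = config.hidden_size // config.num_attention_heads
assert head_size == 32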
16 changes: 16 additions & 0 deletions examples/offline_inference_bert_embedding.py
@@ -0,0 +1,16 @@
from vllm import LLM

# Sample prompts.
prompts = [
"This is an example sentence.",
"Another example sentence.",
]

# Create an LLM.
model = LLM(model="bert-base-uncased", enforce_eager=True)
outputs = model.encode(prompts)

# Print the outputs.
for output in outputs:
print(output.outputs.embedding) # list of 768 floats
print(len(output.outputs.embedding))
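Once the vectors are materialized as tensors they can be compared directly; a short follow-on sketch using PyTorch cosine similarity (not part of the example file itself):

import torch
import torch.nn.functional as F

# Compare the embeddings of the two sample prompts above.
emb_a = torch.tensor(outputs[0].outputs.embedding)
emb_b = torch.tensor(outputs[1].outputs.embedding)
print(F.cosine_similarity(emb_a, emb_b, dim=0).item())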
13 changes: 10 additions & 3 deletions examples/offline_inference_embedding.py
@@ -1,15 +1,22 @@
from vllm import LLM
+from vllm.inputs import build_decoder_prompts

# Sample prompts.
-prompts = [
+prompts = build_decoder_prompts([
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
-]
+])

# Create an LLM.
-model = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
+model = LLM(
+    model="intfloat/e5-mistral-7b-instruct",
+    enforce_eager=True,
+    # NOTE: sliding_window is not supported by encoder_decoder_model
+    disable_sliding_window=True,
+    gpu_memory_utilization=0.95,
+)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
36 changes: 30 additions & 6 deletions tests/models/embedding/language/test_embedding.py
@@ -6,9 +6,22 @@
import torch
import torch.nn.functional as F

+from vllm.inputs import build_decoder_prompts
+
MODELS = [
-    "intfloat/e5-mistral-7b-instruct",
-    "BAAI/bge-multilingual-gemma2",
+    {
+        "name": "intfloat/e5-mistral-7b-instruct",
+        "is_decoder_only": True
+    },
+    {
+        "name": "BAAI/bge-multilingual-gemma2",
+        "is_decoder_only": True
+    },
+    {
+        "name": "bert-base-uncased",
+        "is_decoder_only": False,
+        "max_model_len": 512
+    },
]


@@ -26,7 +39,7 @@ def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
-    model: str,
+    model: dict,
    dtype: str,
) -> None:
    # The example_prompts has ending "\n", for example:
@@ -37,11 +50,22 @@
    # So we need to strip the input texts to avoid test failing.
    example_prompts = [str(s).strip() for s in example_prompts]

-    with hf_runner(model, dtype=dtype, is_embedding_model=True) as hf_model:
+    model_name = model["name"]
+    is_decoder_only = model["is_decoder_only"]
+    max_model_len = model.get("max_model_len", 1024)
+    with hf_runner(model_name, dtype=dtype,
+                   is_embedding_model=True) as hf_model:
        hf_outputs = hf_model.encode(example_prompts)

-    with vllm_runner(model, dtype=dtype) as vllm_model:
-        vllm_outputs = vllm_model.encode(example_prompts)
+    with vllm_runner(
+            model_name,
+            dtype=dtype,
+            disable_sliding_window=True,
+            max_model_len=max_model_len,
+    ) as vllm_model:
+        prompt_inputs = build_decoder_prompts(
+            example_prompts) if is_decoder_only else example_prompts
+        vllm_outputs = vllm_model.encode(prompt_inputs)

    similarities = compare_embeddings(hf_outputs, vllm_outputs)
    all_similarities = torch.stack(similarities)
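compare_embeddings is defined earlier in this test file, outside the hunk; a plausible sketch of what it computes, assuming one cosine-similarity score per HF/vLLM output pair:

def compare_embeddings(hf_outputs, vllm_outputs):
    # Values close to 1.0 mean the vLLM embedding agrees with the
    # HuggingFace reference for that prompt.
    return [
        F.cosine_similarity(torch.tensor(hf), torch.tensor(vllm), dim=0)
        for hf, vllm in zip(hf_outputs, vllm_outputs)
    ]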
2 changes: 1 addition & 1 deletion vllm/attention/ops/paged_attn.py
@@ -34,7 +34,7 @@ class PagedAttention:

@staticmethod
def get_supported_head_sizes() -> List[int]:
-        return [64, 80, 96, 112, 120, 128, 192, 256]
+        return [32, 64, 80, 96, 112, 120, 128, 192, 256]
Review comment from the PR author on this change:

TODO: It's strange that just adding another head size here makes the code run. Perhaps this is actually a silent failure and the actual kernel has to be added somewhere.


@staticmethod
def get_kv_cache_shape(
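On the TODO above: this list and the compiled kernels are maintained separately, so the registry can advertise a head size the kernels cannot actually serve. A hedged sanity-check sketch:

from vllm.attention.ops.paged_attn import PagedAttention

# The Python-side registry now advertises head size 32...
assert 32 in PagedAttention.get_supported_head_sizes()

# ...but this proves nothing about the kernels themselves: without the
# `case 32:` branches added above, the C++ switch would reach its
# unsupported-head-size default at runtime. Running a short encode with a
# head-size-32 model is the real end-to-end check.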
4 changes: 4 additions & 0 deletions vllm/config.py
@@ -566,6 +566,10 @@ def is_encoder_decoder_model(self) -> bool:
(hasattr(self.hf_config, "text_config") and getattr(
self.hf_config.text_config, "is_encoder_decoder", False)))

@property
def is_encoder_model(self) -> bool:
return ModelRegistry.is_encoder_model(self.hf_config.architectures)

@property
def is_embedding_model(self) -> bool:
"""Extract the embedding model flag."""
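A sketch of how the new property resolves, assuming ModelRegistry.is_encoder_model (added elsewhere in this PR) looks up hf_config.architectures in the registry of encoder-only models; the architecture names below are illustrative:

from vllm.model_executor.models import ModelRegistry

print(ModelRegistry.is_encoder_model(["BertEmbeddingModel"]))  # True
print(ModelRegistry.is_encoder_model(["LlamaForCausalLM"]))    # False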
7 changes: 7 additions & 0 deletions vllm/core/placeholder_block_space_manager.py
@@ -63,9 +63,16 @@ def free(self, seq: Sequence) -> None:
# No operation on free
return

def free_cross(self, seq: Sequence) -> None:
# No operation on free
return

def get_block_table(self, seq: Sequence) -> List[int]:
return None # type: ignore

def get_cross_block_table(self, seq: Sequence) -> List[int]:
return None # type: ignore

def get_num_free_gpu_blocks(self) -> int:
return 1

7 changes: 5 additions & 2 deletions vllm/inputs/__init__.py
@@ -1,7 +1,8 @@
from .data import (EncoderDecoderLLMInputs, ExplicitEncoderDecoderPrompt,
                   LLMInputs, PromptType, SingletonPrompt, TextPrompt,
-                   TokensPrompt, build_explicit_enc_dec_prompt,
-                   to_enc_dec_tuple_list, zip_enc_dec_prompts)
+                   TokensPrompt, build_decoder_prompt, build_decoder_prompts,
+                   build_explicit_enc_dec_prompt, to_enc_dec_tuple_list,
+                   zip_enc_dec_prompts)
from .registry import InputContext, InputRegistry

INPUT_REGISTRY = InputRegistry()
@@ -21,6 +22,8 @@
"ExplicitEncoderDecoderPrompt",
"LLMInputs",
"EncoderDecoderLLMInputs",
"build_decoder_prompt",
"build_decoder_prompts",
"build_explicit_enc_dec_prompt",
"to_enc_dec_tuple_list",
"zip_enc_dec_prompts",
12 changes: 12 additions & 0 deletions vllm/inputs/data.py
@@ -228,6 +228,18 @@ def to_enc_dec_tuple_list(
for enc_dec_prompt in enc_dec_prompts]


def build_decoder_prompt(
prompt: _T2, ) -> ExplicitEncoderDecoderPrompt[SingletonPrompt, _T2]:
return build_explicit_enc_dec_prompt(encoder_prompt="",
decoder_prompt=prompt)


def build_decoder_prompts(
prompts: Iterable[_T2],
) -> List[ExplicitEncoderDecoderPrompt[SingletonPrompt, _T2]]:
return [build_decoder_prompt(prompt) for prompt in prompts]


def __getattr__(name: str):
if name == "PromptInput":
import warnings
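These helpers wrap a plain prompt in an explicit encoder/decoder structure with an empty encoder prompt, so decoder-only embedding models can flow through the same encoder/decoder code path. A quick usage sketch:

from vllm.inputs import build_decoder_prompt

prompt = build_decoder_prompt("The capital of France is")
# ExplicitEncoderDecoderPrompt is a TypedDict, so the result is a dict:
print(prompt["decoder_prompt"])  # "The capital of France is"
print(prompt["encoder_prompt"])  # "" (empty encoder side)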
19 changes: 14 additions & 5 deletions vllm/inputs/preprocess.py
@@ -25,6 +25,7 @@
DecoderPromptComponents = Tuple[Optional[str], Optional[List[int]],
Optional["MultiModalDataDict"],
Optional[Dict[str, Any]]]
_DEFAULT_BOS_TOKEN_ID = 1


class InputPreprocessor:
@@ -54,7 +55,13 @@ def get_bos_token_id(self,
                "is not initialized")
            return None

-        return self.tokenizer.get_lora_tokenizer(lora_request).bos_token_id
+        bos_token_id = self.tokenizer.get_lora_tokenizer(
+            lora_request).bos_token_id
+
+        if bos_token_id is None and self.model_config.is_encoder_model:
+            bos_token_id = _DEFAULT_BOS_TOKEN_ID
+
+        return bos_token_id

def get_eos_token_id(self,
lora_request: Optional[LoRARequest] = None
@@ -86,9 +93,10 @@ def get_decoder_start_token_id(self) -> Optional[int]:
dec_start_token_id = getattr(self.model_config.hf_config,
'decoder_start_token_id', None)
if dec_start_token_id is None:
-            print_warning_once("Falling back on <BOS> for decoder start token "
-                               "id because decoder start token id is not "
-                               "available.")
+            if not self.model_config.is_encoder_model:
+                logger.warning(
+                    "Falling back on <BOS> for decoder start token id "
+                    "because decoder start token id is not available.")
dec_start_token_id = self.get_bos_token_id()

return dec_start_token_id
@@ -577,4 +585,5 @@ async def preprocess_async(
        )

    def is_encoder_decoder_model(self):
-        return self.model_config.is_encoder_decoder_model
+        return self.model_config.is_encoder_decoder_model \
+            or self.model_config.is_encoder_model
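The BOS fallback matters because encoder-only tokenizers generally define no BOS token at all; a short illustration with HuggingFace tokenizers (the checkpoints are examples, and _DEFAULT_BOS_TOKEN_ID = 1 is a sentinel rather than a token these vocabularies reserve for BOS):

from transformers import AutoTokenizer

# BERT-style vocabularies use [CLS]/[SEP] rather than BOS/EOS, so
# bos_token_id is None and get_bos_token_id() would otherwise have
# nothing to seed the decoder sequence with.
print(AutoTokenizer.from_pretrained("bert-base-uncased").bos_token_id)  # None

# RoBERTa does define <s>, so no fallback is needed there.
print(AutoTokenizer.from_pretrained("roberta-base").bos_token_id)  # 0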
12 changes: 12 additions & 0 deletions vllm/model_executor/layers/pooler.py
@@ -12,6 +12,7 @@ class PoolingType(IntEnum):
"""Enumeration for different types of pooling methods."""
LAST = 0
ALL = 1
MEAN = 2


class Pooler(nn.Module):
@@ -50,6 +51,17 @@ def forward(
for prompt_len in prompt_lens:
pooled_data.append(hidden_states[offset:offset + prompt_len])
offset += prompt_len
elif self.pooling_type == PoolingType.MEAN:
# Calculate mean pooling
cumsum = torch.cumsum(hidden_states, dim=0)
start_indices = torch.cat([
torch.tensor([0], device=hidden_states.device),
torch.cumsum(prompt_lens[:-1], dim=0)
])
end_indices = torch.cumsum(prompt_lens, dim=0)
pooled_data = (
cumsum[end_indices - 1] - cumsum[start_indices] +
hidden_states[start_indices]) / prompt_lens.unsqueeze(1)
else:
raise ValueError(f"Invalid pooling type: {self.pooling_type}")

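The MEAN branch avoids a Python loop over prompts: for a prompt occupying rows [start, end) of the flattened hidden_states, its sum is cumsum[end - 1] - cumsum[start] + hidden_states[start], an inclusive-prefix-sum difference with the first row added back. A small self-contained check of that identity (the shapes and lengths are made up for illustration):

import torch

torch.manual_seed(0)
hidden_states = torch.randn(9, 4)      # 9 tokens total, hidden size 4
prompt_lens = torch.tensor([2, 3, 4])  # three prompts, flattened together

cumsum = torch.cumsum(hidden_states, dim=0)
start_indices = torch.cat([
    torch.tensor([0]),
    torch.cumsum(prompt_lens[:-1], dim=0),
])
end_indices = torch.cumsum(prompt_lens, dim=0)
pooled = (cumsum[end_indices - 1] - cumsum[start_indices] +
          hidden_states[start_indices]) / prompt_lens.unsqueeze(1)

# Reference: naive per-prompt means computed with a loop.
expected = torch.stack([
    hidden_states[s:s + n].mean(dim=0)
    for s, n in zip(start_indices.tolist(), prompt_lens.tolist())
])
assert torch.allclose(pooled, expected, atol=1e-6)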