[Rel Eng] Upstream sync 2024 06 11 #298

Merged: 93 commits on Jun 11, 2024

Commits
4b41095
[CI/Build] CMakeLists: build all extensions' cmake targets at the sam…
dtrifiro Jun 1, 2024
045812f
[Kernel] Refactor CUTLASS kernels to always take scales that reside o…
tlrmchlsmth Jun 1, 2024
db09745
[Kernel] Update Cutlass fp8 configs (#5144)
varun-sundar-rabindranath Jun 1, 2024
46b6b26
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> sav…
dashanji Jun 1, 2024
5b5c2b9
[Bugfix] Fix call to init_logger in openai server (#4765)
NadavShmayo Jun 1, 2024
cb6b7a0
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)
chenqianfzh Jun 1, 2024
9c2a759
[Bugfix] Remove deprecated @abstractproperty (#5174)
zhuohan123 Jun 1, 2024
fd82eff
[Bugfix]: Fix issues related to prefix caching example (#5177) (#5180)
Delviet Jun 1, 2024
5b6b8ed
[BugFix] Prevent `LLM.encode` for non-generation Models (#5184)
robertgshaw2-redhat Jun 1, 2024
15650a3
Update test_ignore_eos (#4898)
simon-mo Jun 2, 2024
dc64b07
[Frontend][OpenAI] Support for returning max_model_len on /v1/models …
Avinash-Raj Jun 2, 2024
bfc6bc7
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#…
divakar-amd Jun 2, 2024
5008643
[Misc] Simplify code and fix type annotations in `conftest.py` (#5118)
DarkLight1337 Jun 2, 2024
c070e44
[Core] Support image processor (#4197)
DarkLight1337 Jun 3, 2024
314398c
[Core] Remove unnecessary copies in flash attn backend (#5138)
Yard1 Jun 3, 2024
1ebb772
[Kernel] Pass a device pointer into the quantize kernel for the scale…
tlrmchlsmth Jun 3, 2024
48e8e3f
[CI/BUILD] enable intel queue for longer CPU tests (#4113)
zhouyuan Jun 3, 2024
a6f0725
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834)
Kaiyang-Chen Jun 3, 2024
198d784
New CI template on AWS stack (#5110)
khluu Jun 3, 2024
1923dcb
[FRONTEND] OpenAI `tools` support named functions (#5032)
br3no Jun 3, 2024
fa0bba2
[Bugfix] Support `prompt_logprobs==0` (#5217)
toslunar Jun 4, 2024
d8b71e3
[Bugfix] Add warmup for prefix caching example (#5235)
zhuohan123 Jun 4, 2024
1d88071
[Kernel] Enhance MoE benchmarking & tuning script (#4921)
WoosukKwon Jun 4, 2024
7899055
[Bugfix]: During testing, use pytest monkeypatch for safely overridin…
afeldman-nm Jun 4, 2024
0e8a84d
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecu…
zifeitong Jun 4, 2024
88368d3
[CI/Build] Add inputs tests (#5215)
DarkLight1337 Jun 4, 2024
756340a
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU b…
DamonFool Jun 4, 2024
789553f
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
WoosukKwon Jun 4, 2024
c57b71e
[CI/Build] Simplify model loading for `HfRunner` (#5251)
DarkLight1337 Jun 4, 2024
14ec8df
[CI/Build] Reducing CPU CI execution time (#5241)
bigPYJ1151 Jun 4, 2024
3b6f9d6
[CI] mark AMD test as softfail to prevent blockage (#5256)
simon-mo Jun 4, 2024
06bcc97
[Misc] Add transformers version to collect_env.py (#5259)
mgoin Jun 4, 2024
c3a46dd
[Misc] update collect env (#5261)
youkaichao Jun 4, 2024
c6bcf66
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to…
zifeitong Jun 5, 2024
f5d9197
[Misc] Add CustomOp interface for device portability (#5255)
WoosukKwon Jun 5, 2024
bbfee0c
[Misc] Fix docstring of get_attn_backend (#5271)
WoosukKwon Jun 5, 2024
47c1256
[Frontend] OpenAI API server: Add `add_special_tokens` to ChatComplet…
tomeras91 Jun 5, 2024
d619bd9
[CI] Add nightly benchmarks (#5260)
simon-mo Jun 5, 2024
2cf5911
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results…
tlrmchlsmth Jun 5, 2024
8f5fafa
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to r…
tlrmchlsmth Jun 5, 2024
0770930
[Model] Correct Mixtral FP8 checkpoint loading (#5231)
comaniac Jun 5, 2024
8310e34
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#…
DriverSong Jun 5, 2024
6e32dd4
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238)
pcmoritz Jun 5, 2024
c2c62c8
[Docs] Add Sequoia as sponsors (#5287)
simon-mo Jun 5, 2024
ee3104b
[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252)
njhill Jun 5, 2024
1680d99
[BugFix] Fix log message about default max model length (#5284)
njhill Jun 5, 2024
efb32e1
[Bugfix] Make EngineArgs use named arguments for config construction …
mgoin Jun 5, 2024
9a28c64
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine grace…
wuisawesome Jun 5, 2024
2b27f72
[Misc] Skip for logits_scale == 1.0 (#5291)
WoosukKwon Jun 5, 2024
54d2690
[Docs] Add Ray Summit CFP (#5295)
simon-mo Jun 5, 2024
cc2aaba
[CI] Disable flash_attn backend for spec decode (#5286)
simon-mo Jun 5, 2024
d72ae5b
[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#…
br3no Jun 5, 2024
08fd788
[CI/Build] Update vision tests (#5307)
DarkLight1337 Jun 6, 2024
cbfd3d9
Bugfix: fix broken of download models from modelscope (#5233)
liuyhwangyh Jun 6, 2024
7bb7e9b
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294)
pcmoritz Jun 6, 2024
fbd60f3
[Frontend] enable passing multiple LoRA adapters at once to generate(…
mgoldey Jun 6, 2024
14a49c2
[Core] Avoid copying prompt/output tokens if no penalties are used (#…
Yard1 Jun 7, 2024
a60515d
[Core] Change LoRA embedding sharding to support loading methods (#5038)
Yard1 Jun 7, 2024
653a080
[Misc] Missing error message for custom ops import (#5282)
DamonFool Jun 7, 2024
219a385
[Feature][Frontend]: Add support for `stream_options` in `ChatComplet…
Etelis Jun 7, 2024
bd66622
[Misc][Utils] allow get_open_port to be called for multiple times (#5…
youkaichao Jun 7, 2024
ed99ec9
[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)
tlrmchlsmth Jun 7, 2024
50520b4
Remove Ray health check (#4693)
Yard1 Jun 7, 2024
98744f9
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#…
JamesLim-sy Jun 7, 2024
334e0a7
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
dsikka Jun 7, 2024
17984a7
[Frontend] Add OpenAI Vision API Support (#5237)
ywang96 Jun 7, 2024
3da0119
[Misc] Remove unused cuda_utils.h in CPU backend (#5345)
DamonFool Jun 7, 2024
d65c3ab
fix DbrxFusedNormAttention missing cache_config (#5340)
Calvinnncy97 Jun 7, 2024
e349c2d
[Bug Fix] Fix the support check for FP8 CUTLASS (#5352)
cli99 Jun 8, 2024
4d5b699
[Misc] Add args for selecting distributed executor to benchmarks (#5335)
BKitor Jun 8, 2024
f12b636
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965)
hongxiayang Jun 8, 2024
842974c
[CI/Test] improve robustness of test (hf_runner) (#5347)
youkaichao Jun 8, 2024
2a16c03
[CI/Test] improve robustness of test (vllm_runner) (#5357)
youkaichao Jun 8, 2024
f8fe956
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input…
mgoin Jun 8, 2024
550ed83
[Core][CUDA Graph] add output buffer for cudagraph (#5074)
youkaichao Jun 9, 2024
52a90dd
[mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361)
youkaichao Jun 9, 2024
d20586a
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custo…
bnellnm Jun 9, 2024
27e68e9
[Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164)
BlackBird-Coding Jun 9, 2024
8f865f6
[Misc] Update to comply with the new `compressed-tensors` config (#5350)
dsikka Jun 10, 2024
d3bd135
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API S…
ywang96 Jun 10, 2024
b21be06
[misc][typo] fix typo (#5372)
youkaichao Jun 10, 2024
1b41d11
[Misc] Improve error message when LoRA parsing fails (#5194)
DarkLight1337 Jun 10, 2024
f932e32
[Model] Initial support for LLaVA-NeXT (#4199)
DarkLight1337 Jun 10, 2024
e3f0b32
[Feature][Frontend]: Continued `stream_options` implementation also …
Etelis Jun 10, 2024
f8392d6
[Bugfix] Fix LLaVA-NeXT (#5380)
DarkLight1337 Jun 10, 2024
9d82433
[ci] Use small_cpu_queue for doc build (#5331)
khluu Jun 10, 2024
a9bd95b
[ci] Mount buildkite agent on Docker container to upload benchmark re…
khluu Jun 10, 2024
6823d9e
[Docs] Add Docs on Limitations of VLM Support (#5383)
ywang96 Jun 10, 2024
ca0ae3c
[Docs] Alphabetically sort sponsors (#5386)
WoosukKwon Jun 10, 2024
16be761
Bump version to v0.5.0 (#5384)
simon-mo Jun 10, 2024
1444822
format
Jun 11, 2024
2df326f
updated test model logprobs
Jun 11, 2024
446a144
format
Jun 11, 2024
[BugFix] Prevent `LLM.encode` for non-generation Models (vllm-project#5184)

Co-authored-by: mgoin <michael@neuralmagic.com>
2 people authored and Robert Shaw committed Jun 11, 2024
commit 5b6b8ed2123049c568b743fb1ed7a441cba1e759
vllm/entrypoints/llm.py (10 additions, 0 deletions)
@@ -285,6 +285,11 @@ def generate(
             considered legacy and may be deprecated in the future. You should
             instead pass them via the ``inputs`` parameter.
         """
+        if self.llm_engine.model_config.embedding_mode:
+            raise ValueError(
+                "LLM.generate() is only supported for generation models "
+                "(XForCausalLM).")
+
         if prompt_token_ids is not None or multi_modal_data is not None:
             inputs = self._convert_v1_inputs(
                 prompts=cast(Optional[Union[str, List[str]]], prompts),
@@ -429,6 +434,11 @@ def encode(
             considered legacy and may be deprecated in the future. You should
             instead pass them via the ``inputs`` parameter.
         """
+        if not self.llm_engine.model_config.embedding_mode:
+            raise ValueError(
+                "LLM.encode() is only supported for embedding models (XModel)."
+            )
+
         if prompt_token_ids is not None or multi_modal_data is not None:
             inputs = self._convert_v1_inputs(
                 prompts=cast(Optional[Union[str, List[str]]], prompts),
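
For reference, a minimal usage sketch of the guards this commit introduces. The model name and prompt below are illustrative placeholders, not part of the diff, and the sketch assumes the vLLM v0.5.0-era `LLM` API:

    from vllm import LLM

    # Generation model (placeholder name): generate() works as before,
    # while encode() now fails fast instead of producing bogus output.
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(["Hello, my name is"])  # OK

    try:
        llm.encode(["Hello, my name is"])
    except ValueError as err:
        # "LLM.encode() is only supported for embedding models (XModel)."
        print(err)

Conversely, an `LLM` constructed with an embedding model (`embedding_mode` set) would reject `generate()` with the matching error message.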