sync release with main @ v0.5.0.post1-99-g8720c92e #63

Merged
772 commits
a22dea5
[Model] Support MAP-NEO model (#5081)
xingweiqu May 31, 2024
e9d3aa0
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using th…
simon-mo May 31, 2024
a377f0b
[Misc]: optimize eager mode host time (#4196)
FuncSherl May 31, 2024
e9899fb
[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039)
comaniac May 31, 2024
6575791
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171)
njhill Jun 1, 2024
1197e02
[Build] Guard against older CUDA versions when building CUTLASS 3.x k…
tlrmchlsmth Jun 1, 2024
a360ff8
[CI/Build] CMakeLists: build all extensions' cmake targets at the sam…
dtrifiro Jun 1, 2024
260d119
[Kernel] Refactor CUTLASS kernels to always take scales that reside o…
tlrmchlsmth Jun 1, 2024
f081c3c
[Kernel] Update Cutlass fp8 configs (#5144)
varun-sundar-rabindranath Jun 1, 2024
c354072
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> sav…
dashanji Jun 1, 2024
37464a0
[Bugfix] Fix call to init_logger in openai server (#4765)
NadavShmayo Jun 1, 2024
b9c0605
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)
chenqianfzh Jun 1, 2024
8279078
[Bugfix] Remove deprecated @abstractproperty (#5174)
zhuohan123 Jun 1, 2024
c2d6d2f
[Bugfix]: Fix issues related to prefix caching example (#5177) (#5180)
Delviet Jun 1, 2024
044793d
[BugFix] Prevent `LLM.encode` for non-generation Models (#5184)
robertgshaw2-neuralmagic Jun 1, 2024
ed59a7e
Update test_ignore_eos (#4898)
simon-mo Jun 2, 2024
f790ad3
[Frontend][OpenAI] Support for returning max_model_len on /v1/models …
Avinash-Raj Jun 2, 2024
a66cf40
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#…
divakar-amd Jun 2, 2024
dfbe60d
[Misc] Simplify code and fix type annotations in `conftest.py` (#5118)
DarkLight1337 Jun 2, 2024
7a64d24
[Core] Support image processor (#4197)
DarkLight1337 Jun 3, 2024
0ab278c
[Core] Remove unnecessary copies in flash attn backend (#5138)
Yard1 Jun 3, 2024
cbb2f59
[Kernel] Pass a device pointer into the quantize kernel for the scale…
tlrmchlsmth Jun 3, 2024
cafb8e0
[CI/BUILD] enable intel queue for longer CPU tests (#4113)
zhouyuan Jun 3, 2024
10c38e3
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834)
Kaiyang-Chen Jun 3, 2024
4f0d17c
New CI template on AWS stack (#5110)
khluu Jun 3, 2024
f775a07
[FRONTEND] OpenAI `tools` support named functions (#5032)
br3no Jun 3, 2024
06b2550
[Bugfix] Support `prompt_logprobs==0` (#5217)
toslunar Jun 4, 2024
bd0e780
[Bugfix] Add warmup for prefix caching example (#5235)
zhuohan123 Jun 4, 2024
3a434b0
[Kernel] Enhance MoE benchmarking & tuning script (#4921)
WoosukKwon Jun 4, 2024
f42a006
[Bugfix]: During testing, use pytest monkeypatch for safely overridin…
afeldman-nm Jun 4, 2024
a58f24e
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecu…
zifeitong Jun 4, 2024
ec784b2
[CI/Build] Add inputs tests (#5215)
DarkLight1337 Jun 4, 2024
87d5abe
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU b…
DamonFool Jun 4, 2024
27208be
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
WoosukKwon Jun 4, 2024
9ba093b
[CI/Build] Simplify model loading for `HfRunner` (#5251)
DarkLight1337 Jun 4, 2024
45c35f0
[CI/Build] Reducing CPU CI execution time (#5241)
bigPYJ1151 Jun 4, 2024
9ca62d8
[CI] mark AMD test as softfail to prevent blockage (#5256)
simon-mo Jun 4, 2024
650a4cc
[Misc] Add transformers version to collect_env.py (#5259)
mgoin Jun 4, 2024
fee4dcc
[Misc] update collect env (#5261)
youkaichao Jun 4, 2024
974fc9b
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to…
zifeitong Jun 5, 2024
41ca62c
[Misc] Add CustomOp interface for device portability (#5255)
WoosukKwon Jun 5, 2024
c65146e
[Misc] Fix docstring of get_attn_backend (#5271)
WoosukKwon Jun 5, 2024
f0a5005
[Frontend] OpenAI API server: Add `add_special_tokens` to ChatComplet…
tomeras91 Jun 5, 2024
d5b1eb0
[CI] Add nightly benchmarks (#5260)
simon-mo Jun 5, 2024
02cc3b5
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results…
tlrmchlsmth Jun 5, 2024
ccd4f12
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to r…
tlrmchlsmth Jun 5, 2024
5563a4d
[Model] Correct Mixtral FP8 checkpoint loading (#5231)
comaniac Jun 5, 2024
eb8fcd2
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#…
DriverSong Jun 5, 2024
51a08e7
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238)
pcmoritz Jun 5, 2024
f270a39
[Docs] Add Sequoia as sponsors (#5287)
simon-mo Jun 5, 2024
faf71bc
[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252)
njhill Jun 5, 2024
3d33e37
[BugFix] Fix log message about default max model length (#5284)
njhill Jun 5, 2024
065aff6
[Bugfix] Make EngineArgs use named arguments for config construction …
mgoin Jun 5, 2024
0f83ddd
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine grace…
wuisawesome Jun 5, 2024
6a7c771
[Misc] Skip for logits_scale == 1.0 (#5291)
WoosukKwon Jun 5, 2024
8f1729b
[Docs] Add Ray Summit CFP (#5295)
simon-mo Jun 5, 2024
3a6ae1d
[CI] Disable flash_attn backend for spec decode (#5286)
simon-mo Jun 5, 2024
7b0a0df
[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#…
br3no Jun 5, 2024
89c9207
[CI/Build] Update vision tests (#5307)
DarkLight1337 Jun 6, 2024
4efff03
Bugfix: fix broken of download models from modelscope (#5233)
liuyhwangyh Jun 6, 2024
abe855d
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294)
pcmoritz Jun 6, 2024
828da0d
[Frontend] enable passing multiple LoRA adapters at once to generate(…
mgoldey Jun 6, 2024
a31cab7
[Core] Avoid copying prompt/output tokens if no penalties are used (#…
Yard1 Jun 7, 2024
ccdc490
[Core] Change LoRA embedding sharding to support loading methods (#5038)
Yard1 Jun 7, 2024
1506374
[Misc] Missing error message for custom ops import (#5282)
DamonFool Jun 7, 2024
baa15a9
[Feature][Frontend]: Add support for `stream_options` in `ChatComplet…
Etelis Jun 7, 2024
388596c
[Misc][Utils] allow get_open_port to be called for multiple times (#5…
youkaichao Jun 7, 2024
8d75fe4
[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)
tlrmchlsmth Jun 7, 2024
18a277b
Remove Ray health check (#4693)
Yard1 Jun 7, 2024
dc49fb8
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#…
JamesLim-sy Jun 7, 2024
ca3ea51
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
dsikka Jun 7, 2024
7a9cb29
[Frontend] Add OpenAI Vision API Support (#5237)
ywang96 Jun 7, 2024
6840a71
[Misc] Remove unused cuda_utils.h in CPU backend (#5345)
DamonFool Jun 7, 2024
767c727
fix DbrxFusedNormAttention missing cache_config (#5340)
Calvinnncy97 Jun 7, 2024
e69ded7
[Bug Fix] Fix the support check for FP8 CUTLASS (#5352)
cli99 Jun 8, 2024
b3376e5
[Misc] Add args for selecting distributed executor to benchmarks (#5335)
BKitor Jun 8, 2024
c96fc06
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965)
hongxiayang Jun 8, 2024
9fb900f
[CI/Test] improve robustness of test (hf_runner) (#5347)
youkaichao Jun 8, 2024
8ea5e44
[CI/Test] improve robustness of test (vllm_runner) (#5357)
youkaichao Jun 8, 2024
c09dade
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input…
mgoin Jun 8, 2024
0373e18
[Core][CUDA Graph] add output buffer for cudagraph (#5074)
youkaichao Jun 9, 2024
5d7e3d0
[mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361)
youkaichao Jun 9, 2024
5467ac3
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custo…
bnellnm Jun 9, 2024
45f92c0
[Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164)
BlackBird-Coding Jun 9, 2024
5884c2b
[Misc] Update to comply with the new `compressed-tensors` config (#5350)
dsikka Jun 10, 2024
68bc817
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API S…
ywang96 Jun 10, 2024
c81da5f
[misc][typo] fix typo (#5372)
youkaichao Jun 10, 2024
0bfa1c4
[Misc] Improve error message when LoRA parsing fails (#5194)
DarkLight1337 Jun 10, 2024
6b29d6f
[Model] Initial support for LLaVA-NeXT (#4199)
DarkLight1337 Jun 10, 2024
774d103
[Feature][Frontend]: Continued `stream_options` implementation also …
Etelis Jun 10, 2024
2c0d933
[Bugfix] Fix LLaVA-NeXT (#5380)
DarkLight1337 Jun 10, 2024
f7f9c5f
[ci] Use small_cpu_queue for doc build (#5331)
khluu Jun 10, 2024
c5602f0
[ci] Mount buildkite agent on Docker container to upload benchmark re…
khluu Jun 10, 2024
856c990
[Docs] Add Docs on Limitations of VLM Support (#5383)
ywang96 Jun 10, 2024
cb77ad8
[Docs] Alphabetically sort sponsors (#5386)
WoosukKwon Jun 10, 2024
114332b
Bump version to v0.5.0 (#5384)
simon-mo Jun 10, 2024
77c87be
[Doc] Add documentation for FP8 W8A8 (#5388)
mgoin Jun 11, 2024
76477a9
[ci] Fix Buildkite agent path (#5392)
khluu Jun 11, 2024
a008629
[Misc] Various simplifications and typing fixes (#5368)
njhill Jun 11, 2024
351d5e7
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defi…
maor-ps Jun 11, 2024
640052b
[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026)
DarkLight1337 Jun 11, 2024
d8f31f2
[Doc] add debugging tips (#5409)
youkaichao Jun 11, 2024
3c4cebf
[Doc][Typo] Fixing Missing Comma (#5403)
ywang96 Jun 11, 2024
8bab495
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389)
WoosukKwon Jun 11, 2024
246598a
[CI] docfix (#5410)
rkooo567 Jun 11, 2024
4c2ffb2
[Speculative decoding] Initial spec decode docs (#5400)
cadedaniel Jun 11, 2024
9fde251
[Doc] Add an automatic prefix caching section in vllm documentation (…
KuntaiDu Jun 11, 2024
89ec06c
[Docs] [Spec decode] Fix docs error in code example (#5427)
cadedaniel Jun 11, 2024
2e02311
[Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_siz…
jsato8094 Jun 11, 2024
00e6a2d
[Bugfix] fix lora_dtype value type in arg_utils.py (#5398)
c3-ali Jun 11, 2024
dcbf428
[Frontend] Customizable RoPE theta (#5197)
sasha0552 Jun 11, 2024
c4bd03c
[Core][Distributed] add same-node detection (#5369)
youkaichao Jun 11, 2024
99dac09
[Core][Doc] Default to multiprocessing for single-node distributed ca…
njhill Jun 11, 2024
8f89d72
[Doc] add common case for long waiting time (#5430)
youkaichao Jun 11, 2024
3dd6853
[CI/Build] Add `is_quant_method_supported` to control quantization te…
mgoin Jun 12, 2024
e3c12bf
Revert "[CI/Build] Add `is_quant_method_supported` to control quantiz…
simon-mo Jun 12, 2024
847cdcc
[CI] Upgrade codespell version. (#5381)
rkooo567 Jun 12, 2024
1a8bfd9
[Hardware] Initial TPU integration (#5292)
WoosukKwon Jun 12, 2024
c3c2903
[Bugfix] Add device assertion to TorchSDPA (#5402)
bigPYJ1151 Jun 12, 2024
8b82a89
[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default sof…
khluu Jun 12, 2024
5985e34
[Kernel] Vectorized FP8 quantize kernel (#5396)
comaniac Jun 12, 2024
5cc50a5
[Bugfix] TYPE_CHECKING for MultiModalData (#5444)
kimdwkimdw Jun 12, 2024
51602ee
[Frontend] [Core] Support for sharded tensorized models (#4990)
tjohnson31415 Jun 12, 2024
622d451
[misc] add hint for AttributeError (#5462)
youkaichao Jun 12, 2024
b8d4dff
[Doc] Update debug docs (#5438)
DarkLight1337 Jun 12, 2024
94a07bb
[Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470)
mgoin Jun 12, 2024
7d19de2
[Frontend] Add "input speed" to tqdm postfix alongside output speed (…
mgoin Jun 12, 2024
2135cac
[Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451)
Isotr0py Jun 12, 2024
ea3890a
[Core][Distributed] code deduplication in tp&pp with coordinator(#5293)
youkaichao Jun 13, 2024
916d219
[ci] Use sccache to build images (#5419)
khluu Jun 13, 2024
8840753
[Bugfix]if the content is started with ":"(response of ping), client …
sywangyi Jun 13, 2024
c2637a6
[Kernel] `w4a16` support for `compressed-tensors` (#5385)
dsikka Jun 13, 2024
23ec72f
[CI/Build][REDO] Add is_quant_method_supported to control quantizatio…
mgoin Jun 13, 2024
bd43973
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497)
wenyujin333 Jun 13, 2024
80aa7e9
[Hardware][Intel] Optimize CPU backend and add more performance tips …
bigPYJ1151 Jun 13, 2024
a65634d
[Docs] Add 4th meetup slides (#5509)
WoosukKwon Jun 13, 2024
03dccc8
[Misc] Add vLLM version getter to utils (#5098)
DarkLight1337 Jun 13, 2024
3987347
[CI/Build] Simplify OpenAI server setup in tests (#5100)
DarkLight1337 Jun 13, 2024
0ce7b95
[Doc] Update LLaVA docs (#5437)
DarkLight1337 Jun 13, 2024
85657b5
[Kernel] Factor out epilogues from cutlass kernels (#5391)
tlrmchlsmth Jun 13, 2024
30299a4
[MISC] Remove FP8 warning (#5472)
comaniac Jun 13, 2024
a8fda4f
Seperate dev requirements into lint and test (#5474)
Yard1 Jun 13, 2024
6b0511a
Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478)
Yard1 Jun 13, 2024
1696efe
[misc] fix format.sh (#5511)
youkaichao Jun 13, 2024
33e3b37
[CI/Build] Disable test_fp8.py (#5508)
tlrmchlsmth Jun 13, 2024
e38042d
[Kernel] Disable CUTLASS kernels for fp8 (#5505)
tlrmchlsmth Jun 13, 2024
50eed24
Add `cuda_device_count_stateless` (#5473)
Yard1 Jun 13, 2024
cd9c0d6
[Hardware][Intel] Support CPU inference with AVX2 ISA (#5452)
DamonFool Jun 13, 2024
55d6361
[Misc] Fix arg names in quantizer script (#5507)
AllenDou Jun 14, 2024
0f0d8bc
bump version to v0.5.0.post1 (#5522)
simon-mo Jun 14, 2024
319ad7f
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs…
KuntaiDu Jun 14, 2024
d47af2b
[CI/Build] Disable LLaVA-NeXT CPU test (#5529)
DarkLight1337 Jun 14, 2024
703475f
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516)
tlrmchlsmth Jun 14, 2024
d74674b
[Misc] Fix arg names (#5524)
AllenDou Jun 14, 2024
1598568
[ Misc ] Rs/compressed tensors cleanup (#5432)
robertgshaw2-neuralmagic Jun 14, 2024
348616a
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401)
tlrmchlsmth Jun 14, 2024
48f589e
[mis] fix flaky test of test_cuda_device_count_stateless (#5546)
youkaichao Jun 14, 2024
77490c6
[Core] Remove duplicate processing in async engine (#5525)
DarkLight1337 Jun 14, 2024
d1c3d7d
[misc][distributed] fix benign error in `is_in_the_same_node` (#5512)
youkaichao Jun 14, 2024
cdab68d
[Docs] Add ZhenFund as a Sponsor (#5548)
simon-mo Jun 14, 2024
6e2527a
[Doc] Update documentation on Tensorizer (#5471)
sangstar Jun 14, 2024
e2afb03
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460)
tdoublep Jun 14, 2024
28c145e
[Bugfix] Fix typo in Pallas backend (#5558)
WoosukKwon Jun 14, 2024
f5bb85b
[Core][Distributed] improve p2p cache generation (#5528)
youkaichao Jun 14, 2024
bd7efe9
Add ccache to amd (#5555)
simon-mo Jun 15, 2024
1b8a0d7
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
leiwen83 Jun 15, 2024
0e9164b
[mypy] Enable type checking for test directory (#5017)
DarkLight1337 Jun 15, 2024
81fbb36
[CI/Build] Test both text and token IDs in batched OpenAI Completions…
DarkLight1337 Jun 15, 2024
e691918
[misc] Do not allow to use lora with chunked prefill. (#5538)
rkooo567 Jun 15, 2024
d919ecc
add gptq_marlin test for bug report https://github.com/vllm-project/v…
alexm-neuralmagic Jun 15, 2024
1c0afa1
[BugFix] Don't start a Ray cluster when not using Ray (#5570)
njhill Jun 15, 2024
3ce2c05
[Fix] Correct OpenAI batch response format (#5554)
zifeitong Jun 15, 2024
f31c1f9
Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518)
Yard1 Jun 16, 2024
4a67690
[CI][BugFix] Flip is_quant_method_supported condition (#5577)
mgoin Jun 16, 2024
f07d513
[build][misc] limit numpy version (#5582)
youkaichao Jun 16, 2024
845a3f2
[Doc] add debugging tips for crash and multi-node debugging (#5581)
youkaichao Jun 17, 2024
e2b85cf
Fix w8a8 benchmark and add Llama-3-8B (#5562)
comaniac Jun 17, 2024
9333fb8
[Model] Rename Phi3 rope scaling type (#5595)
garg-amit Jun 17, 2024
9e74d9d
Correct alignment in the seq_len diagram. (#5592)
CharlesRiggins Jun 17, 2024
890d8d9
[Kernel] `compressed-tensors` marlin 24 support (#5435)
dsikka Jun 17, 2024
1f12122
[Misc] use AutoTokenizer for benchmark serving when vLLM not installe…
zhyncs Jun 17, 2024
728c4c8
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
jikunshang Jun 17, 2024
ab66536
[CI/BUILD] Support non-AVX512 vLLM building and testing (#5574)
DamonFool Jun 17, 2024
9e4e6fe
[CI] the readability of benchmarking and prepare for dashboard (#5571)
KuntaiDu Jun 17, 2024
1b44aaf
[bugfix][distributed] fix 16 gpus local rank arrangement (#5604)
youkaichao Jun 17, 2024
e441bad
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584)
youkaichao Jun 17, 2024
a3e8a05
[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142)
bfontain Jun 17, 2024
26e1188
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606)
zifeitong Jun 17, 2024
fa9e385
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of…
sroy745 Jun 18, 2024
daef218
[Model] Initialize Phi-3-vision support (#4986)
Isotr0py Jun 18, 2024
5002175
[Kernel] Add punica dimensions for Granite 13b (#5559)
joerunde Jun 18, 2024
8eadcf0
[misc][typo] fix typo (#5620)
youkaichao Jun 18, 2024
32c86e4
[Misc] Fix typo (#5618)
DarkLight1337 Jun 18, 2024
114d727
[CI] Avoid naming different metrics with the same name in performance…
KuntaiDu Jun 18, 2024
db5ec52
[bugfix][distributed] improve p2p capability test (#5612)
youkaichao Jun 18, 2024
f0cc0e6
[Misc] Remove import from transformers logging (#5625)
CatherineSue Jun 18, 2024
4ad7b53
[CI/Build][Misc] Update Pytest Marker for VLMs (#5623)
ywang96 Jun 18, 2024
13db436
[ci] Deprecate original CI template (#5624)
khluu Jun 18, 2024
7879f24
[Misc] Add OpenTelemetry support (#4687)
ronensc Jun 18, 2024
95db455
[Misc] Add channel-wise quantization support for w8a8 dynamic per tok…
dsikka Jun 18, 2024
19091ef
[ci] Setup Release pipeline and build release wheels with cache (#5610)
khluu Jun 18, 2024
07feecd
[Model] LoRA support added for command-r (#5178)
sergey-tinkoff Jun 18, 2024
8a17338
[Bugfix] Fix for inconsistent behaviour related to sampling and repet…
tdoublep Jun 18, 2024
2bd231a
[Doc] Added cerebrium as Integration option (#5553)
milo157 Jun 18, 2024
b23ce92
[Bugfix] Fix CUDA version check for mma warning suppression (#5642)
tlrmchlsmth Jun 18, 2024
6820724
[Bugfix] Fix w8a8 benchmarks for int8 case (#5643)
tlrmchlsmth Jun 19, 2024
59a1eb5
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628)
ShukantPal Jun 19, 2024
e5150f2
[Bugfix] Added test for sampling repetition penalty bug. (#5659)
tdoublep Jun 19, 2024
f758aed
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate…
hongxiayang Jun 19, 2024
3eea748
[misc][distributed] use 127.0.0.1 for single-node (#5619)
youkaichao Jun 19, 2024
da971ec
[Model] Add FP8 kv cache for Qwen2 (#5656)
mgoin Jun 19, 2024
7d46c8d
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684)
Isotr0py Jun 19, 2024
d871453
[Misc]Add param max-model-len in benchmark_latency.py (#5629)
DearPlanet Jun 19, 2024
e9c2732
[CI/Build] Add tqdm to dependencies (#5680)
DarkLight1337 Jun 19, 2024
3ee5c4b
[ci] Add A100 queue into AWS CI template (#5648)
khluu Jun 19, 2024
afed90a
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg…
mgoin Jun 19, 2024
d571ca0
[ci][distributed] add tests for custom allreduce (#5689)
youkaichao Jun 19, 2024
7868750
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)
zifeitong Jun 19, 2024
e83db9e
[Doc] Update docker references (#5614)
rafvasq Jun 19, 2024
4a30d7e
[Misc] Add per channel support for static activation quantization; up…
dsikka Jun 19, 2024
949e49a
[ci] Limit num gpus if specified for A100 (#5694)
khluu Jun 19, 2024
3730a1c
[Misc] Improve conftest (#5681)
DarkLight1337 Jun 20, 2024
1b2eaac
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703)
ywang96 Jun 20, 2024
111af1f
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
varun-sundar-rabindranath Jun 20, 2024
ad137cd
[Model] Port over CLIPVisionModel for VLMs (#5591)
ywang96 Jun 20, 2024
a7dcc62
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
varun-sundar-rabindranath Jun 20, 2024
3f3b6b2
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS ke…
tlrmchlsmth Jun 20, 2024
8065a7e
[Frontend] Add FlexibleArgumentParser to support both underscore and …
mgoin Jun 20, 2024
6c5b7af
[distributed][misc] use fork by default for mp (#5669)
youkaichao Jun 21, 2024
b12518d
[Model] MLPSpeculator speculative decoding support (#4947)
JRosenkranz Jun 21, 2024
1f56742
[Kernel] Add punica dimension for Qwen2 LoRA (#5441)
jinzhen-lin Jun 21, 2024
c35e4a3
[BugFix] Fix test_phi3v.py (#5725)
CatherineSue Jun 21, 2024
67005a0
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
jeejeelee Jun 21, 2024
d9a252b
[Core][Distributed] add shm broadcast (#5399)
youkaichao Jun 21, 2024
bd620b0
[Kernel][CPU] Add Quick `gelu` to CPU (#5717)
ywang96 Jun 21, 2024
8b8fed5
chore: add fork OWNERS
z103cb Apr 30, 2024
5048126
add ubi Dockerfile
dtrifiro May 21, 2024
0264b36
Dockerfile.ubi: remove references to grpc/protos
dtrifiro May 21, 2024
a5047d8
Dockerfile.ubi: use vllm-tgis-adapter
dtrifiro May 28, 2024
955598d
gha: add sync workflow
dtrifiro Jun 3, 2024
119767e
Dockerfile.ubi: use distributed-executor-backend=mp as default
dtrifiro Jun 10, 2024
a82fb14
Dockerfile.ubi: remove vllm-nccl workaround
dtrifiro Jun 13, 2024
cc5d64a
Dockerfile.ubi: add missing requirements-*.txt bind mounts
dtrifiro Jun 18, 2024
f9f3bc7
add triton CustomCacheManger
tdoublep May 29, 2024
40ae5b9
gha: sync-with-upstream workflow create PRs as draft
dtrifiro Jun 19, 2024
5c44d84
add smoke/unit tests scripts
dtrifiro Jun 19, 2024
f722b3e
extras: exit unit tests on err
dtrifiro Jun 20, 2024
5b66f1e
Dockerfile.ubi: misc improvements
dtrifiro May 28, 2024
8720c92
update OWNERS
dtrifiro Jun 21, 2024
eeb6f33
sync release with main @ 8720c92e
dtrifiro Jun 21, 2024
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
import os
import zipfile

-MAX_SIZE_MB = 150
+MAX_SIZE_MB = 200


def print_top_10_largest_files(zip_file):
103 changes: 103 additions & 0 deletions .buildkite/nightly-benchmarks/README.md
@@ -0,0 +1,103 @@
# vLLM benchmark suite

## Introduction

This directory contains the performance benchmarking CI for vLLM.
The goal is to help developers understand the impact of their PRs on vLLM's performance.

This benchmark will be *triggered* upon:
- A PR being merged into vLLM.
- Every commit for PRs with the `perf-benchmarks` label.

**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (support for more GPUs is coming later), with different models.

**Benchmarking Duration**: about 1hr.

**For benchmarking developers**: please try to keep the benchmarking duration under 1.5 hr so that the benchmark does not take too long to run.


## Configuring the workload

The benchmarking workload contains three parts:
- Latency tests in `latency-tests.json`.
- Throughput tests in `throughput-tests.json`.
- Serving tests in `serving-tests.json`.

See [descriptions.md](tests/descriptions.md) for detailed descriptions.

### Latency test

Here is an example of one test inside `latency-tests.json`:

```json
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  },
]
```

In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments used for `benchmark_latency.py`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-benchmarks-suite.sh` converts the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script saves the json results by itself, so please do not set the `--output-json` parameter in the json file.
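
For illustration, here is a minimal sketch of how a `parameters` block can be expanded into command line arguments, assuming the conversion is a plain underscore-to-dash substitution as described above. The `params_to_cli` helper is hypothetical and not part of the suite; the real expansion is performed by `run-benchmarks-suite.sh`.

```python
# Minimal sketch: expand a "parameters" dict from latency-tests.json into CLI
# arguments for benchmark_latency.py, assuming a plain underscore-to-dash
# substitution. Hypothetical helper, not part of the benchmark suite.
import json
import shlex


def params_to_cli(parameters: dict) -> str:
    args = []
    for key, value in parameters.items():
        args.append("--" + key.replace("_", "-"))
        args.append(str(value))
    return " ".join(shlex.quote(arg) for arg in args)


test = json.loads("""
{
  "test_name": "latency_llama8B_tp1",
  "parameters": {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15
  }
}
""")

print("python3 benchmark_latency.py " + params_to_cli(test["parameters"]))
# python3 benchmark_latency.py --model meta-llama/Meta-Llama-3-8B \
#   --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15
```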


### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed to `benchmark_throughput.py`.

The numbers from this test are also stable, so even a slight change in the reported value can indicate a real difference in performance.

### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
  {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "swap_space": 16,
      "disable_log_stats": "",
      "disable_log_requests": "",
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  },
]
```

Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server_parameters` attribute includes the command line arguments for the vLLM server.
- The `client_parameters` attribute includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` attribute controls the list of QPS values to test. It is used to set the `--request-rate` parameter of `benchmark_serving.py` (see the sketch below).

The numbers from this test are less stable than those from the latency and throughput benchmarks (due to randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still indicates a meaningful difference in performance.

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
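
As a rough illustration of how `qps_list` fans out into client runs, here is a hypothetical sketch. It is not the actual logic of `run-benchmarks-suite.sh`, which also launches the vLLM server from `server_parameters` before running the clients.

```python
# Hypothetical sketch: one serving test expands into one benchmark_serving.py
# invocation per entry in qps_list, with --request-rate set accordingly.
def client_commands(test: dict) -> list:
    commands = []
    for qps in test["qps_list"]:
        args = ["python3", "benchmark_serving.py", "--request-rate", str(qps)]
        for key, value in test["client_parameters"].items():
            args.append("--" + key.replace("_", "-"))
            if value != "":  # an empty string would denote a bare flag
                args.append(str(value))
        commands.append(" ".join(args))
    return commands


test = {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "client_parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "backend": "vllm",
        "dataset_name": "sharegpt",
        "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
        "num_prompts": 200,
    },
}

for command in client_commands(test):
    print(command)
```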

## Visualizing the results
The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the resulting table on the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) is also attached to the markdown file.
The raw benchmarking results (as json files) are available in the `Artifacts` tab of the benchmarking job.
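
As a rough sketch of the idea behind that conversion (the file names, result fields, and the `{latency_tests}` placeholder below are hypothetical, not the actual behaviour of `convert-results-json-to-markdown.py`):

```python
# Hypothetical sketch: gather per-test JSON results and substitute a markdown
# table into the descriptions template. Paths and field names are assumptions.
import json
from pathlib import Path

rows = []
for result_file in sorted(Path("results").glob("latency_*.json")):
    data = json.loads(result_file.read_text())
    rows.append(f"| {data['test_name']} | {data['avg_latency']:.2f} |")

table = "\n".join(["| Test | Avg latency (s) |", "| --- | --- |"] + rows)
template = Path("tests/descriptions.md").read_text()
Path("benchmark_results.md").write_text(template.format(latency_tests=table))
```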
62 changes: 62 additions & 0 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -0,0 +1,62 @@
steps:
  - label: "Wait for container to be ready"
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            containers:
              - image: badouralix/curl-jq
                command:
                  - sh
                  - .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
  - wait
  - label: "A100 Benchmark"
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            priorityClassName: perf-benchmark
            containers:
              - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
                command:
                  - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
                resources:
                  limits:
                    nvidia.com/gpu: 8
                volumeMounts:
                  - name: devshm
                    mountPath: /dev/shm
                env:
                  - name: VLLM_USAGE_SOURCE
                    value: ci-test
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token-secret
                        key: token
            nodeSelector:
              nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
            volumes:
              - name: devshm
                emptyDir:
                  medium: Memory
  # - label: "H100: NVIDIA SMI"
  #   agents:
  #     queue: H100
  #   plugins:
  #     - docker#v5.11.0:
  #         image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
  #         command:
  #           - bash
  #           - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
  #         mount-buildkite-agent: true
  #         propagate-environment: true
  #         propagate-uid-gid: false
  #         ipc: host
  #         gpus: all
  #         environment:
  #           - VLLM_USAGE_SOURCE
  #           - HF_TOKEN

27 changes: 27 additions & 0 deletions .buildkite/nightly-benchmarks/kickoff-pipeline.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
#!/usr/bin/env bash

# NOTE(simon): this script runs inside a buildkite agent with CPU only access.
set -euo pipefail

# Install system packages
apt update
apt install -y curl jq

# Install minijinja for templating
curl -sSfL https://github.com/mitsuhiko/minijinja/releases/latest/download/minijinja-cli-installer.sh | sh
source $HOME/.cargo/env

# If BUILDKITE_PULL_REQUEST != "false", then we check the PR labels using curl and jq
if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then
  PR_LABELS=$(curl -s "https://api.github.com/repos/vllm-project/vllm/pulls/$BUILDKITE_PULL_REQUEST" | jq -r '.labels[].name')

  if [[ $PR_LABELS == *"perf-benchmarks"* ]]; then
    echo "This PR has the 'perf-benchmarks' label. Proceeding with the nightly benchmarks."
  else
    echo "This PR does not have the 'perf-benchmarks' label. Skipping the nightly benchmarks."
    exit 0
  fi
fi

# Upload the nightly benchmarks pipeline
buildkite-agent pipeline upload .buildkite/nightly-benchmarks/benchmark-pipeline.yaml