This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 05 19 #249

Merged 131 commits on Jun 3, 2024
1337ced
Disable cuda version check in vllm-openai image (#4530)
zhaoyang-star May 5, 2024
7d1afa9
[Bugfix] Fix `asyncio.Task` not being subscriptable (#4623)
DarkLight1337 May 6, 2024
76d1c0a
[CI] use ccache actions properly in release workflow (#4629)
simon-mo May 6, 2024
8c3136e
[CI] Add retry for agent lost (#4633)
cadedaniel May 6, 2024
5749888
Update lm-format-enforcer to 0.10.1 (#4631)
noamgat May 6, 2024
c6f73a2
[Kernel] Make static FP8 scaling more robust (#4570)
pcmoritz May 7, 2024
a542de1
[Core][Optimization] change python dict to pytorch tensor (#4607)
youkaichao May 7, 2024
a3ff2ae
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642)
Alexei-V-Ivanov-AMD May 7, 2024
e4ab5c6
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithL…
FurtherAI May 7, 2024
fd69572
[Core][Optimization] change copy-on-write from dict[int, list] to lis…
youkaichao May 7, 2024
8673ad0
[Bug fix][Core] fixup ngram not setup correctly (#4551)
leiwen83 May 7, 2024
3fc0fa0
[Core][Distributed] support cpu&device in broadcast tensor dict (#4660)
youkaichao May 8, 2024
43bc7e9
[Core] Optimize sampler get_logprobs (#4594)
rkooo567 May 8, 2024
01ad752
[Kernel] Make static FP8 scaling more robust (#4570)
rkooo567 May 8, 2024
f64e4e4
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi …
DefTruth May 8, 2024
e06c2d6
[Misc] Add `get_name` method to attention backends (#4685)
WoosukKwon May 8, 2024
01d4ceb
[Core] Faster startup for LoRA enabled models (#4634)
Yard1 May 8, 2024
8afd8f7
[Core][Optimization] change python dict to pytorch tensor for blocks …
youkaichao May 8, 2024
1fe8d9c
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
b5967c4
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
4a85263
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
edd9e90
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
32314e5
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-redhat May 9, 2024
b0d3937
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
294e480
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
04a0387
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
fff9c2c
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
396a546
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
0c85c21
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
631605d
[Bugfix] Update grafana.json (#4711)
robertgshaw2-redhat May 9, 2024
d824ab8
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
9b500f3
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
56c100c
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
0b429b8
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
ca3311a
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
4ea25ee
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
cd151e1
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
4b7644f
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
9aec672
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
65159a8
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
2fc4bb4
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
f739bdb
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
20b780a
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-redhat May 10, 2024
8a9d255
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
9132d19
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
18355a9
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
64367a0
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-redhat May 13, 2024
fa95832
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
b5c4711
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
a92b874
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
270c0c2
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
c944527
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
7dd2e73
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
61e2bde
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
00d6bd6
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
81c2c05
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
1d56497
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
2895ae9
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
a1f43a0
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
feed62d
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
31c1cd3
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
6838a99
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
f246252
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
bd73ad3
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
30e935f
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
e40b747
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
71c459f
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
e6bc337
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
1b50825
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
28f56b3
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
e88dd2b
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
3360031
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
0240ac9
Add marlin unit tests and marlin benchmark script (#4815)
alexm-redhat May 16, 2024
230af21
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
3426d29
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
de61ba7
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
28f605c
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
cf4926d
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-redhat May 16, 2024
40ce57a
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
273b3fe
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
1589d50
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
3ced8d0
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
1a745a3
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
7f372fb
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
69ac7b4
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
e4b31f6
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
3bf9ee0
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
f2b3686
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
3b9b8e5
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
96e8baa
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
7af0041
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
b1a73b5
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
3bbe65e
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
alexeykondrat May 18, 2024
670a8b8
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
c79bcb7
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
7b70de3
./format
May 19, 2024
1689026
added skips to lora long context
May 19, 2024
1e984b1
format
May 19, 2024
9aef71f
added missed files
May 19, 2024
774ba57
updates check_logprobs_close.py
May 19, 2024
ab7274f
fixed tensorizer
May 19, 2024
2c8f45a
skip mosaic in strict correctness test
May 19, 2024
85ec849
format
May 19, 2024
296861b
Merge branch 'main' into upstream-sync-2024-05-19
robertgshaw2-redhat May 22, 2024
688ef6f
skipped sharded state loader
May 22, 2024
4c437ba
Merge branch 'main' into upstream-sync-2024-05-19
robertgshaw2-redhat May 27, 2024
2059e61
skip shared state loader
May 27, 2024
9642aef
updated build test to use 4 nvcc threads by default. We previously, w…
May 28, 2024
2dad479
tweaked to fix benchmark
May 28, 2024
3bdfeb4
updated workflow to run longer
May 28, 2024
3800a1c
Merge branch 'main' into upstream-sync-2024-05-19
robertgshaw2-redhat May 28, 2024
f1199dc
updated skip lists to skip sharded state loader
May 29, 2024
ee7e65a
verified that test multiproc workers is passing locally
May 29, 2024
b73a142
fixed the sampling params issue
May 29, 2024
8225ddd
fixed other sampling_params issue
May 29, 2024
c386e32
Merge branch 'main' into upstream-sync-2024-05-19
May 29, 2024
098e08a
format
May 29, 2024
7d32b8a
confirmed basic correctness test working
May 30, 2024
748d0e1
updated score for marlin 2:4
May 30, 2024
cd648c6
Merge branch 'main' into upstream-sync-2024-05-19
May 30, 2024
9785c41
Disable flaky marlin model
dbarbuzzi May 30, 2024
3507552
Increase benchmark server timeout to 15 minutes
dbarbuzzi May 30, 2024
1d6af5a
Merge branch 'main' into upstream-sync-2024-05-19
robertgshaw2-redhat May 30, 2024
96fbf17
Merge branch 'main' into upstream-sync-2024-05-19
robertgshaw2-redhat May 30, 2024
db69b5c
reduce number of prompts and models in basic server correctness
Jun 1, 2024
0654a43
Merge branch 'nm-vllm-main' into upstream-sync-2024-05-19
Jun 1, 2024
3ba575c
fixed workflows
Jun 1, 2024
43c0adc
removed basic server correctness from release
Jun 2, 2024
50ac573
Update test_compressed.py
robertgshaw2-redhat Jun 2, 2024
1802833
Update test_compressed.py (#277)
robertgshaw2-redhat Jun 2, 2024
2c52fee
nit in setup.py
Jun 3, 2024
Merge branch 'nm-vllm-main' into upstream-sync-2024-05-19
Robert Shaw committed Jun 1, 2024
commit 0654a43ca4843861d1db541940fa5b99499f98a2

This merge commit was added into this branch cleanly.

There are no new changes to show, but you can still view the diff.