[Roadmap] vLLM Development Roadmap: H2 2023 #244

zhuohan123 · 2023-06-25T17:39:15Z

We summarize the issues we received and our planned features in this issue. This issue will keep being updated.

Latest issue tracked: #677

Software Quality

Code formater Add code formatting script & Add CI to check code format #57
Tests for model correctness Add tests for models #101
Tests for samplers Add tests for sampler #108
Pypi CD Add CD to PyPI #97
CI

Installation

CUDA version Build failure due to CUDA version mismatch #129
Pre-built CUDA Wheels Publish wheels with pre-built CUDA binaries #139 Request for creation of a wheel for vllm #695
Support ROCM Installing with ROCM #621
Windows/WSL installation Bug: Windows installation #179 WSL Ubuntu installation issue #192
H100 Add support for H100 #199 RuntimeError: attn_bias is not correctly aligned #407
Support CUDA 12 cuda 12 #385
Dockerfile feature request: Dockerfile #390
All other issues with Installation label

Documentation

Documentation CD
Documentation on LLMEngine and AsyncLLMEngine
Documentation on user interfaces and the APIs How to set ParallelConfig and SchedulerConfig? #361 Where is the API reference? #395
Documentation on distributed execution Documentation on distributed execution #206 When can I support multi graphics cards? #228 model parallelism #243 Multi-GPU inference and Specify which GPUs to be used during inference #250 多gpus如何使用？ #581
More detailed guide on adding a new model (possibly simplification in code). Especially how to modify the forward function. How integrate with hf with minial modification? #242
Include latency benchmark results.
On memory usage. Question regarding the nearly double GPU memory consumption. #241 GPU consumption #550
How to specify which GPU to use How to specify which gpu to use? #691 os.environ['CUDA_VISIBLE_DEVICES'] = '1' does not work in jupyter #571

New Models

Decoder-only models

BLOOM Support BLOOM #61
Falcon Support for Falcon-7B / 40B models #195 Any plans to support Falcon? #197 Anyone adapting falcon 40B&7B models now? #356
GPT-J Add support for GPTJ #198
MPT Support for MPT-7B and MPT-30B #218 feature request: support mpt-30b #332
LongChat Support for longchat-7b-16k #358
Baichuan-7B why not support baichuan-7b? #303 baichuan-7b return value of apiserver is garbled #400 Support for baichuan models #428
Baichuan-13B
LLaMA-2 Support LLaMA-2 #501

Encoder-decoder models

Other techniques:

Quantized models: see Kernels/Quantized PagedAttention
LoRA: Would it be possible to support LoRA fine-tuned models? #182
Multi-modal models: [Question] Usage with Multimodal LLM #307

Frontend Features

vLLM demo frontends:

List of inputs as OpenAI input Langchain passes prompt as a list instead of str #186 Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279
Echo Implementing Echo in OpenAI endpoint #201
Support ChatCompletion Endpoint Support ChatCompletion Endpoint in OpenAI demo server #311
Use soft embeddings as input does vicuna support embedding input? #369
Support logit_bias [Feature] Add support for logit_bias #379 I want use the function prefix_allowed_tokens_fn of huggingface model.generate(), where of vllm's source code shall I modify? #415
User-defined conversation template feature request: Support user-defined conversation template #408
Specify GPU to run on How to specify which GPU the model inference on? #352 Specify GPUs bug (torch.distributed.all_reduce(torch.zeros(1).cuda())) #470

Integration with other frontends:

FastChat (merged)
Ray Serve (merged)
NVIDIA Triton NVIDIA Triton support #541
SkyPilot
LangChain (Support from LangChain) LangChain and LlamaIndex support #233 Langchain passes prompt as a list instead of str #186 Langchain/LLAMA_INDEX #553

Engine Optimization and New Features

Kernels

Multi-query attention How does this compare to MQA (multi-query attention)? #169
PagedAttention kernel with multiple query positions. Fix the rushed out multi-query kernel #44
Quantized PagedAttention GPTQ / Quantization support? #174 What is the correct way to use quantized versions of vicuna or guanco? #210 8-bit quantization support #214 Not able to used qlora models with vllm #252 8bit support #295 support for quantized models? #316 Loading quantized models #392
Sampling kernels Implement custom kernels for top-k and top-p sampling #125 Question about sampler. It takes too much time #249
Condensed RotaryEmbeddings Support for Condensed RotaryEmbeddings #333 supporting superhot models? #388 RoPE scaling support? #464 Request: NTK rope support #479 Does vllm support vicuna-13b-v1.5-16k ? #674 Add AliBi context scaling into vllm for Baichuan13B #686
Flash Attention V2 Flash Attention V2 #485
FP8 Kernel TE FP8 support? #448

Bugs

Floating point comparison Dangerous floating point comparison #71
Check input length Check whether the input request is too long #113 Prompt size limits? It keeps hanging with prompts longer than 120 tokens #276 Long context will cause the vLLM stop #286 scheduler max-length #447
Do not init process groups when using a single GPU Do not initialize process group when using a single GPU #117 How to initialize two LLMs in one service？ #565 Running two different models on the same machine #654
Ray tensor parallel bugs ray OOM in tensor parallel #322 Stuck while inferring with WizardCoder model #366 [MPT-30B] OutOfMemoryError: CUDA out of memory #372 Cuda failure 'peer access is not supported between these two devices' #406
Performance comparison with TGI TGI performance is better than vllm on A800 #262 higher latency than TGI #335 Outdated benchmarks #381
All other issues with Bug label

The text was updated successfully, but these errors were encountered:

zjc17 · 2023-07-18T10:08:33Z

Is the Quantized models Supporting under developing?

WaterKnight1998 · 2023-07-20T22:32:19Z

Is the Quantized models Supporting under developing?

This would be very helpfull @zhuohan123. Thank you very much for the state of the art performance in inference!

Jwdev-wr · 2023-08-13T02:50:11Z

Can we get function calling to match openai api feature on the roadmap? Not entirely sure what the implementation for that looks like, but it's a very useful feature.

mondaychen · 2023-08-22T20:38:56Z

I have a prototype implementation of OpenAI-like function calling. It works well on advanced models (like Llama 2). Please let me know if this is something the team would consider taking in as part of vllm.

zhisbug · 2023-08-23T18:25:45Z

@mondaychen Yes, how about you submit a PR?

mondaychen · 2023-08-24T17:37:08Z

@zhisbug OK! I'll polish my prototype and submit a PR

boxter007 · 2023-09-06T12:09:24Z

Need to support Baichuan2

yeahjack · 2023-09-08T02:53:13Z

Here is an implementation of function calling with huggingface's models, could be helpful: https://local-llm-function-calling.readthedocs.io/en/latest/quickstart.html

Xu-Chen · 2023-09-26T15:19:28Z

Need to support Qwen-14b

SinclairCoder · 2023-10-13T08:55:10Z

Need to support Phi-1 and Phi-1.5

xiaotiancd · 2023-11-01T05:39:24Z

Possible to support CPU too?

zhouyuan · 2023-11-02T09:16:31Z

Possible to support CPU too?
Hi @xiaotiancd Here's one draft patch to support CPU based infer, in case you are interested.
#1028

-yuan

usaxena-asapp · 2023-11-09T01:24:39Z

Hey @zhouyuan @WoosukKwon, I'd like to get this new variant of concurrent-LORA serving added to the roadmap:

concurrent LORA serving:

jens-create · 2023-11-16T09:24:10Z

Is there any plans to support functions like OpenAI? I know this task is complex as the parsing of llm output will be custom for each fine-tuned model depending on the training data. However, perhaps it would be possible to add a module/function that you can inject into api_server.py that maps the output of the llm (output.text) to a ChatMessage.

For example functionary has copied some of vllm and extended/customised it to support functions

In the future, when hopefully, more open-source models with function calling capabilities are released, it would be great if one does not have to clone a repository for each model but instead if the particular parsing was supported by vllm.

What thoughts are there on this matter? I wouldn't mind contributing to such a feature...

OleksandrKorovii · 2023-12-05T11:59:16Z

Is there any plans to support functions like OpenAI? I know this task is complex as the parsing of llm output will be custom for each fine-tuned model depending on the training data. However, perhaps it would be possible to add a module/function that you can inject into api_server.py that maps the output of the llm (output.text) to a ChatMessage.

For example functionary has copied some of vllm and extended/customised it to support functions

In the future, when hopefully, more open-source models with function calling capabilities are released, it would be great if one does not have to clone a repository for each model but instead if the particular parsing was supported by vllm.

What thoughts are there on this matter? I wouldn't mind contributing to such a feature...

Also interesting in this question

zhuohan123 · 2024-01-31T06:32:57Z

We have deprecated this roadmap. Please find our latest roadmap in #2681

zhuohan123 pinned this issue Jun 25, 2023

929359291 mentioned this issue Jun 27, 2023

Why vllm does not support Chinese input #246

Closed

BrightXiaoHan mentioned this issue Jul 7, 2023

请问还会有后续开发吗？ BrightXiaoHan/fast-chatglm#3

Open

zhyncs mentioned this issue Jul 21, 2023

NVIDIA Triton support #541

Closed

lbeurerkellner mentioned this issue Aug 1, 2023

Integrate vLLM eth-sri/lmql#143

Open

Symbolk mentioned this issue Aug 15, 2023

GPTQ / Quantization support? #174

Closed

This was referenced Aug 16, 2023

T5 model support #404

Closed

T5 like encoder-decoder model support #668

Closed

EricLingRui mentioned this issue Aug 25, 2023

Qwen-7B, set max_num_batched_tokens=4096, then output role is null and content is empty string #809

Closed

wangcx18 mentioned this issue Aug 25, 2023

Add GLM-10B-Chinese support #868

Closed

LinPoly mentioned this issue Aug 30, 2023

Would RPyC be a good alternative of Ray? #908

Closed

AmoghM mentioned this issue Sep 4, 2023

Integrate Speculative decoding to speed up inferences #942

Closed

litone01 mentioned this issue Sep 11, 2023

[DOC] Add additional comments for LLMEngine and AsyncLLMEngine #1011

Merged

2 tasks

This was referenced Sep 23, 2023

support whisper? #1152

Closed

vllm can't run peft model? #1129

Open

This was referenced Sep 28, 2023

JSON formatting issue #1191

Closed

Phi 1.5 support #1167

Closed

jeejeelee mentioned this issue Jan 19, 2024

[FIX]avoid initialize process group when using a single GPU #2496

Closed

simon-mo unpinned this issue Jan 26, 2024

zhuohan123 mentioned this issue Jan 31, 2024

[Roadmap] vLLM Roadmap Q1 2024 #2681

Closed

30 tasks

zhuohan123 changed the title ~~vLLM Development Roadmap~~ [Deprecated] vLLM Development Roadmap Jan 31, 2024

zhuohan123 closed this as completed Jan 31, 2024

duanzhaol mentioned this issue Mar 11, 2024

What's up with Pipeline Parallelism? #3314

Open

taoluo mentioned this issue Mar 22, 2024

Add Sarathi-Serve support in vLLM #3121

Draft

simon-mo mentioned this issue Apr 4, 2024

[Roadmap] vLLM Roadmap Q2 2024 #3861

Closed

65 tasks

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024

Add FP8 and marlin 2:4 tests for lm-eval (vllm-project#244)

59cf939

simon-mo mentioned this issue Oct 1, 2024

[Roadmap] vLLM Roadmap Q4 2024 #9006

Open

39 tasks

simon-mo changed the title ~~[Deprecated] vLLM Development Roadmap~~ [Roadmap] vLLM Development Roadmap: H2 2023 Oct 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] vLLM Development Roadmap: H2 2023 #244

[Roadmap] vLLM Development Roadmap: H2 2023 #244

zhuohan123 commented Jun 25, 2023 •

edited by simon-mo

Loading

zjc17 commented Jul 18, 2023

WaterKnight1998 commented Jul 20, 2023

Jwdev-wr commented Aug 13, 2023

mondaychen commented Aug 22, 2023

zhisbug commented Aug 23, 2023

mondaychen commented Aug 24, 2023

boxter007 commented Sep 6, 2023

yeahjack commented Sep 8, 2023

Xu-Chen commented Sep 26, 2023

SinclairCoder commented Oct 13, 2023

xiaotiancd commented Nov 1, 2023

zhouyuan commented Nov 2, 2023

usaxena-asapp commented Nov 9, 2023

jens-create commented Nov 16, 2023

OleksandrKorovii commented Dec 5, 2023

zhuohan123 commented Jan 31, 2024

[Roadmap] vLLM Development Roadmap: H2 2023 #244

[Roadmap] vLLM Development Roadmap: H2 2023 #244

Comments

zhuohan123 commented Jun 25, 2023 • edited by simon-mo Loading

Software Quality

Installation

Documentation

New Models

Frontend Features

Engine Optimization and New Features

Kernels

Bugs

zjc17 commented Jul 18, 2023

WaterKnight1998 commented Jul 20, 2023

Jwdev-wr commented Aug 13, 2023

mondaychen commented Aug 22, 2023

zhisbug commented Aug 23, 2023

mondaychen commented Aug 24, 2023

boxter007 commented Sep 6, 2023

yeahjack commented Sep 8, 2023

Xu-Chen commented Sep 26, 2023

SinclairCoder commented Oct 13, 2023

xiaotiancd commented Nov 1, 2023

zhouyuan commented Nov 2, 2023

usaxena-asapp commented Nov 9, 2023

jens-create commented Nov 16, 2023

OleksandrKorovii commented Dec 5, 2023

zhuohan123 commented Jan 31, 2024

zhuohan123 commented Jun 25, 2023 •

edited by simon-mo

Loading