[Performance]: From SequenceGroup-native code to Sequence-native code #7116

youkaichao (Member) opened this issue on Aug 4, 2024
Labels: performance (Performance-related issues)
Proposal to improve performance

We have two concepts in vLLM:

  • SequenceGroup: a group of sequences that originate from the same request. In most use cases, a sequence group contains only one sequence. In parallel sampling, a request can fork into many sequences, depending on the sampling parameter n. In beam search, the sequences in a sequence group can change, grow, and die.
  • Sequence: a single sequence as seen by the inference engine. It has a prompt, generated tokens, KV cache, and so on (a simplified sketch of both concepts follows).
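
For readers less familiar with these classes, here is a minimal, simplified sketch of the two concepts; the field names are illustrative, not the actual vLLM definitions:

# Simplified sketch only; not the real vLLM classes.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    seq_id: int
    prompt_token_ids: List[int]
    output_token_ids: List[int] = field(default_factory=list)
    # ... plus KV-cache block table, status, etc.


@dataclass
class SequenceGroup:
    request_id: str
    seqs: List[Sequence]  # usually exactly one sequence
    # With parallel sampling (n > 1) or beam search, the group holds
    # several sequences that can be forked, grown, or dropped.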

In order to support diverse sampling algorithms, vLLM currently takes a SequenceGroup-native approach: many functions operate at the SequenceGroup level, e.g. prepare_input takes a list of SequenceGroup.

The problem is that many functions in an inference engine naturally fit Sequence-level operations. For example, when we talk about the batch size for decoding, it is the number of Sequences we are decoding, not the number of SequenceGroups.

To bridge the gap, there is a lot of code in vLLM that receives SequenceGroups and unpacks them into Sequences for further operations. Notably, in prepare_input:

input_tokens = flatten_2d_lists([
    flatten_2d_lists(inter_data.input_tokens)
    for inter_data in self.inter_data_list
])

This turns out to be very inefficient and makes the code difficult to read and maintain.

To get a rough impression of how inefficient these conversions can be, take a look at #7051, where simply removing some get_seqs calls in SequenceGroup leads to a 1% end-to-end throughput gain.

Per the discussion in #6226, we will not directly drop beam search support. However, we should figure out a way to support it without hurting the performance of the majority use case.

The proposal I want to discuss is to move the vLLM code to a Sequence-native approach. It is inspired by the lightllm approach:

  • each request will have a request id and a sequence group id
  • a sequence in the sequence group will have a sequence group id and a sequence id
  • there will be a global mapping Dict[int, List[int]] that maps the sequence group id to the ids of the sequences inside the group, maintained only for sequence groups with parallel sampling or beam search (a rough sketch follows)
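
A rough sketch of what that global mapping could look like; seq_group_to_seqs and fork_sequence are hypothetical names, not existing vLLM APIs:

# Hypothetical sketch of the proposed global mapping.
from typing import Dict, List

# sequence group id -> ids of the sequences inside the group.
# Populated ONLY for requests using parallel sampling (n > 1) or beam search.
seq_group_to_seqs: Dict[int, List[int]] = {}


def fork_sequence(seq_group_id: int, new_seq_id: int) -> None:
    """Record that a new sequence was forked into an existing group."""
    seq_group_to_seqs.setdefault(seq_group_id, []).append(new_seq_id)

For a plain request (n == 1) the mapping has no entry at all, so the common path never has to touch any SequenceGroup bookkeeping.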

All functions that operate at the Sequence level (mainly the model runner part) will natively receive a list of Sequence. They no longer need to unpack SequenceGroups.

Some functions still operate at the SequenceGroup level (mainly the scheduler logic for gang-scheduling a sequence group, and the output processor logic that creates/removes sequences in the group). They have to reconstruct the sequence group from the given list of sequences, leveraging the global mapping. Note that an important optimization is that we can skip all of the sequence group logic whenever the global mapping is empty, meaning there is no parallel sampling or beam search at all (sketched below).
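
A hedged sketch of this fast path; schedule_sequences, schedule_groups, and reconstruct_groups are hypothetical helpers, not existing vLLM functions:

# Sketch only; assumes the hypothetical seq_group_to_seqs mapping above.
from typing import Dict, List


def schedule_step(seqs: List["Sequence"],
                  seq_group_to_seqs: Dict[int, List[int]]) -> None:
    if not seq_group_to_seqs:
        # Fast path: no parallel sampling or beam search anywhere in the
        # batch, so all SequenceGroup bookkeeping can be skipped entirely.
        schedule_sequences(seqs)
        return

    # Slow path: rebuild the groups from the global mapping so that
    # gang-scheduling and output processing can still see whole groups.
    groups = reconstruct_groups(seqs, seq_group_to_seqs)
    schedule_groups(groups)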

When we do have parallel sampling or beam search, this reconstruction will incur some performance drop. However, with the greatly simplified code in the model runner, we can expect the rest of vLLM to become significantly faster, so beam search and parallel sampling can also end up faster at the end of the day.

An example benefit is that this function can be greatly simplified (we can return early):

def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
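
A hedged sketch of what that early return could look like (not the actual vLLM implementation; the attribute and method names only loosely follow the existing Sequence API):

def _process_sequence_group_outputs(self, seq_group, outputs):
    seqs = seq_group.get_seqs()
    if len(seqs) == 1:
        # Common case: no parallel sampling or beam search, so there is
        # nothing to fork, rank, or prune; append the sampled token and return.
        sample = outputs.samples[0]
        seqs[0].append_token_id(sample.output_token, sample.logprobs)
        return
    # ... fall through to the existing beam-search / parallel-sampling logic ...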

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`