[Performance]: From SequenceGroup-native code to Sequence-native code #7116

youkaichao (Member) opened this issue on Aug 4, 2024
Labels: performance (Performance-related issues)
Proposal to improve performance

We have two concepts in vLLM:

  • SequenceGroup: a group of sequences that originate from the same request. In most use cases, a sequence group contains only one sequence. In parallel sampling, a request can fork into many sequences, depending on the sampling parameter n. In beam search, the sequences in a sequence group can change, grow, and die.
  • Sequence: a single sequence as seen by the inference engine. It has a prompt, generated tokens, KV cache, and so on (a simplified sketch of both concepts follows).
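
For readers less familiar with these classes, here is a minimal, simplified sketch of the two concepts; the field names are illustrative, not the actual vLLM definitions:

# Simplified sketch only; not the real vLLM classes.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    seq_id: int
    prompt_token_ids: List[int]
    output_token_ids: List[int] = field(default_factory=list)
    # ... plus KV-cache block table, status, etc.


@dataclass
class SequenceGroup:
    request_id: str
    seqs: List[Sequence]  # usually exactly one sequence
    # With parallel sampling (n > 1) or beam search, the group holds
    # several sequences that can be forked, grown, or dropped.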

In order to support diverse sampling algorithms, vLLM currently takes a SequenceGroup-native approach: many functions operate at the SequenceGroup level, e.g. prepare_input takes a list of SequenceGroup.

The problem is that many functions in an inference engine naturally fit Sequence-level operations. For example, when we talk about the batch size for decoding, it is the number of Sequences we are decoding, not the number of SequenceGroups.

To bridge the gap, there is a lot of code in vLLM that receives SequenceGroups and unpacks them into Sequences for further operations. Notably, in prepare_input:

input_tokens = flatten_2d_lists([
    flatten_2d_lists(inter_data.input_tokens)
    for inter_data in self.inter_data_list
])

This turns out to be very inefficient and makes the code difficult to read and maintain.

To get a rough impression of how inefficient these conversions can be, take a look at #7051, where simply removing some get_seqs calls in SequenceGroup leads to a 1% end-to-end throughput gain.

Per the discussion in #6226, we will not directly drop beam search support. However, we should figure out a way to support it without hurting the performance of the majority use case.

The proposal I want to discuss is to move the vLLM code to a Sequence-native approach. It is inspired by the lightllm approach:

  • each request will have a request id and a sequence group id
  • a sequence in the sequence group will have a sequence group id and a sequence id
  • there will be a global mapping Dict[int, List[int]] that maps the sequence group id to the ids of the sequences inside the group, maintained only for sequence groups with parallel sampling or beam search (a rough sketch follows)
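
A rough sketch of what that global mapping could look like; seq_group_to_seqs and fork_sequence are hypothetical names, not existing vLLM APIs:

# Hypothetical sketch of the proposed global mapping.
from typing import Dict, List

# sequence group id -> ids of the sequences inside the group.
# Populated ONLY for requests using parallel sampling (n > 1) or beam search.
seq_group_to_seqs: Dict[int, List[int]] = {}


def fork_sequence(seq_group_id: int, new_seq_id: int) -> None:
    """Record that a new sequence was forked into an existing group."""
    seq_group_to_seqs.setdefault(seq_group_id, []).append(new_seq_id)

For a plain request (n == 1) the mapping has no entry at all, so the common path never has to touch any SequenceGroup bookkeeping.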

All functions that operate at the Sequence level (mainly the model runner part) will natively receive a list of Sequence. They no longer need to unpack SequenceGroups.

Some functions still operate at the SequenceGroup level (mainly the scheduler logic for gang-scheduling a sequence group, and the output processor logic that creates/removes sequences in the group). They have to reconstruct the sequence group from the given list of sequences, leveraging the global mapping. Note that an important optimization is that we can skip all of the sequence group logic whenever the global mapping is empty, meaning there is no parallel sampling or beam search at all (sketched below).
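
A hedged sketch of this fast path; schedule_sequences, schedule_groups, and reconstruct_groups are hypothetical helpers, not existing vLLM functions:

# Sketch only; assumes the hypothetical seq_group_to_seqs mapping above.
from typing import Dict, List


def schedule_step(seqs: List["Sequence"],
                  seq_group_to_seqs: Dict[int, List[int]]) -> None:
    if not seq_group_to_seqs:
        # Fast path: no parallel sampling or beam search anywhere in the
        # batch, so all SequenceGroup bookkeeping can be skipped entirely.
        schedule_sequences(seqs)
        return

    # Slow path: rebuild the groups from the global mapping so that
    # gang-scheduling and output processing can still see whole groups.
    groups = reconstruct_groups(seqs, seq_group_to_seqs)
    schedule_groups(groups)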

When we do have parallel sampling or beam search, this reconstruction will incur some performance drop. However, with the greatly simplified code in the model runner, we can expect the rest of vLLM to become significantly faster, so beam search and parallel sampling can also end up faster at the end of the day.

An example benefit is that this function can be greatly simplified (we can return early):

def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
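
A hedged sketch of what that early return could look like (not the actual vLLM implementation; the attribute and method names only loosely follow the existing Sequence API):

def _process_sequence_group_outputs(self, seq_group, outputs):
    seqs = seq_group.get_seqs()
    if len(seqs) == 1:
        # Common case: no parallel sampling or beam search, so there is
        # nothing to fork, rank, or prune; append the sampled token and return.
        sample = outputs.samples[0]
        seqs[0].append_token_id(sample.output_token, sample.logprobs)
        return
    # ... fall through to the existing beam-search / parallel-sampling logic ...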

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`