[Performance]: Profile & optimize the BlockManagerV2 #4536

cadedaniel · 2024-05-01T18:27:30Z

Proposal to improve performance

We've recently rewritten the block management subsystem for better testability. We need to profile it under real load to make sure it is performant enough to replace the block manager V1, and fix any issues.

We should do this once the block manager v2 is feature complete (still missing a few items).

Known issue:

Prefix caching num_total_tokens is O(N^2) instead of O(N) (see [Core] Enable prefix caching with block manager v2 enabled #4142 (comment))
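To illustrate how a per-block token total can end up quadratic, here is a toy sketch. It is not vLLM's code: `ToyBlock`, `build_chain`, `totals_by_walking`, and `totals_single_pass` are hypothetical names, and the linked-chain layout is an assumption made for illustration. Walking back to the head of the chain for every block costs N(N+1)/2 steps, while carrying a running sum costs N.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional, Tuple


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a block in a chain (not a vLLM class)."""
    num_tokens: int
    prev: Optional["ToyBlock"] = None


def build_chain(token_counts: List[int]) -> List[ToyBlock]:
    """Link blocks so that each one points at its predecessor."""
    chain: List[ToyBlock] = []
    for n in token_counts:
        chain.append(ToyBlock(n, chain[-1] if chain else None))
    return chain


def totals_by_walking(chain: List[ToyBlock]) -> Tuple[List[int], int]:
    """For every block, walk back to the head: O(N^2) steps overall."""
    totals, steps = [], 0
    for block in chain:
        total, cur = 0, block
        while cur is not None:
            total += cur.num_tokens
            cur = cur.prev
            steps += 1
        totals.append(total)
    return totals, steps


def totals_single_pass(chain: List[ToyBlock]) -> Tuple[List[int], int]:
    """Carry a running sum along the chain: O(N) steps overall."""
    totals = list(accumulate(b.num_tokens for b in chain))
    return totals, len(chain)
```

Both functions return the same totals; only the step count differs, which is the kind of gap a profile under long sequences would be expected to surface.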


cadedaniel · 2024-06-12T20:51:11Z

What we want to profile:

For low-latency use case:

Batch size of 8-16 range
Various block sizes (16, 32, 128)
Sequence length (long context, 1.5k). Can set num_output_tokens=50.
For spec decode, also set num_lookahead_tokens > 0. Try num_lookahead_tokens=5 (see "what is lookahead scheduling").

For high-throughput use case:

Batch size up to 256
Various block sizes (16, 32, 128)
Sequence length (long context, 1.5k). Can set num_output_tokens=50.
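The two scenarios above can be enumerated as a small case matrix. This is only a sketch: `ProfileCase` and `profile_cases` are hypothetical names, 1.5k is read as 1500 input tokens, and the concrete batch sizes (8 and 16 for low latency; 64, 128, and 256 for high throughput) are assumed sample points within the stated ranges.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class ProfileCase:
    """One profiling configuration (hypothetical container)."""
    scenario: str
    batch_size: int
    block_size: int
    input_len: int
    output_len: int
    num_lookahead_tokens: int


BLOCK_SIZES = (16, 32, 128)   # block sizes listed in the issue
INPUT_LEN = 1500              # assumption: "1.5k" means 1500 tokens
OUTPUT_LEN = 50               # num_output_tokens=50 from the issue


def profile_cases() -> List[ProfileCase]:
    cases: List[ProfileCase] = []
    # Low latency: batch size 8-16, with and without spec-decode lookahead.
    for batch, block, lookahead in product((8, 16), BLOCK_SIZES, (0, 5)):
        cases.append(ProfileCase("low-latency", batch, block,
                                 INPUT_LEN, OUTPUT_LEN, lookahead))
    # High throughput: batch size up to 256 (sample points are assumed).
    for batch, block in product((64, 128, 256), BLOCK_SIZES):
        cases.append(ProfileCase("high-throughput", batch, block,
                                 INPUT_LEN, OUTPUT_LEN, 0))
    return cases


if __name__ == "__main__":
    for case in profile_cases():
        print(case)
```

Keeping the matrix in one place makes it easier to rerun the same cases against block manager V1 and V2 and compare the results.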

Other cases that are important (perhaps we make separate tasks):

P0 prefix caching
P1 Beam search
P1 swapping
P1 sliding window

To profile, use benchmark_latency with torch profiling (or a CPU profiler of your choosing):

vllm/benchmarks/benchmark_latency.py

Lines 178 to 187 in c3c2903

    
    parser.add_argument(
        '--profile',
        action='store_true',
        help='profile the generation process of a single batch')
    parser.add_argument(
        '--profile-result-dir',
        type=str,
        default=None,
        help=('path to save the pytorch profiler output. Can be visualized '
              'with ui.perfetto.dev or Tensorboard.'))
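For the "CPU profiler of your choosing" route, a minimal sketch with the standard library's cProfile might look like the following. `toy_allocate` is a hypothetical placeholder for whatever block manager code path is being measured, and `profile_call` is an assumed helper name; neither is part of vLLM.

```python
import cProfile
import io
import pstats
from typing import Any, Callable, List, Tuple


def toy_allocate(num_seqs: int, seq_len: int, block_size: int) -> List[List[int]]:
    """Hypothetical workload: hand out block ids for each sequence."""
    blocks_per_seq = -(-seq_len // block_size)  # ceiling division
    free_blocks = list(range(num_seqs * blocks_per_seq))
    tables = []
    for _ in range(num_seqs):
        tables.append([free_blocks.pop() for _ in range(blocks_per_seq)])
    return tables


def profile_call(fn: Callable[..., Any], *args: Any,
                 top: int = 10, **kwargs: Any) -> Tuple[Any, str]:
    """Run fn under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


if __name__ == "__main__":
    _, report = profile_call(toy_allocate, num_seqs=256,
                             seq_len=1500, block_size=16)
    print(report)
```

Sorting by cumulative time tends to show which scheduler or allocator entry point dominates, which is a reasonable starting point before drilling into individual calls.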

cadedaniel · 2024-06-12T20:55:13Z

@robertgshaw2-neuralmagic can you assign Alex

cadedaniel added the performance Performance-related issues label May 1, 2024

This was referenced May 1, 2024

[Tracking issue] [Help wanted]: Deprecate BlockManagerV1 #4537

Open

[Usage]: doubt on computational complexity #4620

Closed

[Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630

Closed
