[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor #3431

Merged: 9 commits merged into vllm-project:main on Mar 21, 2024

Conversation

@ElizaWszola (Contributor) commented Mar 15, 2024

Make the evictor based on OrderedDict rather than Dict, so that we get faster eviction without hurting the performance of adding and removing PhysicalTokenBlocks. This narrows the gap between the cached and uncached runtimes in throughput benchmarks.
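As a rough illustration of the data-structure change, here is a minimal sketch of an OrderedDict-backed evictor (a simplified stand-in, not the actual vLLM code; the block attributes block_hash and last_accessed follow the PR's terminology, everything else is hypothetical):

```python
from collections import OrderedDict


class OrderedDictEvictorSketch:
    """Minimal sketch of an OrderedDict-backed LRU evictor.

    Free blocks are keyed by block_hash. Because a block is only
    re-inserted when its ref count drops back to zero, insertion order
    tracks access order, so the least recently used block sits at the
    front of the dict.
    """

    def __init__(self):
        self.free_table = OrderedDict()  # block_hash -> block

    def add(self, block):
        # O(1): a newly freed block goes to the end (most recently used).
        self.free_table[block.block_hash] = block

    def remove(self, block_hash):
        # O(1): pop a specific block when it gets allocated again.
        return self.free_table.pop(block_hash)

    def evict(self):
        # O(1): the oldest entry is at the front, so no scan over the
        # whole table is needed, unlike a plain dict keyed only by hash.
        if not self.free_table:
            raise ValueError("No free blocks to evict")
        _, block = self.free_table.popitem(last=False)
        return block
```

The actual change additionally tie-breaks among the oldest free blocks by num_hashed_tokens (see the diff excerpt discussed in the review below); the sketch only shows the part that makes eviction cheap.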

Results of python benchmark_throughput_cache.py --backend vllm --model huggyllama/llama-7b --dataset ../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 (10 runs each):

OrderedDict-based Evictor (this PR)
Throughput: 8.42 requests/s, 4122.29 tokens/s
Throughput: 8.45 requests/s, 4137.79 tokens/s
Throughput: 8.46 requests/s, 4141.09 tokens/s
Throughput: 8.48 requests/s, 4154.49 tokens/s
Throughput: 8.48 requests/s, 4153.40 tokens/s
Throughput: 8.50 requests/s, 4161.52 tokens/s
Throughput: 8.52 requests/s, 4174.35 tokens/s
Throughput: 8.53 requests/s, 4175.47 tokens/s
Throughput: 8.53 requests/s, 4175.94 tokens/s
Throughput: 8.56 requests/s, 4192.30 tokens/s

Dict-based Evictor (old)
Throughput: 8.25 requests/s, 4040.61 tokens/s
Throughput: 8.34 requests/s, 4085.23 tokens/s
Throughput: 8.26 requests/s, 4045.35 tokens/s
Throughput: 8.27 requests/s, 4049.02 tokens/s
Throughput: 8.32 requests/s, 4075.07 tokens/s
Throughput: 8.26 requests/s, 4043.52 tokens/s
Throughput: 8.25 requests/s, 4038.19 tokens/s
Throughput: 8.34 requests/s, 4082.67 tokens/s
Throughput: 8.17 requests/s, 3998.23 tokens/s
Throughput: 8.24 requests/s, 4034.38 tokens/s

No prefix caching (with improvements from PR #3357)
Throughput: 8.50 requests/s, 4163.69 tokens/s
Throughput: 8.54 requests/s, 4183.18 tokens/s
Throughput: 8.56 requests/s, 4193.75 tokens/s
Throughput: 8.57 requests/s, 4198.46 tokens/s
Throughput: 8.60 requests/s, 4211.83 tokens/s
Throughput: 8.63 requests/s, 4228.16 tokens/s
Throughput: 8.67 requests/s, 4247.58 tokens/s
Throughput: 8.69 requests/s, 4253.06 tokens/s
Throughput: 8.75 requests/s, 4286.50 tokens/s
Throughput: 8.78 requests/s, 4296.93 tokens/s

Results of running benchmarks from https://github.com/neuralmagic/nm-vllm/pull/102/files

with --enable-prefix-caching:

Successful requests: 2000
Benchmark duration: 809.242636 s
Total input tokens: 491513
Total generated tokens: 325079
Request throughput: 2.47 requests/s
Input token throughput: 607.37 tokens/s
Output token throughput: 401.71 tokens/s
Mean TTFT: 38.35 ms
Median TTFT: 29.28 ms
P99 TTFT: 95.89 ms
Mean TPOT: 14.23 ms
Median TPOT: 14.05 ms
P99 TPOT: 17.75 ms

without --enable-prefix-caching:

Successful requests: 2000
Benchmark duration: 809.252614 s
Total input tokens: 491513
Total generated tokens: 325096
Request throughput: 2.47 requests/s
Input token throughput: 607.37 tokens/s
Output token throughput: 401.72 tokens/s
Mean TTFT: 38.37 ms
Median TTFT: 29.02 ms
P99 TTFT: 95.69 ms
Mean TPOT: 14.16 ms
Median TPOT: 14.01 ms
P99 TPOT: 17.67 ms

New Sonnet Dataset results with huggyllama/llama-7b model:

(See @robertgshaw2-neuralmagic's comment below for more info.)

This PR

Prefix Caching Off

Mean TPOT: 28.02 ms
Mean TTFT: 117.66 ms

Prefix Caching On

Mean TPOT: 15.44 ms
Mean TTFT: 41.87 ms

@ElizaWszola ElizaWszola changed the title [WIP] OrderedDict-based evictor [WIP][PREFIX CACHING FOLLOW UP] OrderedDict-based evictor Mar 15, 2024
@richardliaw (Collaborator) commented:

cc @cadedaniel

@robertgshaw2-neuralmagic (Collaborator) commented Mar 16, 2024

This PR addresses the performance issues in the initial version of automatic prefix caching, which focused on correctness.

Specifically, in the initial implementation, APC suffered from poor performance once the server had been warmed up, due to slow eviction from the cache evictor. This issue does not arise until the server is properly warmed up (i.e., once "all" the blocks are marked as cached). This PR updates the eviction logic to use an ordered dict, which dramatically improves performance (turning an O(N) lookup into an O(1) lookup).

@ElizaWszola I added some more datasets to the benchmark scripts to properly run the server benchmark analysis.

  • Note: sending ~600 ultrachat requests will cause the eviction logic to trigger.

The analysis below looks at Mistral-7b on an A100-80GB. The input shapes and request rate result in medium concurrency, with ~10 active requests at a time.

ShareGPT

First, we look at the ShareGPT dataset, which has ~no opportunity for prefix caching since none of the prompts are repeated. The goal is to have as little overhead from automatic prefix caching as possible in this case.

Current Main

On the A100-80GB, the eviction logic has a large negative impact on performance:

Prefix Caching Off

  • Mean TPOT: 13.53ms
  • Mean TTFT: 50.00ms

Prefix Caching On

  • Mean TPOT: 59.79ms
  • Mean TTFT: 487.24ms

^ Note: this requires warming up the server with ~1000 requests to trigger the eviction logic.

This PR

This PR resolves the performance issues on main, driving the overhead down to ~3.5%:

Prefix Caching Off

  • Mean TPOT: 13.53ms
  • Mean TTFT: 63.33ms

Prefix Caching On

  • Mean TPOT: 14.05ms
  • Mean TTFT: 64.99ms

Sonnet Dataset

Next, we look at the Sonnet dataset, where we repeatedly send requests with the exact same prompt. This is a best-case scenario for APC, and we should expect to see performance gains.

Note: ignore the results below; they need to be re-run with Llama, or with an option to turn off the sliding window for Mistral. Zephyr is Mistral-based, so it ignores prefix caching on the forward pass due to the sliding window.

This PR

Prefix Caching Off

  • Mean TPOT: 27.31ms
  • Mean TTFT: 118.01ms

Prefix Caching On

  • Mean TPOT: 28.40ms
  • Mean TTFT: 118.55ms

Re-create Results

Launch Server

  • Launch Server without Prefix Caching:
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096 
  • Launch Server with Prefix Caching:
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096 --enable-prefix-caching

Launch Client

  • Warm up the server (to fill the cache for the prefix-caching-on case):
python3 benchmark_serving_new.py  --model HuggingFaceH4/zephyr-7b-beta --dataset ultrachat --num-prompts 1000 --backend openai --endpoint /v1/completions
  • Run sharegpt requests:
python3 benchmark_serving_new.py  --model HuggingFaceH4/zephyr-7b-beta --dataset sharegpt --request-rate 2.5 --num-prompts 1000 --backend openai --endpoint /v1/completions
  • Run sonnet requests:
python3 benchmark_serving_new.py  --model HuggingFaceH4/zephyr-7b-beta --dataset sonnet --request-rate 2.5 --num-prompts 500 --backend openai --endpoint /v1/completions

@zhuohan123 zhuohan123 self-assigned this Mar 18, 2024
@ElizaWszola ElizaWszola changed the title [WIP][PREFIX CACHING FOLLOW UP] OrderedDict-based evictor [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor Mar 19, 2024
@zhuohan123 (Member) left a comment:

Some initial comments I had before our offline chat. Let me know when this PR is ready for review again :)

Member:

Let's move this file to benchmarks/data/sonnet.txt to make the directory look cleaner?

Contributor (Author):

Will do it in a separate PR.

Member:

What's the difference between this and the old benchmark_serving.py? If it's just WIP code, it's fine. Otherwise, let's include this in another PR?

Contributor (Author):

I'll make a separate PR that rewrites benchmark_serving.py a bit.

@ElizaWszola (Contributor, Author) commented:

@zhuohan123 I've nuked the benchmark stuff and will move it to a different PR - this PR is ready for a re-review :)

Comment on lines +74 to +77
for _, block in self.free_table.items():
    if evicted_block.last_accessed < block.last_accessed:
        break
    if evicted_block.num_hashed_tokens < block.num_hashed_tokens:
Member:

This doesn't seem correct to me? Say a block's last_accessed time is changed; its relative position in the OrderedDict will stay the same and will not be updated. Then we might evict a block that was accessed recently but was added to the dict early.

@ElizaWszola (Contributor, Author) commented Mar 21, 2024:

The evictor contains only blocks with ref count zero. If a block is allocated, it will be popped (allocate() function in CachedBlockAllocator). So when we access the block, we first pop it, then update its access time and process it, and then push it back if its ref count drops to zero again.
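To make that invariant concrete, here is a hedged sketch of the allocate/free flow described above (hypothetical helper functions and attribute names; the real allocate() in CachedBlockAllocator differs in detail):

```python
def allocate_cached_block(allocator, block_hash, now):
    # A cached block being reused is popped out of the evictor first,
    # so it is never inside the OrderedDict while it is in use.
    block = allocator.evictor.remove(block_hash)
    block.ref_count += 1
    block.last_accessed = now  # access time is updated while in use
    return block


def free_block(allocator, block):
    block.ref_count -= 1
    if block.ref_count == 0:
        # The block is pushed back only when it becomes evictable
        # again, so it lands at the end of the OrderedDict with the
        # freshest access time; insertion order therefore matches
        # access order for the blocks the evictor actually holds.
        allocator.evictor.add(block)
```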

Member:

Oh I got it! This is smart lol

@zhuohan123 (Member) left a comment:

LGTM! Thanks for the clever fix!

@zhuohan123 zhuohan123 merged commit 6ebd02b into vllm-project:main Mar 21, 2024
31 checks passed
tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Mar 21, 2024
* upstream/main:
  [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
  [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
  [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
  Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
  [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
  [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
  Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
  [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
  [1/n] Triton sampling kernel (vllm-project#3186)
  [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Luka <luka@paperspace>