add accept num, emit num metric for ChainSpeculativeSampling #450

LiuXiaoxuanPKU · 2024-08-16T06:31:36Z

No description provided.

yzh119

Thanks for your contribution, @LiuXiaoxuanPKU! I added a few comments and suggestions.

include/flashinfer/sampling.cuh

python/csrc/sampling.cu

include/flashinfer/sampling.cuh

yzh119 · 2024-08-16T06:54:37Z

This file has to be updated:

flashinfer/python/csrc/flashinfer_ops.h

Lines 70 to 72 in 338b2f5

    
    torch::Tensor chain_speculative_sampling(torch::Tensor draft_probs, torch::Tensor draft_token_ids,
                                             torch::Tensor uniform_samples, torch::Tensor target_probs,
                                             bool deterministic);

if you change the function signature.
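As background for the signature change discussed here, the PR title refers to two per-request counters. The sketch below is a pure-Python reference for a single request, not the CUDA kernel from this PR. It assumes (the thread does not confirm this) that "accept num" counts every draft position that passes the usual test `u < p(token) / q(token)`, even after an earlier rejection, while "emit num" counts only the draft tokens accepted before the first rejection. The function name `chain_speculative_metrics` and its list-based arguments are hypothetical.

```python
def chain_speculative_metrics(draft_probs, draft_token_ids, uniform_samples, target_probs):
    """Illustrative reference for one request (hypothetical helper, not FlashInfer code).

    draft_probs[i] / target_probs[i]: draft (q) and target (p) distributions at position i.
    draft_token_ids[i]: token proposed by the draft model at position i.
    uniform_samples[i]: uniform random number in [0, 1) used for the acceptance test.
    Returns (accepted, emitted) under the assumptions stated in the lead-in.
    """
    accepted = 0   # positions that pass the test, regardless of earlier rejections
    emitted = 0    # positions accepted before the first rejection
    rejected = False
    # zip stops at the shortest input, so extra target rows are ignored here.
    for q, p, tok, u in zip(draft_probs, target_probs, draft_token_ids, uniform_samples):
        # u < p[tok] / q[tok], written without division so q[tok] == 0 is safe
        if u * q[tok] < p[tok]:
            accepted += 1
            if not rejected:
                emitted += 1
        else:
            rejected = True
    return accepted, emitted


# Position 0 passes, position 1 fails, position 2 passes again.
q = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
p = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
accepted, emitted = chain_speculative_metrics(q, [0, 1, 0], [0.5, 0.5, 0.5], p)
# accepted == 2, emitted == 1
```

Under these assumptions the two numbers differ whenever a later draft token would have passed the test after an earlier one was rejected, which is why both are worth reporting separately.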

LiuXiaoxuanPKU · 2024-08-17T05:10:38Z

@yzh119 Thanks for the review. I've addressed the comments; feel free to take another pass. Thanks!

yzh119

LGTM, thank you!

@LiuXiaoxuanPKU

🤖 I have created a release *beep* *boop*

---

## [0.1.6](v0.1.5...v0.1.6) (2024-08-27)

### SM75 Support

Starting from [0.1.6](v0.1.5...v0.1.6), our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).

### API Changes

#### `plan`/`run`

From [0.1.6](v0.1.5...v0.1.6) on, the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` API.

- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of nvmath API.
- `end_forward` is deprecated and has no effect after this PR.

There is one slight difference between the old `forward` and the new `run` API:

- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call; only the query and KV-Cache tensors need to be provided in the `run` API.

The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.

Check [#466](#466) for more details.

#### `MultiLevelCascadeAttentionWrapper`

From [0.1.6](v0.1.5...v0.1.6) on, we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.

See the [documentation](https://docs.flashinfer.ai/api/python/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) and [tutorial](https://docs.flashinfer.ai/tutorials/kv_layout.html#multi-level-cascade-inference-data-layout) for API usage and an explanation of the layout.

The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.

### Features

* sm75 support ([#448](#448), [#449](#449))
* add `MultiLevelCascadeAttentionWrapper` API ([#462](#462)) ([1e37989](1e37989))
* add accept num, emit num metric for ChainSpeculativeSampling ([#450](#450)) ([fa38b5e](fa38b5e))
* support bmm fp8 ([#469](#469)) ([f1c0b68](f1c0b68))

### Refactor

* refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` [#466](#466)

### Misc

* misc: improve error handling of sampling kernels ([#456](#456)) ([0dce178](0dce178))

### Performance Improvements

* slight optimization on f16->f8 fragment layout swizzling ([#453](#453)) ([0d61871](0d61871))
* slight optimization on fragment layout swizzle ([#458](#458)) ([7c397cb](7c397cb))
* use persistent kernel for merging attention states ([#459](#459)) ([be6bf5b](be6bf5b))

### Acknowledgement

We thank [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU) for enhancing the speculative sampling operator, [@merrymercy](https://github.com/merrymercy) for API change suggestions and [@zhyncs](https://github.com/zhyncs) for integrating the fp8 BMM cublas implementation.

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
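The `plan`/`run` split described in the release notes above can be pictured with a toy wrapper. This is a minimal sketch of the pattern only: `ToyAttentionWrapper` is a hypothetical stand-in, not a FlashInfer class, and its `run` simply returns the inputs together with the cached options instead of computing attention. The only behavior taken from the notes is that options such as `causal` and `logits_soft_cap` are supplied to `plan` and cached until the next `plan` call, that `run` receives only the query and KV-Cache, and that `end_forward` has no effect.

```python
class ToyAttentionWrapper:
    """Hypothetical illustration of the plan/run pattern (not FlashInfer code)."""

    def __init__(self):
        self._plan = None  # options cached by the most recent plan() call

    def plan(self, *, causal=False, logits_soft_cap=0.0):
        # All extra arguments are given once here and reused by later run() calls.
        self._plan = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, kv_cache):
        # Only the query and KV-Cache are passed per call.
        if self._plan is None:
            raise RuntimeError("call plan() before run()")
        return {"q": q, "kv_cache": kv_cache, **self._plan}

    # Legacy names kept as thin aliases (a simplification for this sketch).
    def begin_forward(self, **kwargs):
        self.plan(**kwargs)

    def forward(self, q, kv_cache):
        return self.run(q, kv_cache)

    def end_forward(self):
        pass  # no effect, mirroring the note that end_forward is deprecated


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True, logits_soft_cap=30.0)
first = wrapper.run("q0", "kv")   # reuses the cached options
second = wrapper.run("q1", "kv")  # same options, no need to pass them again
```

Caching the options in `plan` keeps the per-step `run` call small, which matches the motivation given in the notes for moving the extra arguments out of `forward`.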

add accept num, emit num

299d45b

zhyncs requested review from yzh119 and zhyncs August 16, 2024 06:33

zhyncs self-assigned this Aug 16, 2024

yzh119 reviewed Aug 16, 2024

include/flashinfer/sampling.cuh (outdated, resolved)

python/csrc/sampling.cu (outdated, resolved)

python/csrc/sampling.cu (outdated, resolved)

include/flashinfer/sampling.cuh (outdated, resolved)

LiuXiaoxuanPKU added 2 commits August 16, 2024 22:05

fix comments

032d0dc

minor

fbd5e4e

yzh119 approved these changes Aug 17, 2024

yzh119 merged commit fa38b5e into flashinfer-ai:main Aug 17, 2024

github-actions bot mentioned this pull request Aug 17, 2024

chore(main): release 0.1.6 #447

Merged

LiuXiaoxuanPKU mentioned this pull request Aug 18, 2024

[SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding vllm-project/vllm#7244

Merged

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Closed
