Woo, thank you @zhyncs.
I just tried the new image lmsysorg/sglang:v0.4.3.post2-cu125.
The performance seems similar to 0.4.2 (on 16 x H20).
When running-req = 1, the gen throughput (token/s) is no higher than before.
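For context, the throughput figure discussed above is just decoded tokens divided by decode wall time per request; a minimal sketch (the helper name is my own, not part of sglang):

```python
def gen_throughput(num_generated_tokens: int, decode_seconds: float) -> float:
    """Generation throughput in token/s: tokens decoded per second of
    decode-phase wall time for one request (illustrative helper)."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return num_generated_tokens / decode_seconds

# e.g. 512 tokens decoded in 4.0 s -> 128.0 token/s
rate = gen_throughput(512, 4.0)
```

With a single running request this number is bounded by per-step decode latency, which is why speculative decoding (EAGLE 2 / MTP below) is the main lever for improving it.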
Triton Backend
@ispobock @pankajroark
refactor triton backend 1, 2
support custom mask
support EAGLE 2
compatible with CUDA Graph
support nextn I (single MTP head)
support nextn II (multi MTP heads) (WIP @pankajroark)
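The "custom mask" item above is what lets a backend attend over tree-structured draft tokens (as EAGLE 2 produces) instead of a plain causal sequence. A minimal sketch of building such a mask from parent pointers (my own illustration of the idea, not the Triton kernel's actual layout):

```python
def build_tree_mask(parents: list[int]) -> list[list[bool]]:
    """Boolean attention mask for a draft-token tree.

    parents[i] is the index of draft token i's parent, or -1 for a root.
    Token i may attend to token j iff j lies on i's path back to the root
    (including i itself), so siblings in the tree never see each other.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, enabling each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

# One root (token 0) with two children: each child attends to itself and
# the root, but not to its sibling.
m = build_tree_mask([-1, 0, 0])
```

CUDA Graph compatibility then mostly reduces to keeping this mask (and the tree shape) in fixed-size buffers so the captured graph can be replayed.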
FlashInfer Backend
@zhyncs @yzh119
compatible with MLA disabled
support FlashInfer nightly MLA ragged prefill and CUDA Core MLA decoding
support FlashInfer v0.2.0.post3 MLA ragged, paged prefill and decoding (@zhyncs @yzh119 )
the nextn parts can be shared with the Triton Backend
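"Paged" prefill and decoding above refer to FlashInfer's page-table KV-cache layout, where a sequence's logical pages map to scattered physical pages. A rough sketch of the indexing idea (page size and helper name are my own, not FlashInfer's API):

```python
def token_to_page_slot(position: int,
                       page_table: list[int],
                       page_size: int = 16) -> tuple[int, int]:
    """Map a token position within a sequence to (physical_page, slot).

    page_table[k] gives the physical page holding the sequence's k-th
    logical page; each page stores page_size tokens' KV entries.
    """
    logical_page, slot = divmod(position, page_size)
    return page_table[logical_page], slot

# Sequence whose logical pages 0, 1, 2 live in physical pages 7, 3, 9:
page_table = [7, 3, 9]
# Token 20 falls in logical page 1, slot 4 -> physical page 3.
loc = token_to_page_slot(20, page_table)
```

"Ragged" prefill, by contrast, packs variable-length prompts contiguously without the page indirection.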
EAGLE 2
@zhyncs @Ying1123
implement sampling kernel in sgl-kernel (drop cutex): kernel part, python part
bunch of fixes: non-greedy fix, disable CUDA graph fix 1, fix 2, cleanup 1, cleanup 2, fix CUDA graph capture failure, fix 2, reduce one draft forward
compatible with radix cache and chunked prefill (WIP @Ying1123 )
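At its core, EAGLE-style speculation accepts draft tokens only where the target model agrees, then takes one correction token from the target. A greedy-verification sketch (deliberately simplified: real EAGLE 2 verifies a token tree and can use probabilistic acceptance for non-greedy sampling, which the fixes above address):

```python
def verify_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Accept the longest draft prefix matching the target model's greedy
    choices, then append the target's token at the first disagreement
    (the "bonus" token), so every verify step emits at least one token.

    target_tokens[i] is the target model's greedy prediction at position i
    given the prompt plus draft_tokens[:i]; len(target_tokens) must be
    len(draft_tokens) + 1.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # first mismatch: stop accepting draft tokens
        accepted.append(d)
    # Correction/bonus token from the target model at the mismatch point
    # (or one extra token when the whole draft was accepted).
    accepted.append(target_tokens[len(accepted)])
    return accepted

# Draft [5, 9, 2] vs target greedy [5, 9, 7, 1]: accept 5 and 9, take 7.
out = verify_greedy([5, 9, 2], [5, 9, 7, 1])
```

The radix-cache/chunked-prefill compatibility work above is about making sure rejected draft tokens are rolled back from the shared prefix cache correctly.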