Sequence Parallel #1041

Closed
wants to merge 44 commits into from

Conversation

@ZYHowell commented Aug 12, 2024

Motivation

When serving an extremely large model (e.g. Llama 400B), the number of GPUs can exceed the number of KV heads. In that case the KV cache has to be replicated across GPUs, which is troublesome when the sequence length is very large.

Modification

This PR introduces a very basic form of sequence parallelism for the attention computation. All other parts of the model remain fully tensor-parallelized, so the partitioning switches before and after attention. This is achieved by:

  1. When preparing the batch, group the input IDs assigned to the same sequence parallel rank (sp_rank) together; this is referred to as the sequence parallel layout in this PR and in the code comments. The FlashInfer args are changed accordingly;
  2. Before entering the SP part, only the locally stored KV is computed (python/sglang/srt/layers/linear.py);
  3. The SP attention kernel, which still has room for improvement (python/sglang/srt/layers/radix_attention.py);
  4. When leaving the SP part, the whole sequence is gathered again, because the remaining layers operate on the full sequence (see the sketch after this list);
  5. The output logits are switched back to the original layout before sampling;
  6. Miscellaneous modifications, including: parallel state (python/sglang/srt/layers/parallel_utils/parallel_state.py), wiring the components together in the model runner (python/sglang/srt/managers/controller/model_runner.py), the model definition (python/sglang/srt/models/llama2.py), server args (python/sglang/srt/server_args.py), and tests.
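
To make steps 1 and 4 concrete, here is a minimal sketch of the partition switch around attention, assuming a simple contiguous split of the sequence across SP ranks. It is illustrative only, not the code in this PR: the helper names (to_sp_layout, sp_attention_block) and the sp_group handle are assumptions standing in for the batch preparation in infer_batch and the communication groups created in parallel_state.py.

```python
# Illustrative sketch (not the PR's implementation) of keeping a local
# sequence shard inside the SP region and gathering the full sequence back.
# Assumes torch.distributed is initialized and `sp_group` is the SP
# communication group (created elsewhere, e.g. in parallel_state.py).
import torch
import torch.distributed as dist


def to_sp_layout(input_ids: torch.Tensor, sp_rank: int, sp_size: int) -> torch.Tensor:
    """Keep only the contiguous chunk of the sequence owned by this SP rank."""
    chunk = (input_ids.shape[0] + sp_size - 1) // sp_size
    return input_ids[sp_rank * chunk : (sp_rank + 1) * chunk]


def sp_attention_block(hidden_local: torch.Tensor, attn, sp_group) -> torch.Tensor:
    """Run attention on the local shard, then gather the whole sequence so the
    rest of the model can stay fully tensor-parallel."""
    # Inside the SP region, each rank computes attention (and stores KV cache)
    # only for its own slice of the sequence.
    out_local = attn(hidden_local)

    # Collect all shards before leaving the SP region, because the following
    # layers expect the whole sequence.
    # NOTE: all_gather assumes equal shard sizes; uneven sequences need
    # padding, as the PR does for positions and out_cache_loc.
    shards = [torch.empty_like(out_local) for _ in range(dist.get_world_size(sp_group))]
    dist.all_gather(shards, out_local, group=sp_group)
    return torch.cat(shards, dim=0)
```

The real layout change also reorders tokens across requests and updates the FlashInfer metadata, which this sketch omits.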

Checklist

  • Ensure pre-commit (`pre-commit run --all-files`) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

ZYHowell and others added 24 commits July 19, 2024 13:27
Sequence Parallel system setup
* update layout

* bug fix
* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers

* fix: simplify flashinfer kernel initialization (begin_forward() and end_forward())

* test: add logic for sp worker 1 which is basically the same but with different orders of kernel calls

* chore: format tweak

* feat: a general seq parallel attention kernel that achieves workload balance

* fix: minor tweak loop iteration within ring attention

* feat [radix_attention]: seq_parallel kernel with sync communication.

TODO: turn communication into async fashion and overlap it with computation

* test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet

* chore [radix_attention]: format tweak

* feat: async communication within ring attention

* fix [parallel_utils]: add missed files

* fix [infer_batch]: set default values for newly added sp-related metadata

* fix [bench_latency]: minor fixes to input args

* feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled

* feat [linear]: add QKVParallelLinear

* feat [llama2]: update llama model to use our QKVParallelLinear

* feat [model_runner]: initialize model parallel with sequence parallel

* fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer initialization args

* fix [bench_latency]: load model with sp_rank

* feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1

* debug: stash current debug changes

* fix [radix_attention]: reshape q tensor before running the kernel

* bug fix for sp layout types

* fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values

* fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication hang.

* chore [bench_latency]: disable decode for now since we haven't supported it

* upstream with correct prefill sp layout

* fix early exit on decode SP

* chore: tweak format

* update layout

* bug fix

* fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting.

* fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case

* chore: tweak format

* fix [radix_attention]: revert commented-out kv cache store operations in normal attention

* fix: adjust k, v tensor shape to align with both TP and SP setting

* chore [llama2]: minor adjustment

* fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue

* test: update test cases to align with current kernel in args

* fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention

* chore [radix_attention]: clean up comments

* fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM

* fix [infer_batch]: adopt SP KV cache allocation

* feat [linear]: correctly partition q proj along the num_heads dimension with GQA

* chore [llama2]: clean up stable variables

* feat [infer_batch]: adjust positions to SP layout when preparing input_metadata

* feat [infer_batch]: use dedicated paged attn kernel for cross-SP-shard attn

* feat [parallel_state]: create sequence parallel comm groups

* test [sp_comm_group]: simple test case with sp_size = 2

* doc [parallel_state]: doc string for our SP group organization

* fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store

* feat [radix_attn, infer_batch]: create masks for padded sequences; attn now works for unevenly distributed sequences too

* chore [bench_latency]: revert original prompts

* fix [parallel_state]: rename "actual" to "kv"

* refactor [radix_attention]: unified two cases with different comm-comp tradeoffs

* chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]"

* fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args

* fix [infer_batch]: only pad positions and out_cache_loc for prefill

* chore [linear]: clean up and revise comments

* chore [parallel_state]: revise comments

* chore [linear]: revise comments and class names

* chore [radix_attention]: add defensive checks

---------

Co-authored-by: ZYHowell <yhzhuang@cmu.edu>
@zhyncs (Member) commented Aug 12, 2024

@ZYHowell Thanks for your contribution! Could you rebase onto the latest main branch and resolve the conflicts? Thanks!

@zhyncs mentioned this pull request Aug 12, 2024
ZYHowell and others added 8 commits September 2, 2024 16:23

@merrymercy (Contributor)

moved to #1436

@merrymercy closed this Sep 17, 2024