[Feature] Zigzag Ring attention #5905

Edenzzzz · 2024-07-12T14:54:53Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Implements Ring Attention from https://arxiv.org/abs/2310.01889 and supports Llama. Adopts the zigzag batch splitting scheme for load-balancing (see illustration below).
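
As a rough illustration of why a zigzag split balances causal-attention work, here is a minimal pure-Python sketch. It assumes the commonly used zigzag layout, in which the sequence is cut into `2 * sp_size` chunks and rank `r` keeps chunk `r` together with chunk `2 * sp_size - 1 - r`. The helper names `zigzag_split` and `causal_work` are hypothetical and are not taken from this PR's code.

```python
def zigzag_split(tokens, sp_size):
    """Split one sequence into sp_size shards using a zigzag layout.

    Assumption: the sequence is cut into 2 * sp_size equal chunks and
    rank r keeps chunk r plus chunk (2 * sp_size - 1 - r).
    """
    num_chunks = 2 * sp_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * sp_size")
    chunk_len = len(tokens) // num_chunks
    chunks = [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]
    return [chunks[r] + chunks[num_chunks - 1 - r] for r in range(sp_size)]


def causal_work(positions):
    """Count the (query, key) pairs a causal mask keeps for these query positions."""
    return sum(p + 1 for p in positions)


positions = list(range(16))
shards = zigzag_split(positions, sp_size=4)
work = [causal_work(shard) for shard in shards]

# A contiguous split of the same sequence, for comparison.
contiguous = [positions[i * 4:(i + 1) * 4] for i in range(4)]
contiguous_work = [causal_work(shard) for shard in contiguous]

print(shards)           # each rank holds one early and one late chunk
print(work)             # equal work per rank
print(contiguous_work)  # later ranks do far more work
```

With a contiguous split the last rank attends to almost the whole sequence while the first rank attends to very little; pairing an early chunk with a late one evens this out.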

Supports both batched sequences and packed sequences.
Use two streams in the forward pass to overlap the low-occupancy output correction with the next step's flash attn kernel. Carefully optimizes buffers to minimize memory pressure. Provides an experimental Triton kernel for output correction.
Save comm & compute by replacing the gather sequence op in cross entropy with a loss reduction.
Enable TP + SP in plugin and update corresponding llama and command policies.
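
The "output correction" in the list above refers to rescaling partial attention outputs as new KV blocks arrive around the ring. A minimal sketch of that idea for a single query vector, in pure Python, is shown below. It is the standard log-sum-exp merge, not this PR's kernel; the names `attend` and `merge` are hypothetical, and the usual `1/sqrt(d)` score scaling is omitted for brevity.

```python
import math


def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results computed over disjoint KV blocks."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    # Each partial output is re-weighted by its share of the total softmax mass.
    w_a, w_b = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [w_a * a + w_b * b for a, b in zip(out_a, out_b)], lse


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Reference: attention over all keys at once.
full_out, full_lse = attend(q, keys, values)

# Ring-style: process two KV blocks separately, then correct and merge.
out, lse = attend(q, keys[:2], values[:2])
out, lse = merge(out, lse, *attend(q, keys[2:], values[2:]))
```

Because the merge only needs each block's output and its log-sum-exp, it is a small elementwise operation, which is consistent with the description of it as low-occupancy work that can overlap with the next attention kernel.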

！Future work

For easy compatibility with other SP modes and HF interfaces, in packed sequence mode we do not flatten the batch dim but instead recreate a new tensor with shape (bs, max_seqlen // sp_size). To eliminate this overhead, we can adapt each model's forward to accept flattened inputs.
Pad each seq to be divisible by sp_size.
Place the grad checkpoints between attn and FFN so that we only recompute FFN, not ring attention (see https://github.com/RulinShao/LightSeq/blob/8f486dad3d0670057dfbe3b30c003080b61c5325/lightseq/lightseq_ckpt_monkey_patch.py#L499)
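
For the padding item in the list above, one possible shape of such a helper is sketched below for a packed batch. This is an assumption about how it could look, not code from this PR: `pad_packed`, `pad_id`, and the returned cumulative-length list are hypothetical names chosen for illustration.

```python
def pad_packed(seqs, sp_size, pad_id=0):
    """Pad every sequence to a multiple of sp_size and pack them back to back.

    Returns the flat token list and the cumulative sequence boundaries.
    """
    padded, cu_seqlens = [], [0]
    for seq in seqs:
        pad = -len(seq) % sp_size  # number of pad tokens needed, 0 if already aligned
        padded.extend(list(seq) + [pad_id] * pad)
        cu_seqlens.append(len(padded))
    return padded, cu_seqlens


tokens, cu_seqlens = pad_packed([[1, 2, 3], [4, 5, 6, 7, 8]], sp_size=4)
print(tokens)
print(cu_seqlens)
```

Padding each sequence individually, rather than the packed total, keeps every sequence boundary aligned to a shard boundary.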

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

colossalai/shardformer/layer/_operation.py

colossalai/shardformer/layer/utils.py

colossalai/lazy/pretrained.py

Edenzzzz · 2024-07-23T02:56:13Z

.pre-commit-config.yaml

colossalai/shardformer/layer/attn.py

colossalai/shardformer/modeling/llama.py

examples/language/llama/benchmark.py

colossalai/booster/plugin/hybrid_parallel_plugin.py

colossalai/shardformer/layer/attn.py

* add SimPO * fix dataloader * remove debug code * add orpo * fix style * fix colossalai, transformers version * fix colossalai, transformers version * fix colossalai, transformers version * fix torch colossalai version * update transformers version * [shardformer] DeepseekMoE support (#5871) * [Feature] deepseek moe expert parallel implement * [misc] fix typo, remove redundant file (#5867) * [misc] fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feature] deepseek support & unit test * [misc] remove debug code & useless print * [misc] fix typos (#5872) * [Feature] remove modeling file, use auto config. (#5884) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [Deepseek] remove redundant code (#5888) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [misc] remove redundant code * [Feature/deepseek] resolve comment. 
(#5889) * [misc] fix typos * [Feature] deepseek support via auto model, remove modeling file * [misc] delete useless file * [misc] fix typos * [misc] remove redundant code * [misc] mv module replacement into if branch * [misc] add some warning message and modify some code in unit test * [misc] fix typos --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Hoxfix] Fix CUDA_DEVICE_MAX_CONNECTIONS for comm overlap Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [Feat] Diffusion Model(PixArtAlpha/StableDiffusion3) Support (#5838) * Diffusion Model Inference support * Stable Diffusion 3 Support * pixartalpha support * [HotFix] CI,import,requirements-test for #5838 (#5892) * [Hot Fix] CI,import,requirements-test --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feature] Enable PP + SP for llama (#5868) * fix cross-PP-stage position id length diff bug * fix typo * fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use a one cross entropy func for all shardformer models --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM (#5897) * add benchmark for sft, dpo, simpo, orpo. Add benchmarking result. 
Support lora with gradient checkpoint * fix style * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix eval * hotfix citation * [zero] support all-gather overlap (#5898) * [zero] support all-gather overlap * [zero] add overlap all-gather flag * [misc] fix typo * [zero] update api * fix orpo cross entropy loss * [Auto Parallel]: Speed up intra-op plan generation by 44% (#5446) * Remove unnecessary calls to deepcopy * Build DimSpec's difference dict only once This change considerably speeds up construction speed of DimSpec objects. The difference_dict is the same for each DimSpec object, so a single copy of it is enough. * Fix documentation of DimSpec's difference method * [ShardFormer] fix qwen2 sp (#5903) * [compatibility] support torch 2.2 (#5875) * Support Pytorch 2.2.2 * keep build_on_pr file and update .compatibility * fix object_to_tensor usage when torch>=2.3.0 (#5820) * [misc] support torch2.3 (#5893) * [misc] support torch2.3 * [devops] update compatibility ci * [devops] update compatibility ci * [devops] add debug * [devops] add debug * [devops] add debug * [devops] add debug * [devops] remove debug * [devops] remove debug * [release] update version (#5912) * [plugin] support all-gather overlap for hybrid parallel (#5919) * [plugin] fixed all-gather overlap support for hybrid parallel * add kto * fix style, add kto data sample * [Examples] Add lazy init to OPT and GPT examples (#5924) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [ColossalChat] Hotfix for ColossalChat (#5910) * add ignore and tiny llama * fix path issue * run style * fix issue * update bash * add ignore and tiny llama * fix path issue * run style * fix issue * update bash * fix ddp issue * add Qwen 1.5 32B * refactor tokenization * [FIX BUG] UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value (#5931) * cannot access local variable 'default_conversation' where it is not 
associated with a value set default value for 'default_conversation' * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix test data * refactor evaluation * remove real data path * remove real data path * Add n_fused as an input from native_module (#5894) * [FIX BUG] convert env param to int in (#5934) * [Hotfix] Fix ZeRO typo #5936 Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [Feature] Add a switch to control whether the model checkpoint needs to be saved after each epoch ends (#5941) * Add a switch to control whether the model checkpoint needs to be saved after each epoch ends * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix style * fix style * fix style * [shardformer] hotfix attn mask (#5945) * [shardformer] hotfix attn mask (#5947) * [Feat] Distrifusion Acceleration Support for Diffusion Inference (#5895) * Distrifusion Support source * comp comm overlap optimization * sd3 benchmark * pixart distrifusion bug fix * sd3 bug fix and benchmark * generation bug fix * naming fix * add docstring, fix counter and shape error * add reference * readme and requirement * [zero] hotfix update master params (#5951) * [release] update version (#5952) * [Chat] Fix lora (#5946) * fix merging * remove filepath * fix style * Update README.md (#5958) * [hotfix] Remove unused plan section (#5957) * remove readme * fix readme * update * [test] add mixtral for sequence classification * [test] add mixtral transformer test * [moe] fix plugin * [test] mixtra pp shard test * [chore] handle non member group * [zero] solve hang * [test] pass mixtral shardformer test * [moe] implement transit between non moe tp and ep * [zero] solve hang * [misc] solve booster hang by rename the 
variable * solve hang when parallel mode = pp + dp * [moe] implement submesh initialization * [moe] add mixtral dp grad scaling when not all experts are activated * [chore] manually revert unintended commit * [chore] trivial fix * [chore] arg pass & remove drop token * [test] add mixtral modelling test * [moe] implement tp * [moe] test deepseek * [moe] clean legacy code * [Feature] MoE Ulysses Support (#5918) * moe sp support * moe sp bug solve * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [chore] minor fix * [moe] init moe plugin comm setting with sp * moe sp + ep bug fix * [moe] finalize test (no pp) * [moe] full test for deepseek and mixtral (pp + sp to fix) * [chore] minor fix after rebase * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [chore] solve moe ckpt test failure and some other arg pass failure * [moe] remove ops * [test] fix test: test_zero1_2 * [bug] fix: somehow logger hangs the program * [moe] deepseek moe sp support * [test] add check * [deepseek] replace attn (a workaround for bug in transformers) * [misc] skip redunant test * [misc] remove debug/print code * [moe] refactor mesh assignment * Revert "[moe] implement submesh initialization" This reverts commit 2f9bce6. 
* [chore] change moe_pg_mesh to private * [misc] remove incompatible test config * [misc] fix ci failure: change default value to false in moe plugin * [misc] remove useless condition * [chore] docstring * [moe] remove force_overlap_comm flag and add warning instead * [doc] add MoeHybridParallelPlugin docstring * [moe] solve dp axis issue * [chore] remove redundant test case, print string & reduce test tokens * [feat] Dist Loader for Eval (#5950) * support auto distributed data loader * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support auto distributed data loader * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix tp error * remove unused parameters * remove unused * update inference * update docs * update inference --------- Co-authored-by: Michelle <qianranma8@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [lora] lora support hybrid parallel plugin (#5956) * lora support hybrid plugin * fix * fix * fix * fix * Support overall loss, update KTO logging * [Docs] clarify launch port Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [Hotfix] README link (#5966) * update ignore * update readme * run style * update readme * [Hotfix] Avoid fused RMSnorm import error without apex (#5985) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * [Chat] fix readme (#5989) * fix readme * fix readme, tokenization fully tested * fix readme, tokenization fully tested * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: root <root@notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9-0.notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9.colossal-ai.svc.cluster.local> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix sync condition (#6000) * [plugin] add cast inputs option for zero (#6003) * [pre-commit.ci] pre-commit 
autoupdate (#5995) updates: - [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [misc] Bypass the huggingface bug to solve the mask mismatch problem (#5991) * [Feature] Zigzag Ring attention (#5905) * halfway * fix cross-PP-stage position id length diff bug * fix typo * fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * unified cross entropy func for all shardformer models * remove redundant lines * add basic ring attn; debug cross entropy * fwd bwd logic complete * fwd bwd logic complete; add experimental triton rescale * precision tests passed * precision tests passed * fix typos and remove misc files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add sp_mode to benchmark; fix varlen interface * update softmax_lse shape by new interface * change tester name * remove buffer clone; support packed seq layout * add varlen tests * fix typo * all tests passed * add dkv_group; fix mask * remove debug statements --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [misc] update compatibility (#6008) * [misc] update compatibility * [misc] update requirements * [devops] disable requirements cache * [test] fix torch ddp test * [test] fix rerun on address in use * [test] fix lazy init * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the merge * fix the merge * overlap kv comm with output rescale (#6017) Co-authored-by: Edenzzzz <wtan45@wisc.edu> * fix the merge * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the merge * fix * fix * fix the merge * fix * [misc] Use dist logger in plugins (#6011) * use dist 
logger in plugins * remove trash * print on rank 0 --------- Co-authored-by: Edenzzzz <wtan45@wisc.edu> * fix * fix * fix * fix * fix the merge * fix * fix * fix * fix --------- Co-authored-by: YeAnbang <anbangy2@outlook.com> Co-authored-by: Haze188 <haze188@qq.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Edenzzzz <wenxuan.tan@wisc.edu> Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Co-authored-by: Guangyao Zhang <xjtu521@qq.com> Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com> Co-authored-by: Hongxin Liu <lhx0217@gmail.com> Co-authored-by: Stephan Kö <stephankoe@users.noreply.github.com> Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com> Co-authored-by: Tong Li <tong.li352711588@gmail.com> Co-authored-by: zhurunhua <1281592874@qq.com> Co-authored-by: Insu Jang <insujang@umich.edu> Co-authored-by: Gao, Ruiyuan <905370712@qq.com> Co-authored-by: hxwang <wang1570@e.ntu.edu.sg> Co-authored-by: Michelle <qianranma8@gmail.com> Co-authored-by: root <root@notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9-0.notebook-8f919155-6035-47b4-9c6f-1be133b9e2c9.colossal-ai.svc.cluster.local>

Edenzzzz requested a review from a team as a code owner July 12, 2024 14:54

Edenzzzz force-pushed the ring-attn branch 2 times, most recently from 5cd750b to 77f4eaf Compare July 14, 2024 14:18

ver217 reviewed Jul 17, 2024

colossalai/shardformer/layer/_operation.py (outdated, resolved)

colossalai/shardformer/layer/utils.py (outdated, resolved)

Edenzzzz force-pushed the ring-attn branch 3 times, most recently from 0ceab42 to 501205d Compare July 19, 2024 00:42

Edenzzzz commented Jul 21, 2024

colossalai/lazy/pretrained.py (resolved)

Edenzzzz force-pushed the ring-attn branch 3 times, most recently from 49cb342 to c46f7c8 Compare July 22, 2024 04:07

Edenzzzz enabled auto-merge (squash) July 22, 2024 04:09

Edenzzzz force-pushed the ring-attn branch 2 times, most recently from 62265ab to f326884 Compare July 22, 2024 10:12

Edenzzzz force-pushed the ring-attn branch 4 times, most recently from 9164c4a to 05017d3 Compare August 1, 2024 03:39

Edenzzzz disabled auto-merge August 1, 2024 08:53

Edenzzzz enabled auto-merge (squash) August 1, 2024 08:53

Edenzzzz force-pushed the ring-attn branch 5 times, most recently from 6bf9936 to bd1bfcb Compare August 7, 2024 11:48

Edenzzzz added 4 commits August 7, 2024 21:33

halfway

519818f

fix cross-PP-stage position id length diff bug

04b14a2

fix typo

45b9ac1

fix typo

3047c4e

Edenzzzz and others added 3 commits August 7, 2024 21:37

q1 index only once

e760507

remove events to simplify stream sync

e90e984

clarify kv_comm.wait()

e26c910

Edenzzzz force-pushed the ring-attn branch from bd1bfcb to e26c910 Compare August 8, 2024 02:38

ver217 reviewed Aug 8, 2024

.pre-commit-config.yaml (outdated, resolved)

colossalai/shardformer/layer/attn.py (outdated, resolved)

colossalai/shardformer/layer/attn.py (outdated, resolved)

colossalai/shardformer/layer/attn.py (outdated, resolved)

Edenzzzz force-pushed the ring-attn branch 2 times, most recently from 18e4c88 to 02466c6 Compare August 8, 2024 19:01

use torch.compile; add nsys

b6b2333

Edenzzzz force-pushed the ring-attn branch from 02466c6 to b6b2333 Compare August 9, 2024 01:36

ver217 reviewed Aug 9, 2024

simplify forward/backward logic

d3831b4

Edenzzzz force-pushed the ring-attn branch 2 times, most recently from ff0c09c to caecd90 Compare August 12, 2024 11:09

2d ring forward passed

0094bc0

Edenzzzz force-pushed the ring-attn branch from caecd90 to 0094bc0 Compare August 12, 2024 11:10

2d ring backward passed

581ec0f

Edenzzzz force-pushed the ring-attn branch from 2ec3eca to 581ec0f Compare August 13, 2024 14:47

Edenzzzz added 2 commits August 14, 2024 06:03

fixes

1344849

fix ring attn loss

e6bcde2

Edenzzzz force-pushed the ring-attn branch 2 times, most recently from 31f8e34 to c663265 Compare August 14, 2024 09:12

2D ring backward + llama passed

b4c0809

Edenzzzz force-pushed the ring-attn branch from c663265 to b4c0809 Compare August 14, 2024 09:13

ver217 reviewed Aug 15, 2024

colossalai/booster/plugin/hybrid_parallel_plugin.py (resolved)

colossalai/shardformer/layer/attn.py (outdated, resolved)

colossalai/shardformer/layer/attn.py (outdated, resolved)

Edenzzzz added 3 commits August 15, 2024 03:17

follow conventions

26b008e

fix dist logger

a68dd2f

add a manual inner ring size option

be5fed5

ver217 approved these changes Aug 16, 2024

Edenzzzz merged commit f5c84af into hpcaitech:main Aug 16, 2024
7 of 8 checks passed
