diff --git a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po b/doc/code-docs/locales/en/LC_MESSAGES/parallel.po index b948e4f9..ab842d15 100644 --- a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po +++ b/doc/code-docs/locales/en/LC_MESSAGES/parallel.po @@ -7,7 +7,7 @@ msgid "" msgstr "" "Project-Id-Version: InternLM \n" "Report-Msgid-Bugs-To: \n" -"POT-Creation-Date: 2024-08-30 15:51+0800\n" +"POT-Creation-Date: 2024-10-08 17:17+0800\n" "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" "Last-Translator: FULL NAME \n" "Language: en\n" @@ -16,7 +16,7 @@ msgstr "" "MIME-Version: 1.0\n" "Content-Type: text/plain; charset=utf-8\n" "Content-Transfer-Encoding: 8bit\n" -"Generated-By: Babel 2.15.0\n" +"Generated-By: Babel 2.12.1\n" #: ../../source/parallel.rst:2 msgid "并行模式与原理" @@ -274,15 +274,15 @@ msgstr "" " sequence parallelism and tensor parallelism, performing an ``all-" "gather`` operation along the seqlen dimension of activation values. After" " this communication is completed, the shape of the activation values " -"becomes the full ``[seqlen, hidden_size]`` , and then it enters the scope " -"of the tensor parallelism module. The communication of ``ḡ`` is situated " -"at the junction of tensor parallelism and sequence parallelism, requiring" -" the transformation of the ``all-reduce`` communication operation from " -"MTP into a ``reduce-scatter`` operation to achieve the split along the " -"seqlen dimension. This results in the activation values having a shape of" -" ``[seqlen/tp, hidden_size]`` , enabling a smooth transition into the " -"sequence parallelism phase. The same principles apply during the backward" -" pass." +"becomes the full ``[seqlen, hidden_size]`` , and then it enters the scope" +" of the tensor parallelism module. The communication of ``ḡ`` is situated" +" at the junction of tensor parallelism and sequence parallelism, " +"requiring the transformation of the ``all-reduce`` communication " +"operation from MTP into a ``reduce-scatter`` operation to achieve the " +"split along the seqlen dimension. This results in the activation values " +"having a shape of ``[seqlen/tp, hidden_size]`` , enabling a smooth " +"transition into the sequence parallelism phase. The same principles apply" +" during the backward pass." #: ../../source/parallel.rst:85 msgid "FSP" @@ -448,7 +448,13 @@ msgid "" ":class:`NonPipelineSchedule`." msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad of msgid "参数" msgstr "Parameter" @@ -532,7 +538,10 @@ msgstr "" msgid "If False, the output and label won't be returned." 
msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of msgid "返回" msgstr "" @@ -559,7 +568,10 @@ msgid "" "accumulated from all stages." msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of msgid "返回类型" msgstr "Return type" @@ -745,6 +757,157 @@ msgstr "" msgid "Whether the gradient is success updated, and the gradient." msgstr "" +#: ../../source/parallel.rst:214 ../../source/parallel.rst:228 +msgid "2D-Attention" +msgstr "2D-Attention" + +#: ../../source/parallel.rst:215 +msgid "" +"2D-Attention是InternEvo系统扩展ISP的序列化并行方案,集成了Ring-Attention和ISP,能够支持更长的序列。 " +"ISP由于需要在attention前后分别进行All2All通信,在 ``sequence parallel`` 和 ``head " +"parallel`` 之间进行切换, 因此 ``sp size`` 自然受到 ``head number`` 的限制,无法进行扩展;而Ring-" +"Attention由于在attention计算过程中需要进行P2P通信,可能会遇到通信低效的问题。" +msgstr "" +"2D-Attention is the sequence parallelism within InternEvo, integrating both the Ring-Attention and ISP. " +" The sequence parallel size in ISP is constrained by the head number, limiting the scalability. " +" Meanwhile, the Ring-Attention may lead to suboptimal performance due to the inefficient P2P communication." + +#: ../../source/parallel.rst:219 +msgid "" +"2D-Attention将ISP和Ring-Attention相结合,组成一个混合序列并行,能够解除 ``sp size`` 小于等于 " +"``head number`` 的限制,同时避免P2P低效带宽利用。" +msgstr "" +"2D-Attention, by integrating ISP and Ring-Attention, not only overcomes the limitation that the sequence parallel size " +"must not exceed the head number, but also enhances the P2P communication efficiency." + +#: ../../source/parallel.rst:221 +msgid "" +"在2D-Attention中, ``sp size = hp size * cp size`` 。其中, ``hp size`` 为 ``head" +" parallel size`` , ``cp size`` 为 ``context parallel size`` (Ring-" +"Attention)。 下图展示了 ``hp=2`` , ``cp=4`` 的例子。" +msgstr "" +"In 2D-Attention, ``sp size = hp size * cp size``, where ``hp size`` represents the head parallel size, " +"``cp size`` denotes the context parallel size. The following figure shows an example where hp=2, cp=4." + +#: ../../source/parallel.rst:230 +msgid "" +"在上图中,不同颜色表示不同的head,在做第一个All2All之前,GPU0~3拥有两个head的前4个token; " +"GPU4~7拥有两个head的后4个token。在第一个All2All之后,GPU0~3拥有第一个head的所有token,且将第一个head的所有token切成4份" +",做Ring-Attention,GPU4~7同理;在第2个All2All之后,所有GPU又回到初始状态。" +msgstr "" +"In the above figure, different color represents different head. Before conducting the first All2All, GPU 0~3 process the first 4 tokens for two heads, " +"while GPU 4~7 hold the last 4 tokens of the same two heads. After the first All2All, GPU 0~3 receive all tokens for the first head " +"and divide these tokens into 4 segments to perform Ring-Attention. GPU 4~7 follow a similar process. " +"All GPUs return the initial states after the second All2All." + +#: ../../source/parallel.rst:233 +msgid "InternEvo针对2D-Attention做了一些更进一步的优化:" +msgstr "" +"InternEvo implements several optimizations for 2D-Attention to achieve additional performance enhancements." 
+
+#: ../../source/parallel.rst:235
+msgid ""
+"由于因果模型的限制,在Ring-Attention中会导致每个GPU的计算负载不均衡,因此InternEvo参考了 `zigzag "
+"`_ ,在2D-"
+"Attention中的 ``context parallel`` 使用了zigzag模式"
+msgstr ""
+"Because of the causal model, Ring-Attention leaves each GPU with an "
+"uneven computation load. Following `zigzag `_, InternEvo therefore uses "
+"the zigzag pattern for ``context parallel`` in 2D-Attention."
+
+#: ../../source/parallel.rst:236
+msgid ""
+"为了充分利用集群的网卡资源,提高通信效率,2D-Attention在做 ``context parallel`` 的时候,引入了一个 "
+"``window size`` 概念,即为Double-Ring Attention。下图展示了 ``cp=8`` , "
+"``window_size=4`` 的例子。GPU 0~3和GPU 4~7内部分别做inner Ring "
+"Attention,进行节点内P2P通信。GPU 0和4做Outer Ring "
+"Attention,进行节点间P2P通信,网卡利用示意图如下图所示。"
+msgstr ""
+"To make full use of the cluster's NIC resources and improve "
+"communication efficiency, 2D-Attention introduces a ``window size`` "
+"concept into ``context parallel``, known as Double-Ring Attention. The "
+"following figure shows an example with ``cp=8`` and ``window_size=4``. "
+"GPUs 0~3 and GPUs 4~7 each perform an inner Ring Attention with "
+"intra-node P2P communication, while GPU 0 and GPU 4 perform the outer "
+"Ring Attention with inter-node P2P communication, as illustrated in the "
+"NIC utilization figure below."
+
+#: ../../source/parallel.rst:242
+msgid "Double-Ring-Attention"
+msgstr "Double-Ring-Attention"
+
+#: ../../source/parallel.rst:244
+msgid ""
+"由于2D-Attention中同时涉及到 ``head parallel`` 和 ``context parallel`` "
+",因此InternEvo提供了可配置选项,用于控制 ``head parallel`` 和 ``context parallel`` "
+"创建通信组的优先级"
+msgstr ""
+"Since 2D-Attention involves both ``head parallel`` and ``context "
+"parallel``, InternEvo provides configuration options to control the "
+"priority with which the ``head parallel`` and ``context parallel`` "
+"communication groups are created."
+
+#: ../../source/parallel.rst:245
+msgid ""
+"为了充分利用网卡资源,需要特别注意创建 ``context parallel`` 通信组。当 ``head parallel`` 优先创建通信组,"
+" ``context parallel`` 的GPU天然就是interleaved,这时天然能够利用网卡资源;当 ``context "
+"parallel`` 优先创建通信组时,这些 ``context parallel`` "
+"被分配到的GPU往往是连续的,为了提高通信效率,InternEvo提供了interleaved配置选项,可以在 ``window size > "
+"1`` 的情况,重排 ``context parallel`` 的GPU。"
+msgstr ""
+"Particular care is needed when creating the ``context parallel`` "
+"communication groups in order to make full use of NIC resources. When "
+"``head parallel`` groups are created first, the GPUs assigned to "
+"``context parallel`` are naturally interleaved, which readily exploits "
+"the NICs. Conversely, when ``context parallel`` groups are created "
+"first, the GPUs assigned to them tend to be consecutive. To improve "
+"communication efficiency in that case, InternEvo provides an interleaved "
+"configuration option that reorders the ``context parallel`` GPUs when "
+"``window size > 1``."
+
+#: ../../source/parallel.rst:247
+msgid "下图展示了一个Double-Ring-Attention充分利用网卡资源的示例。"
+msgstr ""
+"The following figure shows an example of how Double-Ring-Attention makes "
+"full use of NIC resources."
+
+#: ../../source/parallel.rst:253
+msgid "Communication in Double-Ring-Attention"
+msgstr "Communication in Double-Ring-Attention"
+
+#: ../../source/parallel.rst:255
+msgid "InternEvo在parallel config里面添加了sequence_2D用于配置2D-Attention。"
+msgstr ""
+"InternEvo adds a ``sequence_2D`` item to the parallel config for "
+"configuring 2D-Attention."
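The placement rules just described can be pictured with a toy sketch over 8 GPUs (``hp=2``, ``cp=4``). The grouping logic below is an assumption made purely for illustration; it is not InternEvo's actual group-creation code.

.. code-block:: python

    # Toy illustration of head-first vs. context-first device placement.
    HP, CP = 2, 4
    ranks = list(range(HP * CP))

    def build_groups(head_first: bool):
        """Return (head-parallel groups, context-parallel groups)."""
        if head_first:
            # Head parallel groups take consecutive ranks, so the context
            # parallel groups come out interleaved (stride hp), which
            # spreads their traffic across the NICs.
            hp_groups = [ranks[i:i + HP] for i in range(0, len(ranks), HP)]
            cp_groups = [ranks[i::HP] for i in range(HP)]
        else:
            # Context parallel groups take consecutive ranks; this is the
            # case where the interleaved option (with window size > 1)
            # can reorder them.
            cp_groups = [ranks[i:i + CP] for i in range(0, len(ranks), CP)]
            hp_groups = [ranks[i::CP] for i in range(CP)]
        return hp_groups, cp_groups

    for head_first in (True, False):
        hp_groups, cp_groups = build_groups(head_first)
        print(f"head_first={head_first}: hp={hp_groups} cp={cp_groups}")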
+
+#: ../../source/parallel.rst:274
+msgid "``sequence_2D.enable`` 字段表示是否启用2D-Attention"
+msgstr "``sequence_2D.enable`` indicates whether 2D-Attention is enabled"
+
+#: ../../source/parallel.rst:276
+msgid "``sequence_2D.head_size`` 字段表示head parallel size"
+msgstr "``sequence_2D.head_size`` denotes the head parallel size"
+
+#: ../../source/parallel.rst:278
+msgid "``sequence_2D.context_size`` 字段表示context parallel size"
+msgstr "``sequence_2D.context_size`` represents the context parallel size"
+
+#: ../../source/parallel.rst:280
+msgid "``sequence_2D.window_size`` 字段表示Double-Ring Attention中的window_size"
+msgstr "``sequence_2D.window_size`` indicates the window_size in Double-Ring Attention"
+
+#: ../../source/parallel.rst:282
+msgid ""
+"``sequence_2D.device_placement_strategy.head_first`` 字段表示是否优先分配head "
+"parallel通信组,若为False,则为context-first"
+msgstr ""
+"``sequence_2D.device_placement_strategy.head_first`` determines whether "
+"the head parallel communication group is created first. If set to False, "
+"context parallel is given priority instead."
+
+#: ../../source/parallel.rst:284
+msgid ""
+"``sequence_2D.device_placement_strategy.interleaved`` 字段表示是否对context "
+"parallel的GPU重排,该字段在 "
+"``sequence_2D.device_placement_strategy.head_first=False`` 和 "
+"``sequence_2D.window_size>1`` 时,推荐设置为 ``True``"
+msgstr ""
+"``sequence_2D.device_placement_strategy.interleaved`` determines whether "
+"to rearrange the GPUs for context parallel. It is recommended to set it "
+"to ``True`` when "
+"``sequence_2D.device_placement_strategy.head_first=False`` and "
+"``sequence_2D.window_size>1``."
+
+#: ../../source/parallel.rst:286
+msgid ""
+"关于 2D-Attention更多的设计思路和性能评测,请参考论文 `LoongTrain: Efficient Training of "
+"Long-Sequence LLMs with Head-Context Parallelism "
+"`_"
+msgstr ""
+"For more on the design rationale and performance evaluation of "
+"2D-Attention, please refer to the paper `LoongTrain: Efficient Training "
+"of Long-Sequence LLMs with Head-Context Parallelism `_"
+
#~ msgid "A tuple of (output, label, loss), loss and label could be None."
#~ msgstr ""

diff --git a/doc/code-docs/locales/en/LC_MESSAGES/training.po b/doc/code-docs/locales/en/LC_MESSAGES/training.po
index 7ee2c17a..25b4a492 100644
--- a/doc/code-docs/locales/en/LC_MESSAGES/training.po
+++ b/doc/code-docs/locales/en/LC_MESSAGES/training.po
@@ -7,7 +7,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2024-08-30 16:09+0800\n"
+"POT-Creation-Date: 2024-10-08 17:17+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME \n"
"Language: en\n"
@@ -16,7 +16,7 @@ msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.15.0\n"
+"Generated-By: Babel 2.12.1\n"

#: ../../source/training.rst:2
msgid "启动训练脚本"

#: ../../source/training.rst:4
msgid ""
"用户在安装了InternEvo之后,需要自行编写训练启动脚本,请参考: `train.py "
"`_"
msgstr ""
-"After installing InternEvo, users need to write their own training startup scripts. Please refer to:"
-" `train.py `_ "
+"After installing InternEvo, users need to write their own training "
+"startup scripts. Please refer to: `train.py "
+"`_ "

#: ../../source/training.rst:6
msgid ""
"脚本中的流程可以分为三步:参数解析、初始化、启动训练。其中参数解析和初始化过程的具体原理参见: `训练初始化 "
"`_"
msgstr ""
-"The process in the script can be divided into three steps: parameter parsing, initialization, and starting training."
-" For the specific principles of parameter parsing and initialization, please refer to: `Training Initialization " +"The process in the script can be divided into three steps: parameter " +"parsing, initialization, and starting training. For the specific " +"principles of parameter parsing and initialization, please refer to: " +"`Training Initialization " "`_" #: ../../source/training.rst:9 @@ -49,8 +52,11 @@ msgid "" "`_" msgstr "" -"Call the parse_args function to parse the parameters set in the configuration file when starting the training. For more details, see:" -" `Argument Parsing `_" +"Call the parse_args function to parse the parameters set in the " +"configuration file when starting the training. For more details, see: " +"`Argument Parsing " +"`_" #: ../../source/training.rst:17 msgid "初始化过程" @@ -65,8 +71,10 @@ msgid "" "调用 ``initialize_distributed_env`` 函数,支持通过 slurm 或 torch " "方式启动训练脚本,并传入配置文件、端口号、进程随机种子等信息。函数详细说明如下:" msgstr "" -"Call the initialize_distributed_env function, which supports launching the training script through Slurm or Torch," -" and pass in information such as the configuration file, port number, and process random seed. Detailed description of the function is as follows:" +"Call the initialize_distributed_env function, which supports launching " +"the training script through Slurm or Torch, and pass in information such " +"as the configuration file, port number, and process random seed. Detailed" +" description of the function is as follows:" #: ../../source/training.rst:27 msgid "初始化模型" @@ -76,7 +84,10 @@ msgstr "Initialize Model" msgid "" "详细介绍请参考: `模型初始化 `_" -msgstr "Detailed introduction refer to: `Model Initialization `_" +msgstr "" +"Detailed introduction refer to: `Model Initialization " +"`_" #: ../../source/training.rst:34 msgid "初始化训练数据加载器" @@ -86,7 +97,10 @@ msgstr "Initialize Training Dataloader" msgid "" "详细介绍请参考: `数据加载器初始化 `_" -msgstr "Detailed introduction refer to: `Dataloader Initialization `_" +msgstr "" +"Detailed introduction refer to: `Dataloader Initialization " +"`_" #: ../../source/training.rst:41 msgid "初始化验证数据加载器" @@ -94,7 +108,10 @@ msgstr "Initialize Validation Dataloader" #: ../../source/training.rst:46 msgid "初始化验证数据加载器,加载过程与训练数据加载类似,通过配置文件中的 ``VALID_FOLDER `` 字段设置验证数据集路径。" -msgstr "Initialize the validation data loader, which has a loading process similar to that of the training data. The path to the validation dataset is set through the VALID_FOLDER field in the configuration file." +msgstr "" +"Initialize the validation data loader, which has a loading process " +"similar to that of the training data. The path to the validation dataset " +"is set through the VALID_FOLDER field in the configuration file." #: ../../source/training.rst:48 msgid "初始化Trainer" @@ -106,8 +123,11 @@ msgid "" "``internlm.core.trainer.Trainer`` 管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API " "来执行模型训练、评估、梯度清零和参数更新等。" msgstr "" -"The TrainerBuilder interface inherits from the Trainer class, and the training API of InternEvo is managed by internlm.core.trainer.Trainer." -" After defining the training engine and scheduler, we can call the Trainer API to perform model training, evaluation, gradient clearing, and parameter updating, etc." +"The TrainerBuilder interface inherits from the Trainer class, and the " +"training API of InternEvo is managed by internlm.core.trainer.Trainer. " +"After defining the training engine and scheduler, we can call the Trainer" +" API to perform model training, evaluation, gradient clearing, and " +"parameter updating, etc." 
#: ../../source/training.rst:55 msgid "有关详细用法,请参阅 Trainer API 文档和示例。" @@ -115,13 +135,94 @@ msgstr "" "For detailed usage, please refer to Trainer API documentation and " "examples." +#: internlm.core.trainer.Trainer:1 of +msgid "" +"This is a class tending for easy deployments of users' training and " +"evaluation instead of writing their own scripts." +msgstr "" + +#: internlm.core.trainer.Trainer internlm.core.trainer.Trainer.execute_schedule +#: of +msgid "参数" +msgstr "" + +#: internlm.core.trainer.Trainer:4 of +msgid "Engine responsible for the process function." +msgstr "" + +#: internlm.core.trainer.Trainer:6 of +msgid "Runtime schedule. Defaults to None." +msgstr "" + +#: internlm.core.trainer.Trainer.engine:1 of +msgid "" +"Returns the engine that responsible for managing the training and " +"evaluation process." +msgstr "" + +#: internlm.core.trainer.Trainer.schedule:1 of +msgid "Returns the runtime scheduler." +msgstr "" + +#: internlm.core.trainer.Trainer.uses_pipeline:1 of +msgid "Returns whether the pipeline parallel is used or not." +msgstr "" + +#: internlm.core.trainer.Trainer.train:1 of +msgid "Sets the model to training mode." +msgstr "" + +#: internlm.core.trainer.Trainer.eval:1 of +msgid "Sets the model to evaluation mode." +msgstr "" + +#: internlm.core.trainer.Trainer.zero_grad:1 of +msgid "Sets the gradient of all parameters in the model to zero." +msgstr "" + +#: internlm.core.trainer.Trainer.step:1 of +msgid "Executes the parameter update step." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:1 of +msgid "" +"Runs the forward, loss computation, and backward for the model. Returns a" +" tuple of (output, label, loss)." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:4 of +msgid "The data iterator." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:6 of +msgid "Additional keyword arguments." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule of +msgid "返回" +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:8 of +msgid "A tuple of (output, label, loss, moe_loss)." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule of +msgid "返回类型" +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:9 of +msgid "Tuple[:class:`torch.Tensor`]" +msgstr "" + #: ../../source/training.rst:61 msgid "启动训练过程" msgstr "Start Training Process" #: ../../source/training.rst:66 msgid "首先,通过 ``self.train()`` 方法,将模型设置为training状态。" -msgstr "Firstly, by using the self.train() method, the model is set to training mode." +msgstr "" +"Firstly, by using the self.train() method, the model is set to training " +"mode." #: ../../source/training.rst:68 msgid "" @@ -131,13 +232,15 @@ msgid "" "对模型训练结果进行评估。 最后,如果开启了保存ckpt功能,通过 ``try_save_checkpoint`` " "函数保留训练中间状态以及最终训练结果。" msgstr "" -"During each step of the training process, the load_new_batch function is used to load the dataset." -" Then, the execute_schedule scheduler is used to initiate training, and the forward_backward_step " -"begins the forward and backward training process. Afterwards, the self.step() updates the parameters" -" and returns the gradient values. If the step count reaches the number required for validation," -" the model's training results are evaluated using evaluate_on_val_dls. Finally, if the checkpoint" -" saving function is enabled, the intermediate training state and the final training results are" -" saved using the try_save_checkpoint function." 
+"During each step of the training process, the load_new_batch function is " +"used to load the dataset. Then, the execute_schedule scheduler is used to" +" initiate training, and the forward_backward_step begins the forward and " +"backward training process. Afterwards, the self.step() updates the " +"parameters and returns the gradient values. If the step count reaches the" +" number required for validation, the model's training results are " +"evaluated using evaluate_on_val_dls. Finally, if the checkpoint saving " +"function is enabled, the intermediate training state and the final " +"training results are saved using the try_save_checkpoint function." #~ msgid "InternLM 的训练流程可以归纳为两个步骤:" #~ msgstr "The training process of InternLM can be summarized into two steps: " diff --git a/doc/code-docs/source/parallel.rst b/doc/code-docs/source/parallel.rst index 1ad9ff63..9b162286 100644 --- a/doc/code-docs/source/parallel.rst +++ b/doc/code-docs/source/parallel.rst @@ -209,3 +209,78 @@ ZeRO1.5 的实现使用了分层分片的概念,通过配置值 ``parallel.zer .. autoclass:: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer :members: + +2D-Attention +----------------- +2D-Attention是InternEvo系统扩展ISP的序列化并行方案,集成了Ring-Attention和ISP,能够支持更长的序列。 +ISP由于需要在attention前后分别进行All2All通信,在 ``sequence parallel`` 和 ``head parallel`` 之间进行切换, +因此 ``sp size`` 自然受到 ``head number`` 的限制,无法进行扩展;而Ring-Attention由于在attention计算过程中需要进行P2P通信,可能会遇到通信低效的问题。 + +2D-Attention将ISP和Ring-Attention相结合,组成一个混合序列并行,能够解除 ``sp size`` 小于等于 ``head number`` 的限制,同时避免P2P低效带宽利用。 + +在2D-Attention中, ``sp size = hp size * cp size`` 。其中, ``hp size`` 为 ``head parallel size`` , ``cp size`` 为 ``context parallel size`` (Ring-Attention)。 +下图展示了 ``hp=2`` , ``cp=4`` 的例子。 + +.. figure:: ../../imgs/2d-attn.PNG + :scale: 80% + :class: with-border + + 2D-Attention + +在上图中,不同颜色表示不同的head,在做第一个All2All之前,GPU0~3拥有两个head的前4个token; +GPU4~7拥有两个head的后4个token。在第一个All2All之后,GPU0~3拥有第一个head的所有token,且将第一个head的所有token切成4份,做Ring-Attention,GPU4~7同理;在第2个All2All之后,所有GPU又回到初始状态。 + +InternEvo针对2D-Attention做了一些更进一步的优化: + +- 1. 由于因果模型的限制,在Ring-Attention中会导致每个GPU的计算负载不均衡,因此InternEvo参考了 `zigzag `_ ,在2D-Attention中的 ``context parallel`` 使用了zigzag模式 +- 2. 为了充分利用集群的网卡资源,提高通信效率,2D-Attention在做 ``context parallel`` 的时候,引入了一个 ``window size`` 概念,即为Double-Ring Attention。下图展示了 ``cp=8`` , ``window_size=4`` 的例子。GPU 0~3和GPU 4~7内部分别做inner Ring Attention,进行节点内P2P通信。GPU 0和4做Outer Ring Attention,进行节点间P2P通信,网卡利用示意图如下图所示。 + +.. figure:: ../../imgs/double-ring.PNG + :scale: 80% + :class: with-border + + Double-Ring-Attention + +- 3. 由于2D-Attention中同时涉及到 ``head parallel`` 和 ``context parallel`` ,因此InternEvo提供了可配置选项,用于控制 ``head parallel`` 和 ``context parallel`` 创建通信组的优先级 +- 4. 为了充分利用网卡资源,需要特别注意创建 ``context parallel`` 通信组。当 ``head parallel`` 优先创建通信组, ``context parallel`` 的GPU天然就是interleaved,这时天然能够利用网卡资源;当 ``context parallel`` 优先创建通信组时,这些 ``context parallel`` 被分配到的GPU往往是连续的,为了提高通信效率,InternEvo提供了interleaved配置选项,可以在 ``window size > 1`` 的情况,重排 ``context parallel`` 的GPU。 + +下图展示了一个Double-Ring-Attention充分利用网卡资源的示例。 + +.. figure:: ../../imgs/nic.PNG + :scale: 80% + :class: with-border + + Communication in Double-Ring-Attention + +InternEvo在parallel config里面添加了sequence_2D用于配置2D-Attention。 + +.. 
code-block:: python
+
+    parallel = dict(
+        zero1=dict(size=-1),
+        tensor=dict(size=2, mode="isp"),
+        pipeline=dict(size=1, interleaved_overlap=True),
+        weight=dict(size=4, overlap=True, memory_pool=False),
+        sequence_2D=dict(
+            enable=False,
+            head_size=2,
+            context_size=4,
+            window_size=1,
+            device_placement_strategy=dict(head_first=True, interleaved=False),
+        ),
+    )
+
+
+``sequence_2D.enable`` 字段表示是否启用2D-Attention
+
+``sequence_2D.head_size`` 字段表示head parallel size
+
+``sequence_2D.context_size`` 字段表示context parallel size
+
+``sequence_2D.window_size`` 字段表示Double-Ring Attention中的window_size
+
+``sequence_2D.device_placement_strategy.head_first`` 字段表示是否优先分配head parallel通信组,若为False,则为context-first
+
+``sequence_2D.device_placement_strategy.interleaved`` 字段表示是否对context parallel的GPU重排,该字段在 ``sequence_2D.device_placement_strategy.head_first=False`` 和 ``sequence_2D.window_size>1`` 时,推荐设置为 ``True``
+
+关于 2D-Attention更多的设计思路和性能评测,请参考论文 `LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism `_
diff --git a/doc/imgs/2d-attn.PNG b/doc/imgs/2d-attn.PNG
new file mode 100644
index 00000000..e8397464
Binary files a/doc/imgs/2d-attn.PNG and b/doc/imgs/2d-attn.PNG differ
diff --git a/doc/imgs/double-ring.PNG b/doc/imgs/double-ring.PNG
new file mode 100644
index 00000000..4855d8ef
Binary files a/doc/imgs/double-ring.PNG and b/doc/imgs/double-ring.PNG differ
diff --git a/doc/imgs/nic.PNG b/doc/imgs/nic.PNG
new file mode 100644
index 00000000..5c9209b3
Binary files a/doc/imgs/nic.PNG and b/doc/imgs/nic.PNG differ
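To make the zigzag optimization described in parallel.rst concrete: in the commonly used zigzag layout (as in the implementation referenced above), the sequence is cut into ``2 * cp`` chunks and rank ``i`` receives chunk ``i`` together with its mirror chunk ``2*cp - 1 - i``, pairing a cheap early chunk with an expensive late chunk under causal masking. The helper below sketches that pairing rule; it reflects our reading of the referenced scheme, not InternEvo's exact code.

.. code-block:: python

    def zigzag_chunks(cp_size: int) -> dict:
        """Map each context-parallel rank to its two sequence chunks."""
        n_chunks = 2 * cp_size
        return {rank: (rank, n_chunks - 1 - rank) for rank in range(cp_size)}

    # With cp=4 the 8 chunks pair up as
    #   {0: (0, 7), 1: (1, 6), 2: (2, 5), 3: (3, 4)},
    # so every rank attends over a comparable number of key/value blocks.
    print(zigzag_chunks(4))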