diff --git a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po b/doc/code-docs/locales/en/LC_MESSAGES/parallel.po index b948e4f9..ab842d15 100644 --- a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po +++ b/doc/code-docs/locales/en/LC_MESSAGES/parallel.po @@ -7,7 +7,7 @@ msgid "" msgstr "" "Project-Id-Version: InternLM \n" "Report-Msgid-Bugs-To: \n" -"POT-Creation-Date: 2024-08-30 15:51+0800\n" +"POT-Creation-Date: 2024-10-08 17:17+0800\n" "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" "Last-Translator: FULL NAME \n" "Language: en\n" @@ -16,7 +16,7 @@ msgstr "" "MIME-Version: 1.0\n" "Content-Type: text/plain; charset=utf-8\n" "Content-Transfer-Encoding: 8bit\n" -"Generated-By: Babel 2.15.0\n" +"Generated-By: Babel 2.12.1\n" #: ../../source/parallel.rst:2 msgid "并行模式与原理" @@ -274,15 +274,15 @@ msgstr "" " sequence parallelism and tensor parallelism, performing an ``all-" "gather`` operation along the seqlen dimension of activation values. After" " this communication is completed, the shape of the activation values " -"becomes the full ``[seqlen, hidden_size]`` , and then it enters the scope " -"of the tensor parallelism module. The communication of ``ḡ`` is situated " -"at the junction of tensor parallelism and sequence parallelism, requiring" -" the transformation of the ``all-reduce`` communication operation from " -"MTP into a ``reduce-scatter`` operation to achieve the split along the " -"seqlen dimension. This results in the activation values having a shape of" -" ``[seqlen/tp, hidden_size]`` , enabling a smooth transition into the " -"sequence parallelism phase. The same principles apply during the backward" -" pass." +"becomes the full ``[seqlen, hidden_size]`` , and then it enters the scope" +" of the tensor parallelism module. The communication of ``ḡ`` is situated" +" at the junction of tensor parallelism and sequence parallelism, " +"requiring the transformation of the ``all-reduce`` communication " +"operation from MTP into a ``reduce-scatter`` operation to achieve the " +"split along the seqlen dimension. This results in the activation values " +"having a shape of ``[seqlen/tp, hidden_size]`` , enabling a smooth " +"transition into the sequence parallelism phase. The same principles apply" +" during the backward pass." #: ../../source/parallel.rst:85 msgid "FSP" @@ -448,7 +448,13 @@ msgid "" ":class:`NonPipelineSchedule`." msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad of msgid "参数" msgstr "Parameter" @@ -532,7 +538,10 @@ msgstr "" msgid "If False, the output and label won't be returned." 
msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of msgid "返回" msgstr "" @@ -559,7 +568,10 @@ msgid "" "accumulated from all stages." msgstr "" -#: ../../source/parallel.rst +#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step +#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.belongs_to_current_rank +#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of msgid "返回类型" msgstr "Return type" @@ -745,6 +757,157 @@ msgstr "" msgid "Whether the gradient is success updated, and the gradient." msgstr "" +#: ../../source/parallel.rst:214 ../../source/parallel.rst:228 +msgid "2D-Attention" +msgstr "2D-Attention" + +#: ../../source/parallel.rst:215 +msgid "" +"2D-Attention是InternEvo系统扩展ISP的序列化并行方案,集成了Ring-Attention和ISP,能够支持更长的序列。 " +"ISP由于需要在attention前后分别进行All2All通信,在 ``sequence parallel`` 和 ``head " +"parallel`` 之间进行切换, 因此 ``sp size`` 自然受到 ``head number`` 的限制,无法进行扩展;而Ring-" +"Attention由于在attention计算过程中需要进行P2P通信,可能会遇到通信低效的问题。" +msgstr "" +"2D-Attention is the sequence parallelism within InternEvo, integrating both the Ring-Attention and ISP. " +" The sequence parallel size in ISP is constrained by the head number, limiting the scalability. " +" Meanwhile, the Ring-Attention may lead to suboptimal performance due to the inefficient P2P communication." + +#: ../../source/parallel.rst:219 +msgid "" +"2D-Attention将ISP和Ring-Attention相结合,组成一个混合序列并行,能够解除 ``sp size`` 小于等于 " +"``head number`` 的限制,同时避免P2P低效带宽利用。" +msgstr "" +"2D-Attention, by integrating ISP and Ring-Attention, not only overcomes the limitation that the sequence parallel size " +"must not exceed the head number, but also enhances the P2P communication efficiency." + +#: ../../source/parallel.rst:221 +msgid "" +"在2D-Attention中, ``sp size = hp size * cp size`` 。其中, ``hp size`` 为 ``head" +" parallel size`` , ``cp size`` 为 ``context parallel size`` (Ring-" +"Attention)。 下图展示了 ``hp=2`` , ``cp=4`` 的例子。" +msgstr "" +"In 2D-Attention, ``sp size = hp size * cp size``, where ``hp size`` represents the head parallel size, " +"``cp size`` denotes the context parallel size. The following figure shows an example where hp=2, cp=4." + +#: ../../source/parallel.rst:230 +msgid "" +"在上图中,不同颜色表示不同的head,在做第一个All2All之前,GPU0~3拥有两个head的前4个token; " +"GPU4~7拥有两个head的后4个token。在第一个All2All之后,GPU0~3拥有第一个head的所有token,且将第一个head的所有token切成4份" +",做Ring-Attention,GPU4~7同理;在第2个All2All之后,所有GPU又回到初始状态。" +msgstr "" +"In the above figure, different color represents different head. Before conducting the first All2All, GPU 0~3 process the first 4 tokens for two heads, " +"while GPU 4~7 hold the last 4 tokens of the same two heads. After the first All2All, GPU 0~3 receive all tokens for the first head " +"and divide these tokens into 4 segments to perform Ring-Attention. GPU 4~7 follow a similar process. " +"All GPUs return the initial states after the second All2All." + +#: ../../source/parallel.rst:233 +msgid "InternEvo针对2D-Attention做了一些更进一步的优化:" +msgstr "" +"InternEvo implements several optimizations for 2D-Attention to achieve additional performance enhancements." 
+
+#: ../../source/parallel.rst:235
+msgid ""
+"由于因果模型的限制,在Ring-Attention中会导致每个GPU的计算负载不均衡,因此InternEvo参考了 `zigzag "
+"`_ ,在2D-"
+"Attention中的 ``context parallel`` 使用了zigzag模式"
+msgstr ""
+"Because of the causal model, Ring-Attention leaves each GPU with an "
+"uneven computation load. Following `zigzag `_, InternEvo therefore uses "
+"the zigzag pattern for ``context parallel`` in 2D-Attention."
+
+#: ../../source/parallel.rst:236
+msgid ""
+"为了充分利用集群的网卡资源,提高通信效率,2D-Attention在做 ``context parallel`` 的时候,引入了一个 "
+"``window size`` 概念,即为Double-Ring Attention。下图展示了 ``cp=8`` , "
+"``window_size=4`` 的例子。GPU 0~3和GPU 4~7内部分别做inner Ring "
+"Attention,进行节点内P2P通信。GPU 0和4做Outer Ring "
+"Attention,进行节点间P2P通信,网卡利用示意图如下图所示。"
+msgstr ""
+"To make full use of the cluster's NIC resources and improve "
+"communication efficiency, 2D-Attention introduces a ``window size`` "
+"concept into ``context parallel``, known as Double-Ring Attention. The "
+"following figure shows an example with ``cp=8`` and ``window_size=4``. "
+"GPUs 0~3 and GPUs 4~7 each perform an inner Ring Attention with "
+"intra-node P2P communication, while GPU 0 and GPU 4 perform the outer "
+"Ring Attention with inter-node P2P communication, as illustrated in the "
+"NIC utilization figure below."
+
+#: ../../source/parallel.rst:242
+msgid "Double-Ring-Attention"
+msgstr "Double-Ring-Attention"
+
+#: ../../source/parallel.rst:244
+msgid ""
+"由于2D-Attention中同时涉及到 ``head parallel`` 和 ``context parallel`` "
+",因此InternEvo提供了可配置选项,用于控制 ``head parallel`` 和 ``context parallel`` "
+"创建通信组的优先级"
+msgstr ""
+"Since 2D-Attention involves both ``head parallel`` and ``context "
+"parallel``, InternEvo provides configuration options to control the "
+"priority with which the ``head parallel`` and ``context parallel`` "
+"communication groups are created."
+
+#: ../../source/parallel.rst:245
+msgid ""
+"为了充分利用网卡资源,需要特别注意创建 ``context parallel`` 通信组。当 ``head parallel`` 优先创建通信组,"
+" ``context parallel`` 的GPU天然就是interleaved,这时天然能够利用网卡资源;当 ``context "
+"parallel`` 优先创建通信组时,这些 ``context parallel`` "
+"被分配到的GPU往往是连续的,为了提高通信效率,InternEvo提供了interleaved配置选项,可以在 ``window size > "
+"1`` 的情况,重排 ``context parallel`` 的GPU。"
+msgstr ""
+"Particular care is needed when creating the ``context parallel`` "
+"communication groups in order to make full use of NIC resources. When "
+"``head parallel`` groups are created first, the GPUs assigned to "
+"``context parallel`` are naturally interleaved, which readily exploits "
+"the NICs. Conversely, when ``context parallel`` groups are created "
+"first, the GPUs assigned to them tend to be consecutive. To improve "
+"communication efficiency in that case, InternEvo provides an interleaved "
+"configuration option that reorders the ``context parallel`` GPUs when "
+"``window size > 1``."
+
+#: ../../source/parallel.rst:247
+msgid "下图展示了一个Double-Ring-Attention充分利用网卡资源的示例。"
+msgstr ""
+"The following figure shows an example of how Double-Ring-Attention makes "
+"full use of NIC resources."
+
+#: ../../source/parallel.rst:253
+msgid "Communication in Double-Ring-Attention"
+msgstr "Communication in Double-Ring-Attention"
+
+#: ../../source/parallel.rst:255
+msgid "InternEvo在parallel config里面添加了sequence_2D用于配置2D-Attention。"
+msgstr ""
+"InternEvo adds a ``sequence_2D`` item to the parallel config for "
+"configuring 2D-Attention."
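The placement rules just described can be pictured with a toy sketch over 8 GPUs (``hp=2``, ``cp=4``). The grouping logic below is an assumption made purely for illustration; it is not InternEvo's actual group-creation code.

.. code-block:: python

    # Toy illustration of head-first vs. context-first device placement.
    HP, CP = 2, 4
    ranks = list(range(HP * CP))

    def build_groups(head_first: bool):
        """Return (head-parallel groups, context-parallel groups)."""
        if head_first:
            # Head parallel groups take consecutive ranks, so the context
            # parallel groups come out interleaved (stride hp), which
            # spreads their traffic across the NICs.
            hp_groups = [ranks[i:i + HP] for i in range(0, len(ranks), HP)]
            cp_groups = [ranks[i::HP] for i in range(HP)]
        else:
            # Context parallel groups take consecutive ranks; this is the
            # case where the interleaved option (with window size > 1)
            # can reorder them.
            cp_groups = [ranks[i:i + CP] for i in range(0, len(ranks), CP)]
            hp_groups = [ranks[i::CP] for i in range(CP)]
        return hp_groups, cp_groups

    for head_first in (True, False):
        hp_groups, cp_groups = build_groups(head_first)
        print(f"head_first={head_first}: hp={hp_groups} cp={cp_groups}")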
+
+#: ../../source/parallel.rst:274
+msgid "``sequence_2D.enable`` 字段表示是否启用2D-Attention"
+msgstr "``sequence_2D.enable`` indicates whether 2D-Attention is enabled"
+
+#: ../../source/parallel.rst:276
+msgid "``sequence_2D.head_size`` 字段表示head parallel size"
+msgstr "``sequence_2D.head_size`` denotes the head parallel size"
+
+#: ../../source/parallel.rst:278
+msgid "``sequence_2D.context_size`` 字段表示context parallel size"
+msgstr "``sequence_2D.context_size`` represents the context parallel size"
+
+#: ../../source/parallel.rst:280
+msgid "``sequence_2D.window_size`` 字段表示Double-Ring Attention中的window_size"
+msgstr "``sequence_2D.window_size`` indicates the window_size in Double-Ring Attention"
+
+#: ../../source/parallel.rst:282
+msgid ""
+"``sequence_2D.device_placement_strategy.head_first`` 字段表示是否优先分配head "
+"parallel通信组,若为False,则为context-first"
+msgstr ""
+"``sequence_2D.device_placement_strategy.head_first`` determines whether "
+"the head parallel communication group is created first. If set to False, "
+"context parallel is given priority instead."
+
+#: ../../source/parallel.rst:284
+msgid ""
+"``sequence_2D.device_placement_strategy.interleaved`` 字段表示是否对context "
+"parallel的GPU重排,该字段在 "
+"``sequence_2D.device_placement_strategy.head_first=False`` 和 "
+"``sequence_2D.window_size>1`` 时,推荐设置为 ``True``"
+msgstr ""
+"``sequence_2D.device_placement_strategy.interleaved`` determines whether "
+"to rearrange the GPUs for context parallel. It is recommended to set it "
+"to ``True`` when "
+"``sequence_2D.device_placement_strategy.head_first=False`` and "
+"``sequence_2D.window_size>1``."
+
+#: ../../source/parallel.rst:286
+msgid ""
+"关于 2D-Attention更多的设计思路和性能评测,请参考论文 `LoongTrain: Efficient Training of "
+"Long-Sequence LLMs with Head-Context Parallelism "
+"`_"
+msgstr ""
+"For more on the design rationale and performance evaluation of "
+"2D-Attention, please refer to the paper `LoongTrain: Efficient Training "
+"of Long-Sequence LLMs with Head-Context Parallelism `_"
+
#~ msgid "A tuple of (output, label, loss), loss and label could be None."
#~ msgstr ""

diff --git a/doc/code-docs/locales/en/LC_MESSAGES/training.po b/doc/code-docs/locales/en/LC_MESSAGES/training.po
index 7ee2c17a..25b4a492 100644
--- a/doc/code-docs/locales/en/LC_MESSAGES/training.po
+++ b/doc/code-docs/locales/en/LC_MESSAGES/training.po
@@ -7,7 +7,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2024-08-30 16:09+0800\n"
+"POT-Creation-Date: 2024-10-08 17:17+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME \n"
"Language: en\n"
@@ -16,7 +16,7 @@ msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.15.0\n"
+"Generated-By: Babel 2.12.1\n"

#: ../../source/training.rst:2
msgid "启动训练脚本"

#: ../../source/training.rst:4
msgid ""
"用户在安装了InternEvo之后,需要自行编写训练启动脚本,请参考: `train.py "
"`_"
msgstr ""
-"After installing InternEvo, users need to write their own training startup scripts. Please refer to:"
-" `train.py `_ "
+"After installing InternEvo, users need to write their own training "
+"startup scripts. Please refer to: `train.py "
+"`_ "

#: ../../source/training.rst:6
msgid ""
"脚本中的流程可以分为三步:参数解析、初始化、启动训练。其中参数解析和初始化过程的具体原理参见: `训练初始化 "
"`_"
msgstr ""
-"The process in the script can be divided into three steps: parameter parsing, initialization, and starting training."
-" For the specific principles of parameter parsing and initialization, please refer to: `Training Initialization " +"The process in the script can be divided into three steps: parameter " +"parsing, initialization, and starting training. For the specific " +"principles of parameter parsing and initialization, please refer to: " +"`Training Initialization " "`_" #: ../../source/training.rst:9 @@ -49,8 +52,11 @@ msgid "" "`_" msgstr "" -"Call the parse_args function to parse the parameters set in the configuration file when starting the training. For more details, see:" -" `Argument Parsing `_" +"Call the parse_args function to parse the parameters set in the " +"configuration file when starting the training. For more details, see: " +"`Argument Parsing " +"`_" #: ../../source/training.rst:17 msgid "初始化过程" @@ -65,8 +71,10 @@ msgid "" "调用 ``initialize_distributed_env`` 函数,支持通过 slurm 或 torch " "方式启动训练脚本,并传入配置文件、端口号、进程随机种子等信息。函数详细说明如下:" msgstr "" -"Call the initialize_distributed_env function, which supports launching the training script through Slurm or Torch," -" and pass in information such as the configuration file, port number, and process random seed. Detailed description of the function is as follows:" +"Call the initialize_distributed_env function, which supports launching " +"the training script through Slurm or Torch, and pass in information such " +"as the configuration file, port number, and process random seed. Detailed" +" description of the function is as follows:" #: ../../source/training.rst:27 msgid "初始化模型" @@ -76,7 +84,10 @@ msgstr "Initialize Model" msgid "" "详细介绍请参考: `模型初始化 `_" -msgstr "Detailed introduction refer to: `Model Initialization `_" +msgstr "" +"Detailed introduction refer to: `Model Initialization " +"`_" #: ../../source/training.rst:34 msgid "初始化训练数据加载器" @@ -86,7 +97,10 @@ msgstr "Initialize Training Dataloader" msgid "" "详细介绍请参考: `数据加载器初始化 `_" -msgstr "Detailed introduction refer to: `Dataloader Initialization `_" +msgstr "" +"Detailed introduction refer to: `Dataloader Initialization " +"`_" #: ../../source/training.rst:41 msgid "初始化验证数据加载器" @@ -94,7 +108,10 @@ msgstr "Initialize Validation Dataloader" #: ../../source/training.rst:46 msgid "初始化验证数据加载器,加载过程与训练数据加载类似,通过配置文件中的 ``VALID_FOLDER `` 字段设置验证数据集路径。" -msgstr "Initialize the validation data loader, which has a loading process similar to that of the training data. The path to the validation dataset is set through the VALID_FOLDER field in the configuration file." +msgstr "" +"Initialize the validation data loader, which has a loading process " +"similar to that of the training data. The path to the validation dataset " +"is set through the VALID_FOLDER field in the configuration file." #: ../../source/training.rst:48 msgid "初始化Trainer" @@ -106,8 +123,11 @@ msgid "" "``internlm.core.trainer.Trainer`` 管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API " "来执行模型训练、评估、梯度清零和参数更新等。" msgstr "" -"The TrainerBuilder interface inherits from the Trainer class, and the training API of InternEvo is managed by internlm.core.trainer.Trainer." -" After defining the training engine and scheduler, we can call the Trainer API to perform model training, evaluation, gradient clearing, and parameter updating, etc." +"The TrainerBuilder interface inherits from the Trainer class, and the " +"training API of InternEvo is managed by internlm.core.trainer.Trainer. " +"After defining the training engine and scheduler, we can call the Trainer" +" API to perform model training, evaluation, gradient clearing, and " +"parameter updating, etc." 
#: ../../source/training.rst:55 msgid "有关详细用法,请参阅 Trainer API 文档和示例。" @@ -115,13 +135,94 @@ msgstr "" "For detailed usage, please refer to Trainer API documentation and " "examples." +#: internlm.core.trainer.Trainer:1 of +msgid "" +"This is a class tending for easy deployments of users' training and " +"evaluation instead of writing their own scripts." +msgstr "" + +#: internlm.core.trainer.Trainer internlm.core.trainer.Trainer.execute_schedule +#: of +msgid "参数" +msgstr "" + +#: internlm.core.trainer.Trainer:4 of +msgid "Engine responsible for the process function." +msgstr "" + +#: internlm.core.trainer.Trainer:6 of +msgid "Runtime schedule. Defaults to None." +msgstr "" + +#: internlm.core.trainer.Trainer.engine:1 of +msgid "" +"Returns the engine that responsible for managing the training and " +"evaluation process." +msgstr "" + +#: internlm.core.trainer.Trainer.schedule:1 of +msgid "Returns the runtime scheduler." +msgstr "" + +#: internlm.core.trainer.Trainer.uses_pipeline:1 of +msgid "Returns whether the pipeline parallel is used or not." +msgstr "" + +#: internlm.core.trainer.Trainer.train:1 of +msgid "Sets the model to training mode." +msgstr "" + +#: internlm.core.trainer.Trainer.eval:1 of +msgid "Sets the model to evaluation mode." +msgstr "" + +#: internlm.core.trainer.Trainer.zero_grad:1 of +msgid "Sets the gradient of all parameters in the model to zero." +msgstr "" + +#: internlm.core.trainer.Trainer.step:1 of +msgid "Executes the parameter update step." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:1 of +msgid "" +"Runs the forward, loss computation, and backward for the model. Returns a" +" tuple of (output, label, loss)." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:4 of +msgid "The data iterator." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:6 of +msgid "Additional keyword arguments." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule of +msgid "返回" +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:8 of +msgid "A tuple of (output, label, loss, moe_loss)." +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule of +msgid "返回类型" +msgstr "" + +#: internlm.core.trainer.Trainer.execute_schedule:9 of +msgid "Tuple[:class:`torch.Tensor`]" +msgstr "" + #: ../../source/training.rst:61 msgid "启动训练过程" msgstr "Start Training Process" #: ../../source/training.rst:66 msgid "首先,通过 ``self.train()`` 方法,将模型设置为training状态。" -msgstr "Firstly, by using the self.train() method, the model is set to training mode." +msgstr "" +"Firstly, by using the self.train() method, the model is set to training " +"mode." #: ../../source/training.rst:68 msgid "" @@ -131,13 +232,15 @@ msgid "" "对模型训练结果进行评估。 最后,如果开启了保存ckpt功能,通过 ``try_save_checkpoint`` " "函数保留训练中间状态以及最终训练结果。" msgstr "" -"During each step of the training process, the load_new_batch function is used to load the dataset." -" Then, the execute_schedule scheduler is used to initiate training, and the forward_backward_step " -"begins the forward and backward training process. Afterwards, the self.step() updates the parameters" -" and returns the gradient values. If the step count reaches the number required for validation," -" the model's training results are evaluated using evaluate_on_val_dls. Finally, if the checkpoint" -" saving function is enabled, the intermediate training state and the final training results are" -" saved using the try_save_checkpoint function." 
+"During each step of the training process, the load_new_batch function is " +"used to load the dataset. Then, the execute_schedule scheduler is used to" +" initiate training, and the forward_backward_step begins the forward and " +"backward training process. Afterwards, the self.step() updates the " +"parameters and returns the gradient values. If the step count reaches the" +" number required for validation, the model's training results are " +"evaluated using evaluate_on_val_dls. Finally, if the checkpoint saving " +"function is enabled, the intermediate training state and the final " +"training results are saved using the try_save_checkpoint function." #~ msgid "InternLM 的训练流程可以归纳为两个步骤:" #~ msgstr "The training process of InternLM can be summarized into two steps: " diff --git a/doc/code-docs/source/parallel.rst b/doc/code-docs/source/parallel.rst index 1ad9ff63..9b162286 100644 --- a/doc/code-docs/source/parallel.rst +++ b/doc/code-docs/source/parallel.rst @@ -209,3 +209,78 @@ ZeRO1.5 的实现使用了分层分片的概念,通过配置值 ``parallel.zer .. autoclass:: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer :members: + +2D-Attention +----------------- +2D-Attention是InternEvo系统扩展ISP的序列化并行方案,集成了Ring-Attention和ISP,能够支持更长的序列。 +ISP由于需要在attention前后分别进行All2All通信,在 ``sequence parallel`` 和 ``head parallel`` 之间进行切换, +因此 ``sp size`` 自然受到 ``head number`` 的限制,无法进行扩展;而Ring-Attention由于在attention计算过程中需要进行P2P通信,可能会遇到通信低效的问题。 + +2D-Attention将ISP和Ring-Attention相结合,组成一个混合序列并行,能够解除 ``sp size`` 小于等于 ``head number`` 的限制,同时避免P2P低效带宽利用。 + +在2D-Attention中, ``sp size = hp size * cp size`` 。其中, ``hp size`` 为 ``head parallel size`` , ``cp size`` 为 ``context parallel size`` (Ring-Attention)。 +下图展示了 ``hp=2`` , ``cp=4`` 的例子。 + +.. figure:: ../../imgs/2d-attn.PNG + :scale: 80% + :class: with-border + + 2D-Attention + +在上图中,不同颜色表示不同的head,在做第一个All2All之前,GPU0~3拥有两个head的前4个token; +GPU4~7拥有两个head的后4个token。在第一个All2All之后,GPU0~3拥有第一个head的所有token,且将第一个head的所有token切成4份,做Ring-Attention,GPU4~7同理;在第2个All2All之后,所有GPU又回到初始状态。 + +InternEvo针对2D-Attention做了一些更进一步的优化: + +- 1. 由于因果模型的限制,在Ring-Attention中会导致每个GPU的计算负载不均衡,因此InternEvo参考了 `zigzag `_ ,在2D-Attention中的 ``context parallel`` 使用了zigzag模式 +- 2. 为了充分利用集群的网卡资源,提高通信效率,2D-Attention在做 ``context parallel`` 的时候,引入了一个 ``window size`` 概念,即为Double-Ring Attention。下图展示了 ``cp=8`` , ``window_size=4`` 的例子。GPU 0~3和GPU 4~7内部分别做inner Ring Attention,进行节点内P2P通信。GPU 0和4做Outer Ring Attention,进行节点间P2P通信,网卡利用示意图如下图所示。 + +.. figure:: ../../imgs/double-ring.PNG + :scale: 80% + :class: with-border + + Double-Ring-Attention + +- 3. 由于2D-Attention中同时涉及到 ``head parallel`` 和 ``context parallel`` ,因此InternEvo提供了可配置选项,用于控制 ``head parallel`` 和 ``context parallel`` 创建通信组的优先级 +- 4. 为了充分利用网卡资源,需要特别注意创建 ``context parallel`` 通信组。当 ``head parallel`` 优先创建通信组, ``context parallel`` 的GPU天然就是interleaved,这时天然能够利用网卡资源;当 ``context parallel`` 优先创建通信组时,这些 ``context parallel`` 被分配到的GPU往往是连续的,为了提高通信效率,InternEvo提供了interleaved配置选项,可以在 ``window size > 1`` 的情况,重排 ``context parallel`` 的GPU。 + +下图展示了一个Double-Ring-Attention充分利用网卡资源的示例。 + +.. figure:: ../../imgs/nic.PNG + :scale: 80% + :class: with-border + + Communication in Double-Ring-Attention + +InternEvo在parallel config里面添加了sequence_2D用于配置2D-Attention。 + +.. 
code-block:: python
+
+    parallel = dict(
+        zero1=dict(size=-1),
+        tensor=dict(size=2, mode="isp"),
+        pipeline=dict(size=1, interleaved_overlap=True),
+        weight=dict(size=4, overlap=True, memory_pool=False),
+        sequence_2D=dict(
+            enable=False,
+            head_size=2,
+            context_size=4,
+            window_size=1,
+            device_placement_strategy=dict(head_first=True, interleaved=False),
+        ),
+    )
+
+
+``sequence_2D.enable`` 字段表示是否启用2D-Attention
+
+``sequence_2D.head_size`` 字段表示head parallel size
+
+``sequence_2D.context_size`` 字段表示context parallel size
+
+``sequence_2D.window_size`` 字段表示Double-Ring Attention中的window_size
+
+``sequence_2D.device_placement_strategy.head_first`` 字段表示是否优先分配head parallel通信组,若为False,则为context-first
+
+``sequence_2D.device_placement_strategy.interleaved`` 字段表示是否对context parallel的GPU重排,该字段在 ``sequence_2D.device_placement_strategy.head_first=False`` 和 ``sequence_2D.window_size>1`` 时,推荐设置为 ``True``
+
+关于 2D-Attention更多的设计思路和性能评测,请参考论文 `LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism `_
diff --git a/doc/imgs/2d-attn.PNG b/doc/imgs/2d-attn.PNG
new file mode 100644
index 00000000..e8397464
Binary files a/doc/imgs/2d-attn.PNG and b/doc/imgs/2d-attn.PNG differ
diff --git a/doc/imgs/double-ring.PNG b/doc/imgs/double-ring.PNG
new file mode 100644
index 00000000..4855d8ef
Binary files a/doc/imgs/double-ring.PNG and b/doc/imgs/double-ring.PNG differ
diff --git a/doc/imgs/nic.PNG b/doc/imgs/nic.PNG
new file mode 100644
index 00000000..5c9209b3
Binary files a/doc/imgs/nic.PNG and b/doc/imgs/nic.PNG differ
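To make the zigzag optimization described in parallel.rst concrete: in the commonly used zigzag layout (as in the implementation referenced above), the sequence is cut into ``2 * cp`` chunks and rank ``i`` receives chunk ``i`` together with its mirror chunk ``2*cp - 1 - i``, pairing a cheap early chunk with an expensive late chunk under causal masking. The helper below sketches that pairing rule; it reflects our reading of the referenced scheme, not InternEvo's exact code.

.. code-block:: python

    def zigzag_chunks(cp_size: int) -> dict:
        """Map each context-parallel rank to its two sequence chunks."""
        n_chunks = 2 * cp_size
        return {rank: (rank, n_chunks - 1 - rank) for rank in range(cp_size)}

    # With cp=4 the 8 chunks pair up as
    #   {0: (0, 7), 1: (1, 6), 2: (2, 5), 3: (3, 4)},
    # so every rank attends over a comparable number of key/value blocks.
    print(zigzag_chunks(4))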