
feat: add LoRA adapter layer and Mixtral LoRA training #53

Open
wants to merge 10 commits into rocm_dev from feat/mixtral-lora

Conversation


@mpashkovskii mpashkovskii commented Jan 31, 2025

The PR adds:

  • LoraAdapter class to enable LoRA for models
  • Usage example scripts and documentation for Mixtral 8x7B LoRA

The focus is mostly on expert parallelism.

Out of the scope of the PR (will be added later):

  • MoE grouped GEMM
  • support for VocabParallelEmbedding and TopKRouter in LoraAdapter
  • LoRA examples for other models

@mpashkovskii mpashkovskii force-pushed the feat/mixtral-lora branch 13 times, most recently from 4b5b3fc to 31a02d4 on February 4, 2025 13:25
@mpashkovskii mpashkovskii marked this pull request as ready for review February 7, 2025 07:56

nholmber commented Feb 7, 2025

cc: @wenchenvincent

@@ -137,6 +137,7 @@ def main():

args = parser.parse_args()

mp.set_start_method('spawn')
Collaborator

Why do we need this line here?

Author

convert.py has flaky behaviour in different environments, and it is not 100% clear whether the issue lies:

  • in the script
  • in a Python/PyTorch behaviour change since the script was created
  • in environment-specific configuration

Sometimes the start method is fork, and that causes convert.py to crash. To be on the safe side, I explicitly set the process start method to spawn.
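
As a side note, here is a minimal sketch of a defensive guard around this call (my illustration only; the PR itself simply calls mp.set_start_method('spawn') in main(), and the force=True / get_start_method check is an assumption):

```python
import multiprocessing as mp

def main():
    # Stand-in for the actual convert.py logic.
    ...

if __name__ == '__main__':
    # Depending on platform and environment the default start method may be
    # 'fork', which is what triggers the crashes described above; force 'spawn'.
    if mp.get_start_method(allow_none=True) != 'spawn':
        mp.set_start_method('spawn', force=True)
    main()
```

The force=True guard is only needed if the start method may already have been set elsewhere in the process.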

Collaborator

Does it use distributed checkpointing? I recall that there was a similar issue with distributed checkpointing: #47

Collaborator

@zstreet87 Could you take a look at this change?

Author

convert.py knows nothing about the checkpoint format; that is the saver's responsibility. The torch format is used when the saver is saver_mcore.py, and saver_mcore.py is what I'm using.

The issue with the fork start method appears with the torch checkpoint format, but I think it is unrelated to the format.

"skip_bias_add": True,
}
COLUMN_PARALLEL_LAYERS = [
partial(TELinear, **LORA_LAYERS_DEFAULT_CONFIG, init_method=KAIMING_INIT_METHOD, parallel_mode=None, skip_weight_param_allocation=False),
Collaborator

Do you mean Linear instead of TELinear here?

Author

te.pytorch.Linear and especially torch.nn.Linear have quite different constructor signatures, whereas the TELinear constructor is aligned with ColumnParallelLinear, TEColumnParallelLinear, etc., and encapsulates those differences. To make the code more readable I explicitly used TELinear. But essentially, in this case, it is a thin wrapper around torch.nn.Linear.

Collaborator

TELinear is not a wrapper around torch.nn.Linear but around te.pytorch.Linear.

In Megatron-LM, there are two alternative transformer implementations: local (using PyTorch layers) and transformer-engine (using TE layers). ColumnParallelLinear uses PyTorch Linear layers and TEColumnParallelLinear uses TE Linear layers. Usually, when a model is constructed with ColumnParallelLinear, it means that TE is not available, so it is not appropriate to use TELinear here.

Given that we cannot use torch.nn.Linear directly, it seems we will also need to create a thin wrapper around it. And this triggers another question from me: can we reuse the ColumnParallelLinear wrapper for the second LoRA layer here?

Collaborator

Hmm, I actually have a further question about whether this would work for TP or not.

It seems that for a base layer like TEColumnParallelLinear, we use two LoRA layers: the first is TELinear and the second is TEColumnParallelLinear. Does that mean the first layer will not be sliced across different GPUs?

Author

Yes, indeed, I made a typo in the last sentence: TELinear wraps te.pytorch.Linear.

I implemented the Linear layer as a wrapper around ColumnParallelLinear. The main difference is that the weight output size must be non-sharded. To achieve this, I copied some code from the ColumnParallelLinear constructor.

Yes, your understanding is correct: all Linear/TELinear layers in the LoraAdapter are not sliced. This is a deliberate decision (see the sketch after this list):

  • For TP, we sacrifice some memory to gain performance. Using a different approach would introduce approximately five additional inter-GPU calls per LoraAdapter.
  • For EP+PP, which, as we observed, is the most performant training configuration for MoE models, no layers in the model are sliced.
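
A rough single-process illustration of this design (plain PyTorch, made-up shapes; not the PR's LoraAdapter code): the first LoRA matrix is kept whole on every TP rank, the second is split along its output dimension, and the forward math still composes.

```python
import torch

torch.manual_seed(0)
hidden, r, out, tp = 16, 4, 32, 2       # illustrative sizes; tp = tensor-parallel degree

x = torch.randn(8, hidden)              # same input on every rank of a TP group
lora_a = torch.randn(hidden, r)         # first LoRA layer: replicated, not sliced
lora_b = torch.randn(r, out)            # second LoRA layer: logically the full weight

reference = x @ lora_a @ lora_b         # unsharded computation

# Column-parallel view: each rank applies the full lora_a, then only its own
# column slice of lora_b; the concatenation stands in for the gather across
# the TP group that a column-parallel output layer would perform.
cols = out // tp
shards = [x @ lora_a @ lora_b[:, i * cols:(i + 1) * cols] for i in range(tp)]
assert torch.allclose(reference, torch.cat(shards, dim=-1))
```

Because the activation of the un-sliced first layer is identical on every rank, it can feed a column-parallel second layer directly without an extra scatter.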

Collaborator

Thanks for the clarification! While I think it might be okay to sacrifice some memory to gain performance, I am somewhat concerned about the functionality.

So in the case of TP, the weights of the first LoRA layer are not sliced while the weights of the second LoRA layer are sliced across the GPUs within the same TP group, and the input data is the same across those GPUs. When we pass the activation of the first LoRA layer to the second LoRA layer, how do we make sure it is sliced properly? And how do we make sure that the gradient reduction and accumulation are done properly in the backward pass? In the scheme of this PR, we are doing DP for the first LoRA layer and TP for the second LoRA layer within a TP group. The combination of the two might be error prone, so we will need tests to make sure this is implemented correctly.
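
For the backward-pass part of this question, here is a small single-process sketch (plain PyTorch, illustrative only, and no substitute for the distributed tests requested above): the gradient of a replicated first LoRA weight is the sum of the per-rank contributions, so a reduction over the TP group would be needed to keep the replicas consistent.

```python
import torch

torch.manual_seed(0)
hidden, r, out, tp = 16, 4, 32, 2
cols = out // tp
x = torch.randn(8, hidden)

# Reference: replicated first weight, full (unsharded) second weight.
a_full = torch.randn(hidden, r, requires_grad=True)
b_full = torch.randn(r, out)
(x @ a_full @ b_full).sum().backward()

# "Per-rank" view: every rank holds the same replicated first weight but only a
# column slice of the second weight. Summing the per-rank gradients of the first
# weight (what an all-reduce over the TP group would do) recovers the reference.
grad_sum = torch.zeros(hidden, r)
for i in range(tp):
    a_rank = a_full.detach().clone().requires_grad_(True)
    (x @ a_rank @ b_full[:, i * cols:(i + 1) * cols]).sum().backward()
    grad_sum += a_rank.grad

assert torch.allclose(a_full.grad, grad_sum)
```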

@gurpreet-dhami
Collaborator

@wenchenvincent: I see that this PR didn't go through CI. Do you have any idea why?
