Use GPTModel from mcore (NVIDIA#7093)
Squashed commit history (per-commit Signed-off-by trailers are consolidated at the end of this message):

* start adding gpt from megatron core path
* set model parallel config
* use model parallel config object
* update args
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* set vp size to none if it is 1
* set vp size to none if it is 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add TransformerConfig
* start updating to TransformerConfig
* add todo
* revert to model parallel config
* add hidden_size to model_parallel_config
* remove imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* small clean up
* update hidden size in peft base model, add mcore commit to jenkins
* update module args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add config obj to flash attention tests
* remove args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove sequence parallel arg
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update args
* add config to self
* update args
* update args
* update args
* add config to test
* get hidden_size from config
* add try except
* use default
* update config with hidden size
* remove arg
* comment out jenkins test
* revert import
* remove optimizer_idx
* prefetch num microbatches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* start adding gpt from megatron core path
* set model parallel config
* use model parallel config object
* update args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* start updating to TransformerConfig
* revert to model parallel config
* add hidden_size to model_parallel_config
* remove imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update module args
* add config to self
* build transformer config
* add model to provider func
* update forward and float16 wrapper
* instantiate model parallel config after init model parallel
* set virtual rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Add GQA config to megatron gpt model (NVIDIA#7096)
  * Add GQA config in gpt config file
  * Verify mcore is enabled when using GQA
* revert
* remove import
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update for dist adam
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* use get_gpt_module_list
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update megatron core commit
* revert change
* remove import
* remove import
* remove import
---------

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: eharper <eharper@nvidia.com>
Signed-off-by: jasonwan <jasonwan@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jason Wang <jasonwan@nvidia.com>
3 people authored Aug 14, 2023
1 parent b9e9362 commit af21e99
Showing 6 changed files with 373 additions and 165 deletions.
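Before the per-file diffs: the heart of this change routes GPT construction through megatron.core when model.mcore_gpt is enabled, building a TransformerConfig and handing it to the core GPTModel inside the model provider function. The following is only a rough sketch of what that construction looks like; argument names follow the megatron.core API at the Megatron-LM commit pinned in the Jenkinsfile below, the numeric values are purely illustrative, and none of it is the literal NeMo code added by this commit.

from megatron.core.models.gpt import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Core transformer hyperparameters (illustrative values); TransformerConfig also
# carries the ModelParallelConfig fields discussed further down.
transformer_config = TransformerConfig(
    num_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    num_query_groups=4,           # grouped-query attention; None means regular multi-head attention
    use_cpu_initialization=True,  # keep the sketch off the GPU
)

def model_provider(pre_process: bool = True, post_process: bool = True) -> GPTModel:
    # Requires torch.distributed and megatron.core.parallel_state to be initialized first.
    return GPTModel(
        config=transformer_config,
        vocab_size=50257,
        max_sequence_length=2048,
        pre_process=pre_process,    # this pipeline stage owns the embedding
        post_process=post_process,  # this pipeline stage owns the output layer
        share_embeddings_and_output_weights=True,
    )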
Jenkinsfile (4 changes: 2 additions & 2 deletions)
@@ -59,10 +59,10 @@ pipeline {

stage('Megatron Core installation') {
steps {
-// commit points to core_transformer merge
+// commit has api fix for TE
sh 'git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
-git checkout 3316e811cc5335ee24c2d203416d864edcf2f7a8 && \
+git checkout 0609f27fe8376f17ab65c001d3d8f35cd8175950 && \
pip install -e .'
}
}
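With Megatron-LM installed at the pinned commit, megatron.core becomes importable in CI. NeMo treats it as an optional dependency and wraps the core imports in a try/except (the "add try except" commit above); a minimal sketch of that guard, with the flag name taken as an assumed convention:

try:
    from megatron.core import ModelParallelConfig, parallel_state

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    # megatron.core is optional; fall back to the legacy NeMo path when it is absent.
    ModelParallelConfig = None
    HAVE_MEGATRON_CORE = False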
examples/nlp/language_modeling/conf/megatron_gpt_config.yaml (4 changes: 4 additions & 0 deletions)
@@ -44,6 +44,9 @@ exp_manager:
model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}

model:
+# use GPTModel from megatron.core
+mcore_gpt: False

# specify micro_batch_size, global_batch_size, and model parallelism
# gradient accumulation will be done automatically based on data_parallel_size
micro_batch_size: 4 # limited by GPU memory
@@ -87,6 +90,7 @@ model:
overlap_p2p_comm: False # Overlap p2p communication with computes. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1
batch_p2p_comm: True # Batch consecutive inter-peer send/recv operations. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1
seq_len_interpolation_factor: null # RoPE Interpolation factor for sequence length. This is used to build long-context models with RoPE ex: https://arxiv.org/abs/2306.15595.
+num_query_groups: null # Number of query groups for group query attention. If None, normal attention is used.

tokenizer:
library: 'megatron'
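Both new keys can be flipped in the YAML itself or programmatically before training. A small sketch using OmegaConf, where the override values are illustrative only (per NVIDIA#7096, group-query attention requires mcore_gpt to be enabled):

from omegaconf import OmegaConf

# Load the stock config from a NeMo checkout and opt into the megatron.core path.
cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_config.yaml")
cfg.model.mcore_gpt = True       # build GPT with megatron.core's GPTModel
cfg.model.num_query_groups = 8   # illustrative GQA setting; must divide num_attention_heads
print(OmegaConf.to_yaml(cfg.model, resolve=False))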
(additional changed file; file path not captured in this view)
@@ -131,7 +131,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):

if vp_size is not None:
if vp_size == 1:
-self.cfg.virtual_pipeline_model_parallel_size = None
+vp_size = None
else:
assert (
self.cfg.num_layers // self.cfg.pipeline_model_parallel_size
@@ -141,14 +141,14 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
world_size=init_world_size,
global_rank=init_global_rank,
local_rank=init_local_rank,
-tensor_model_parallel_size=self.cfg.get('tensor_model_parallel_size', 1),
-pipeline_model_parallel_size=self.cfg.get('pipeline_model_parallel_size', 1),
-virtual_pipeline_model_parallel_size=self.cfg.get('virtual_pipeline_model_parallel_size', None),
-pipeline_model_parallel_split_rank=self.cfg.get('pipeline_model_parallel_split_rank', 0),
-micro_batch_size=self.cfg.get('micro_batch_size'),
-global_batch_size=self.cfg.get('global_batch_size'),
-rampup_batch_size=self.cfg.get('rampup_batch_size'),
-use_fp8=self.cfg.get('fp8', False),
+tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
+pipeline_model_parallel_size=cfg.get('pipeline_model_parallel_size', 1),
+virtual_pipeline_model_parallel_size=vp_size,
+pipeline_model_parallel_split_rank=cfg.get('pipeline_model_parallel_split_rank', 0),
+micro_batch_size=cfg.get('micro_batch_size'),
+global_batch_size=cfg.get('global_batch_size'),
+rampup_batch_size=cfg.get('rampup_batch_size'),
+use_fp8=cfg.get('fp8', False),
init_mpi_proc_group=cfg.get('ub_tp_comm_overlap', False),
seed=self.cfg.get('seed', 1234),
apex_transformer_log_level=self.cfg.get('apex_transformer_log_level', 30),
@@ -157,6 +157,9 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
# This must be called after initialize model parallel since it needs to know the data parallel size
self._validate_and_override_config()

+# set the megatron core model parallel config
+self.model_parallel_config: ModelParallelConfig = self.build_model_parallel_config()

self.grad_clip_pl_default = False # use pytorch default for gradient clipping. Default False

if hasattr(self._cfg, "tokenizer") or (
@@ -643,8 +646,13 @@ def _get_total_params_across_model_parallel_groups_gpt_bert(self, model):
and parallel_state.is_pipeline_last_stage(ignore_virtual=True)
and self.cfg.get('share_embeddings_and_output_weights', True)
):
+word_embeddings_weight = (
+    model[-1].module.shared_embedding_or_output_weight()
+    if getattr(self, 'mcore_gpt', False)
+    else model[-1].word_embeddings_weight()
+)
# substract the embedding weights on the last virtual stage
-num_word_embedding_parameters = sum([p.nelement() for p in model[-1].word_embeddings_weight()])
+num_word_embedding_parameters = sum([p.nelement() for p in word_embeddings_weight])
num_parameters_on_device -= num_word_embedding_parameters
else:
num_parameters_on_device = sum([p.nelement() for p in model.parameters()])
@@ -653,8 +661,13 @@ def _get_total_params_across_model_parallel_groups_gpt_bert(self, model):
and parallel_state.is_pipeline_last_stage(ignore_virtual=True)
and self.cfg.get('share_embeddings_and_output_weights', True)
):
+word_embeddings_weight = (
+    model.module.shared_embedding_or_output_weight()
+    if getattr(self, 'mcore_gpt', False)
+    else model.word_embeddings_weight()
+)
# substract the embedding weights on the last stage
-num_word_embedding_parameters = sum([p.nelement() for p in model.word_embeddings_weight()])
+num_word_embedding_parameters = sum([p.nelement() for p in word_embeddings_weight])
num_parameters_on_device -= num_word_embedding_parameters

# to be summed across data parallel group
@@ -714,7 +727,7 @@ def _get_total_params_across_model_parallel_groups_enc_dec(self, model):
torch.distributed.all_reduce(total_num_parameters, group=parallel_state.get_model_parallel_group())
return num_parameters_on_device, total_num_parameters

-def build_model_parallel_config(self):
+def build_model_parallel_config(self) -> ModelParallelConfig:
""" For attributes in the nemo model config that are the same as the
megatron core ModelParallelConfig we will use the value from the nemo config.
For attributes in ModelParallelConfig that are not in the nemo model config, we add custom logic.
@@ -742,7 +755,7 @@ def build_model_parallel_config(self):
"fp16": False, # NeMo does not currently support fp16 training with megatron amp O2
"bf16": precision == 'bf16' and megatron_amp_O2,
"params_dtype": params_dtype,
"timers": None, # NeMo dues not currently support megatron core timers
"timers": None, # NeMo does not currently support megatron core timers
"async_tensor_model_parallel_allreduce": self.cfg.get('tensor_model_parallel_world_size', 1) > 1
and not self.cfg.get('sequence_parallel', False),
"pipeline_dtype": pipeline_dtype,
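The docstring shown above spells out the mapping rule: any ModelParallelConfig field that has a same-named key in the NeMo model config takes its value from the NeMo config, and the remaining fields are filled in with explicit logic (precision flags, timers, allreduce behaviour, and so on). A condensed sketch of that idea follows; it is not the actual NeMo implementation, and the custom-field choices are assumptions based on the hunks above.

from dataclasses import fields

import torch
from megatron.core import ModelParallelConfig
from omegaconf import DictConfig


def build_model_parallel_config_sketch(cfg: DictConfig, params_dtype: torch.dtype) -> ModelParallelConfig:
    # 1) Copy every ModelParallelConfig field that also appears by name in the NeMo model config.
    kwargs = {f.name: cfg[f.name] for f in fields(ModelParallelConfig) if f.name in cfg}
    # 2) Fill fields that need custom values rather than a straight copy.
    kwargs.update(
        {
            "params_dtype": params_dtype,
            "timers": None,  # megatron core timers are not wired into NeMo
            "async_tensor_model_parallel_allreduce": cfg.get("tensor_model_parallel_size", 1) > 1
            and not cfg.get("sequence_parallel", False),
        }
    )
    return ModelParallelConfig(**kwargs)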
(diffs for the remaining changed files are not shown in this view)
