Conflict between precision and plugins arguments in Trainer #8949

Closed
adamlin120 opened this issue Apr 17, 2024 · 3 comments
Assignees: akoumpa
Labels: bug (Something isn't working)

adamlin120 commented Apr 17, 2024

Describe the bug

When attempting to train a model with the NeMo 24.03 container, a ValueError is raised indicating a conflict between the precision argument and the plugins argument passed to the Trainer: both precision=bf16-mixed and a PipelineMixedPrecisionPlugin instance are received, and the two cannot be used together.

Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py", line 167, in main
    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 134, in __init__
    self._check_config_and_set_final_flags(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 271, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: Received both `precision=bf16-mixed` and `plugins=<nemo.collections.nlp.parts.nlp_overrides.PipelineMixedPrecisionPlugin object at 0x1551053e0d90>`. Choose one.
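
The check itself lives in PyTorch Lightning's accelerator connector rather than in NeMo: any Trainer that is handed both an explicit precision flag and a precision-plugin instance raises this error. Below is a minimal sketch outside NeMo, assuming pytorch_lightning >= 2.1, where the built-in mixed-precision plugin is named MixedPrecision (older 2.x releases call it MixedPrecisionPlugin); the PipelineMixedPrecisionPlugin in the traceback is a precision plugin of the same kind.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.precision import MixedPrecision

    # cfg.trainer supplies precision=bf16-mixed, while the NeMo trainer builder has
    # already wrapped the same setting in a precision plugin; passing both trips the
    # connector check shown in the traceback above.
    trainer = Trainer(
        precision="bf16-mixed",
        plugins=[MixedPrecision(precision="bf16-mixed", device="cuda")],
    )
    # ValueError: Received both `precision=bf16-mixed` and `plugins=...`. Choose one.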

Steps/Code to reproduce bug

  1. Run megatron_gpt_continue_training.py on Mixtral 8x22B with the NeMo 24.03 container image.

Expected behavior

  • This issue appears related to Issue #8848, where a similar error occurred when converting Mistral/Mixtral models to the NeMo format.
  • PR #8908 partially addressed this by removing the precision argument from the Trainer call, which became necessary after a PyTorch Lightning (PTL) update; however, the change was not applied to this training script (a sketch of the workaround follows below).
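
A hedged sketch of what that change could look like in megatron_gpt_continue_training.py, assuming the usual NeMo script variables (cfg, plugins, strategy, callbacks); the exact fix in PR #8908 or the linked branch may differ. The idea is to drop the duplicate precision entry from cfg.trainer before building the Trainer, since the precision is already carried by the plugin, and restore it afterwards for any code that still reads cfg.trainer.precision.

    from omegaconf import open_dict
    from pytorch_lightning import Trainer

    # Precision is already encoded in the PipelineMixedPrecisionPlugin inside `plugins`,
    # so remove the redundant flag from the trainer config before instantiation.
    with open_dict(cfg.trainer):
        precision = cfg.trainer.pop("precision", None)

    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)

    # Put the value back so later code that reads cfg.trainer.precision keeps working.
    with open_dict(cfg.trainer):
        cfg.trainer.precision = precision
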
adamlin120 added the bug label Apr 17, 2024
akoumpa self-assigned this Apr 26, 2024
akoumpa (Member) commented Apr 26, 2024

@adamlin120 thanks for reporting the issue.

Can you please try main...akoumparouli/update_megatron_gpt_cont_training and let me know if that fixes the issue for you? thanks.

github-actions bot (Contributor) commented May 27, 2024
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label May 27, 2024
github-actions bot commented Jun 3, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Jun 3, 2024
akoumpa removed the stale label Jul 18, 2024