Conflict between precision and plugins arguments in Trainer #8949

Closed
adamlin120 opened this issue Apr 17, 2024 · 3 comments
Assignees: akoumpa
Labels: bug (Something isn't working)

adamlin120 commented Apr 17, 2024

Describe the bug

When attempting to train a model with the NeMo 24.03 container, a ValueError is raised indicating a conflict between the precision argument and the plugins argument passed to the Trainer: both precision=bf16-mixed and a PipelineMixedPrecisionPlugin instance are received, and the two cannot be used together.

Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py", line 167, in main
    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 134, in __init__
    self._check_config_and_set_final_flags(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 271, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: Received both `precision=bf16-mixed` and `plugins=<nemo.collections.nlp.parts.nlp_overrides.PipelineMixedPrecisionPlugin object at 0x1551053e0d90>`. Choose one.
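
The check itself lives in PyTorch Lightning's accelerator connector rather than in NeMo: any Trainer that is handed both an explicit precision flag and a precision-plugin instance raises this error. Below is a minimal sketch outside NeMo, assuming pytorch_lightning >= 2.1, where the built-in mixed-precision plugin is named MixedPrecision (older 2.x releases call it MixedPrecisionPlugin); the PipelineMixedPrecisionPlugin in the traceback is a precision plugin of the same kind.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.precision import MixedPrecision

    # cfg.trainer supplies precision=bf16-mixed, while the NeMo trainer builder has
    # already wrapped the same setting in a precision plugin; passing both trips the
    # connector check shown in the traceback above.
    trainer = Trainer(
        precision="bf16-mixed",
        plugins=[MixedPrecision(precision="bf16-mixed", device="cuda")],
    )
    # ValueError: Received both `precision=bf16-mixed` and `plugins=...`. Choose one.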

Steps/Code to reproduce bug

  1. Run megatron_gpt_continue_training.py on Mixtral 8x22B with the NeMo 24.03 container image.

Expected behavior

  • This issue appears related to Issue #8848, where a similar error occurred when converting Mistral/Mixtral models to the NeMo format.
  • PR #8908 partially addressed this by removing the precision argument from the Trainer call, which became necessary after a PyTorch Lightning (PTL) update; however, the change was not applied to this training script (a sketch of the workaround follows below).
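
A hedged sketch of what that change could look like in megatron_gpt_continue_training.py, assuming the usual NeMo script variables (cfg, plugins, strategy, callbacks); the exact fix in PR #8908 or the linked branch may differ. The idea is to drop the duplicate precision entry from cfg.trainer before building the Trainer, since the precision is already carried by the plugin, and restore it afterwards for any code that still reads cfg.trainer.precision.

    from omegaconf import open_dict
    from pytorch_lightning import Trainer

    # Precision is already encoded in the PipelineMixedPrecisionPlugin inside `plugins`,
    # so remove the redundant flag from the trainer config before instantiation.
    with open_dict(cfg.trainer):
        precision = cfg.trainer.pop("precision", None)

    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)

    # Put the value back so later code that reads cfg.trainer.precision keeps working.
    with open_dict(cfg.trainer):
        cfg.trainer.precision = precision
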
adamlin120 added the bug label Apr 17, 2024
akoumpa self-assigned this Apr 26, 2024
akoumpa (Member) commented Apr 26, 2024

@adamlin120 thanks for reporting the issue.

Can you please try main...akoumparouli/update_megatron_gpt_cont_training and let me know if that fixes the issue for you? thanks.

github-actions bot (Contributor) commented May 27, 2024
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label May 27, 2024
github-actions bot commented Jun 3, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Jun 3, 2024
akoumpa removed the stale label Jul 18, 2024