Problems in multi-card distributed training #3635

Closed
nowbug opened this issue Jun 18, 2024 · 5 comments · Fixed by #3904
Labels
BUG Something isn't working

Comments

@nowbug
Contributor

nowbug commented Jun 18, 2024

Distributed training blocks (hangs) during startup

Steps to Reproduce

1. Minimal code to reproduce

from otx.engine import Engine

engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)

2. I also ran the following code to rule out a problem with my environment.

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)

trainer.fit(model)

log:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Environment:

  • OS:
  • Framework version:
  • Python version: 3.10
  • OpenVINO version:
  • CUDA/cuDNN version: 12.2
  • GPU model and memory: 2 × RTX 4090 (24 GB each)
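
A note for anyone trying to reproduce this: with the `nccl` backend shown in the log, a common first step is to turn on NCCL's own logging before the trainer starts. The sketch below is only a starting point. `nccl_debug_env` and `apply_env` are hypothetical helper names, not part of OTX or Lightning. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, but nothing in this report confirms that peer-to-peer transfers are the cause, so the second one is an assumption to test rather than a fix.

```python
import os


def nccl_debug_env(disable_p2p: bool = False) -> dict[str, str]:
    """Hypothetical helper: NCCL settings often used when debugging a hang."""
    env = {
        # Ask NCCL to print its setup and transport choices for each rank.
        "NCCL_DEBUG": "INFO",
    }
    if disable_p2p:
        # Assumption to test: the hang involves direct GPU-to-GPU transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def apply_env(env: dict[str, str]) -> None:
    """Set the variables in this process so that spawned ranks inherit them."""
    os.environ.update(env)


# Call this before creating the trainer or engine.
apply_env(nccl_debug_env(disable_p2p=True))
```

If the run gets past initialization with P2P disabled, that narrows the problem to the GPU interconnect rather than the training code. If it still stalls, the extra NCCL output at least shows which step each rank reached.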
@nowbug
Contributor Author

nowbug commented Jun 18, 2024

When I run it, it gets stuck at this line: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
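
The line quoted above has the same shape as the registration lines in the earlier log, where `MEMBER: 1/2` means the first of two expected processes. When only rank 0 prints its line, the other rank never joined, and initialization waits for it. A small sketch for checking a captured log, assuming the log format stays as shown in this issue (`missing_ranks` is a hypothetical helper, not a Lightning or OTX function):

```python
import re

# Matches registration lines such as:
#   Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
_LINE = re.compile(r"GLOBAL_RANK: (\d+), MEMBER: (\d+)/(\d+)")


def missing_ranks(log_text: str) -> list[int]:
    """Return the global ranks that never printed a registration line."""
    seen: set[int] = set()
    world_size = 0
    for match in _LINE.finditer(log_text):
        rank, _member, total = (int(group) for group in match.groups())
        seen.add(rank)
        world_size = max(world_size, total)
    return [rank for rank in range(world_size) if rank not in seen]


stuck_log = "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2"
print(missing_ranks(stuck_log))  # [1]
```

An empty list means every expected rank registered, so the stall is somewhere after initialization. A non-empty list points at the processes (or nodes) that never started or could not reach rank 0.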

@harimkang
Contributor

@eunwoosh could you take a look at this issue?

@harimkang added the BUG (Something isn't working) and OTX 2.0 labels Jun 19, 2024
@harimkang
Contributor

@eunwoosh
Contributor

Hi @nowbug, thanks for reporting the issue. First of all, distributed training is not yet validated in OTX 2.0, so it can be a little unstable. That said, OTX is built on PyTorch Lightning, so I expect distributed training to work in most cases. OTX plans to support distributed training in the near future, so it should become stable soon.
I tested with your second code snippet and found a bug, as @harimkang said, so I opened a PR to fix it.
I also found that distributed training gets stuck in some cases, and I suspect the size of the dataset is the cause. I'll fix that bug once I have investigated further.
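
The dataset-size suspicion is not confirmed in this thread, but one general way it can matter in data-parallel training is uneven work per rank: if one rank has more batches than another, it eventually waits in a gradient synchronization that its peer never enters. The sketch below only illustrates that arithmetic. It assumes samples are dealt round-robin with no padding, which is not necessarily what OTX does (PyTorch's `DistributedSampler`, for example, pads or drops samples so that every rank gets the same count). `batches_per_rank` is a hypothetical helper.

```python
import math


def batches_per_rank(
    num_samples: int,
    world_size: int,
    batch_size: int,
    drop_last: bool = False,
) -> list[int]:
    """Batch count per rank for a round-robin split with no padding."""
    counts = []
    for rank in range(world_size):
        # Rank r receives samples r, r + world_size, r + 2 * world_size, ...
        shard = len(range(rank, num_samples, world_size))
        if drop_last:
            counts.append(shard // batch_size)
        else:
            counts.append(math.ceil(shard / batch_size))
    return counts


print(batches_per_rank(num_samples=9, world_size=2, batch_size=4))  # [2, 1]
```

With 9 samples on 2 ranks, rank 0 would run a second step that rank 1 never reaches. Equal counts, such as `[1, 1]` for 8 samples, rule this particular mechanism out.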

@nowbug
Contributor Author

nowbug commented Jun 20, 2024

@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.

3 participants