Problems in multi-card distributed training #3635
Comments
When I run it, training hangs after printing: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
@eunwoosh could you take a look at this issue?
I found some open issues related to this and commented on them.
Hi @nowbug, thanks for reporting the issue. First of all, I want to say that OTX 2.0 currently doesn't validate distributed training, so it can be a little unstable. Nevertheless, OTX is based on PyTorch Lightning, so I think distributed training works in most cases. OTX has a plan to support distributed training in the near future, so it should become stable soon.
@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.
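As a first step while OTX-level distributed support matures, a generic environment check can rule out missing GPUs or an unavailable NCCL backend. The snippet below is plain PyTorch, not part of OTX, and assumes a CUDA build of PyTorch:

```python
import torch
import torch.distributed as dist

# Generic PyTorch sanity check (not OTX-specific): confirm that both GPUs
# are visible and that the NCCL backend is compiled in before launching
# multi-GPU training.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:  ", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
```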
Distributed training hangs during initialization
Steps to Reproduce
1. Minimal code block (see the note after the log below):
from otx.engine import Engine
engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)
2. I tried other code to rule out an issue with my environment:
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel
ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)
trainer.fit(model)
log:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
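A note on the step 1 snippet, based on general PyTorch Lightning behaviour rather than anything confirmed for OTX's `Engine.train`: in Lightning's `Trainer`, `num_nodes` is the number of machines, while `devices` is the number of GPUs (processes) per machine. If the goal is two GPUs on a single machine, the plain-Lightning configuration would look like the sketch below; which keyword OTX forwards to the underlying Trainer is an assumption here, so this is only illustrative.

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

# Illustrative plain-Lightning sketch (not OTX's API): two GPUs on ONE node.
# `devices` = GPUs/processes per machine; `num_nodes` = number of machines
# (defaults to 1, so it is left unset here).
trainer = L.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=2,
    strategy="ddp",
)
trainer.fit(BoringModel())
```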
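If the hang persists even when the configuration looks right, NCCL's own debug output often shows which rank stalls during registration. These are standard PyTorch/NCCL environment variables; using them here is only a suggestion, not something taken from this issue:

```python
import os

# Suggestion only: enable verbose NCCL and c10d logging so a stalled rank
# registration is easier to spot. Set these before the Trainer (or the OTX
# Engine) spawns the distributed processes.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```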
Environment: