Problems in multi-card distributed training #3635
Comments
When I run it, training hangs after printing: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
@eunwoosh could you take a look at this issue?
I found some open issues related to this and commented on them.
Hi @nowbug, thanks for reporting the issue. First of all, I want to say that OTX 2.0 currently doesn't validate distributed training, so it can be a little unstable. Nevertheless, OTX is based on PyTorch Lightning, so I think distributed training works in most cases. OTX has a plan to support distributed training in the near future, so it should become stable soon.
@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.
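As a first step while OTX-level distributed support matures, a generic environment check can rule out missing GPUs or an unavailable NCCL backend. The snippet below is plain PyTorch, not part of OTX, and assumes a CUDA build of PyTorch:

```python
import torch
import torch.distributed as dist

# Generic PyTorch sanity check (not OTX-specific): confirm that both GPUs
# are visible and that the NCCL backend is compiled in before launching
# multi-GPU training.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:  ", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
```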
Distributed training hangs during initialization
Steps to Reproduce
1. Minimal code block (see the note after the log below):
from otx.engine import Engine
engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)
2. I tried other code to rule out an issue with my environment:
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel
ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)
trainer.fit(model)
log:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
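A note on the step 1 snippet, based on general PyTorch Lightning behaviour rather than anything confirmed for OTX's `Engine.train`: in Lightning's `Trainer`, `num_nodes` is the number of machines, while `devices` is the number of GPUs (processes) per machine. If the goal is two GPUs on a single machine, the plain-Lightning configuration would look like the sketch below; which keyword OTX forwards to the underlying Trainer is an assumption here, so this is only illustrative.

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

# Illustrative plain-Lightning sketch (not OTX's API): two GPUs on ONE node.
# `devices` = GPUs/processes per machine; `num_nodes` = number of machines
# (defaults to 1, so it is left unset here).
trainer = L.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=2,
    strategy="ddp",
)
trainer.fit(BoringModel())
```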
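If the hang persists even when the configuration looks right, NCCL's own debug output often shows which rank stalls during registration. These are standard PyTorch/NCCL environment variables; using them here is only a suggestion, not something taken from this issue:

```python
import os

# Suggestion only: enable verbose NCCL and c10d logging so a stalled rank
# registration is easier to spot. Set these before the Trainer (or the OTX
# Engine) spawns the distributed processes.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```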
Environment: