
The PyTorchJob training is slow #1532

Closed
allendred opened this issue Feb 9, 2022 · 8 comments

@allendred

I used the repository https://github.com/Shuai-Xie/mnist-pytorchjob-example for speed testing. I have more than two k8s nodes, and I compared the training speed of the job (two replicas, four cards in total) when both replicas are placed on the same node versus on different nodes. Each node has 4 Tesla V100 (32 GB) cards. The comparison results are below.
Different nodes:

>>> Using nccl
>>> Using nccl
>>> Using CUDA
>>> Using CUDA
Train Epoch: 1 [0/15018 (0%)] loss=7.0561
Train Epoch: 1 [0/15018 (0%)] loss=7.1878
Train Epoch: 1 [1280/15018 (33%)] loss=3.3173
Train Epoch: 1 [1280/15018 (33%)] loss=3.3822
Train Epoch: 1 [2560/15018 (67%)] loss=3.3661
Train Epoch: 1 [2560/15018 (67%)] loss=3.2486

accuracy=0.0000

time >>: 148.1217818260193
----forwardtime: 4.269739866256714 ----backwardtime 99.48042893409729
testtime: 5.997300148010254

accuracy=0.0000

time >>: 148.90791296958923
----forwardtime: 7.287083387374878 ----backwardtime 98.69852066040039
testtime: 6.474621534347534

Same node:

>>> Using nccl
>>> Using nccl
>>> Using CUDA
>>> Using CUDA

Train Epoch: 1 [0/15018 (0%)] loss=7.1878
Train Epoch: 1 [0/15018 (0%)] loss=7.0561
Train Epoch: 1 [1280/15018 (33%)] loss=3.3408
Train Epoch: 1 [1280/15018 (33%)] loss=3.3714
Train Epoch: 1 [2560/15018 (67%)] loss=3.2577
Train Epoch: 1 [2560/15018 (67%)] loss=3.3267

accuracy=0.0000

time >>: 73.05248022079468
----forwardtime: 4.658957004547119 ----backwardtime 26.686575412750244
testtime: 7.037155389785767

accuracy=0.0000

time >>: 73.65188813209534
----forwardtime: 5.2202184200286865 ----backwardtime 26.243772506713867
testtime: 6.42363977432251

Please ignore the accuracy (I used a special dataset).

System Information
Linux version 3.10.0-1062.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Wed Aug 7 18:08:02 UTC 2019

K8s Version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:25:59Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Kubeflow Version
v1.3

@allendred (Author)

PyTorch version: 1.7.1+cu102

Training code

# -*- coding: utf-8 -*-
import argparse
import os
import time
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms, models
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


def train(args, model, device, train_loader, optimizer, epoch, writer):
    model.train()
    LossFunc = nn.CrossEntropyLoss()
    rtime = 0         # accumulated forward time
    backwardtime = 0  # accumulated loss + backward + optimizer.step time
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        epstarttime = time.time()
        output = model(data)
        rtime += time.time() - epstarttime
        backstart = time.time()
        loss = LossFunc(output, target)  # F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        backwardtime += time.time() - backstart
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            niter = epoch * len(train_loader) + batch_idx
            writer.add_scalar('loss', loss.item(), niter)

        #rtime += time.time() - epstarttime
    return rtime, backwardtime

def test(args, model, device, test_loader, writer, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    rtime = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            epstarttime = time.time()
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss (output is raw logits; test_loss is computed but never reported)
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            rtime += time.time() - epstarttime

    test_loss /= len(test_loader.dataset)
    print('\naccuracy={:.4f}\n'.format(float(correct) / len(test_loader.dataset)))
    writer.add_scalar('accuracy', float(correct) / len(test_loader.dataset), epoch)
    return rtime


def main():
    """
    """
    # Training settings
    # s = time.time()
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--dataset', type=str, default='./data',
                        help='root of the dataset (expects train/ and val/ subfolders)')
    parser.add_argument('--batch-size', type=int, default=128, metavar='N',
                        help='input batch size for training (default: 128)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--model', type=str, default="",
                        help='For Saving the current Model')
    parser.add_argument('--dir', default='logs', metavar='L',
                        help='directory where summary logs are stored')
    parser.add_argument("--local_rank", type=int, default=0)
    if dist.is_available():
        parser.add_argument('--backend', type=str, help='Distributed backend',
                            choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.NCCL)
    args = parser.parse_args()
    local_rank = args.local_rank
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend=args.backend)
    print('>>> Using {}'.format(args.backend))
    
    # 是否使用cuda
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print('>>> Using CUDA')

    torch.manual_seed(args.seed)

    device = torch.device("cuda", local_rank)
    # dataset = datasets.ImageFolder(args.dataset+"/train", 
    #                    transform=transforms.Compose([
    #                        transforms.Resize([224, 224]),
    #                        transforms.ToTensor()
    #                    ]))
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = datasets.ImageFolder(
        args.dataset+"/train",
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))

    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    kwargs = {'num_workers': 4, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)

    # test_loader = torch.utils.data.DataLoader(
    #     datasets.ImageFolder(args.dataset+"/val", transform=transforms.Compose([
    #                        transforms.Resize([224, 224]),
    #                        transforms.ToTensor()
    #                    ])),
    #     batch_size=args.test_batch_size, shuffle=False, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(args.dataset+"/val", transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)


    model = models.resnet152().to(device)
    # model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    writer = SummaryWriter(log_dir=args.model)  # create the writer once rather than every epoch
    s = time.time()
    time_list = []
    for epoch in range(1, args.epochs + 1):
        train_sampler.set_epoch(epoch)  # reshuffle the DistributedSampler each epoch
        forwardtime, backwardtime = train(args, model, device, train_loader, optimizer, epoch, writer)
        testtime = test(args, model, device, test_loader, writer, epoch)
        now_time = time.time() - s
        print("time >>:",now_time)
        print("----forwardtime: ",forwardtime,'----backwardtime', backwardtime)
        print("testtime:",testtime)
        time_list.append(time.time() - s)

    if torch.distributed.get_rank() == 0:
        if not os.path.exists(args.model):
            os.makedirs(args.model)
        torch.save(model.state_dict(), args.model+"/model.pt")
    print(">>> total time:", time.time() - s )
    print(time_list)
    with open("./time.text", 'w', encoding='utf-8') as file:
        file.write(str(time_list)+"\n")
if __name__ == '__main__':
    main()

PyTorchJob YAML:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "test-209-async"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeName: m01
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: train-dir
            nfs:
              path: /data/nfs/graph/train/imagenet
              server: 192.168.0.176
          - name: script-dir
            nfs:
              path: /data/nfs/graph/script/train
              server: 192.168.0.176
          containers:
            - name: pytorch
              image: pytorch_171_dgl_072
              imagePullPolicy: IfNotPresent
              command: ["sh","-c","python -m torch.distributed.launch --nnodes=2 --nproc_per_node=2 --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}  /home/gnn/train/code/torch_multi_gpu.py --dataset=/home/gnn/train/data --model=/home/gnn/train/model",]
              resources:
                requests:
                  memory: "16Gi"
                  cpu: "4"
                  nvidia.com/gpu: 2
                limits:
                  memory: "16Gi"
                  cpu: "4"
                  nvidia.com/gpu: 2
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /home/gnn/train
                name: train-dir
              - mountPath: /home/gnn/script
                name: script-dir
          # hostIPC: true
          # hostNetwork: true
          # dnsPolicy: "ClusterFirstWithHostNet"
          # affinity:
          #   nodeAffinity:
          #     requiredDuringSchedulingIgnoredDuringExecution:
          #       nodeSelectorTerms:
          #         - matchExpressions:
          #             - key: kubernetes.io/hostname
          #               operator: In
          #               values:
          #                 - gpu-10-252-192-48     # limit master must be 48
          #                 # - gpu-10-252-192-49
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeName: m02
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: train-dir
            nfs:
              path: /data/nfs/graph/train/imagenet
              server: 192.168.0.176
          - name: script-dir
            nfs:
              path: /data/nfs/graph/script/train
              server: 192.168.0.176
          containers:
            - name: pytorch
              image: pytorch_171_dgl_072
              imagePullPolicy: IfNotPresent
              command: ["sh","-c","python -m torch.distributed.launch --nnodes=2 --nproc_per_node=2 --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}  /home/gnn/train/code/torch_multi_gpu.py --dataset=/home/gnn/train/data --model=/home/gnn/train/model",]

              resources:
                requests:
                  memory: "16Gi"
                  cpu: "4"
                  nvidia.com/gpu: 2
                limits:
                  memory: "16Gi"
                  cpu: "4"
                  nvidia.com/gpu: 2
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /home/gnn/train
                name: train-dir
              - mountPath: /home/gnn/script
                name: script-dir
          # hostIPC: true
          # hostNetwork: true
          # dnsPolicy: "ClusterFirstWithHostNet"
          # affinity:
          #   nodeAffinity:
          #     requiredDuringSchedulingIgnoredDuringExecution:
          #       nodeSelectorTerms:
          #         - matchExpressions:
          #             - key: kubernetes.io/hostname
          #               operator: In
          #               values:
          #                 - gpu-10-252-192-48
          #                 - gpu-10-252-192-49

@allendred (Author)

@gaocegege

@gaocegege (Member)

How about the network in the cluster?

@allendred (Author)

The time difference shows up mainly in the optimizer step (seconds):
same node: 9.449573755264282
different nodes: 78.1211314201355
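A side note (not from the thread): CUDA kernels and DDP's gradient all-reduce run asynchronously with respect to the Python code, so time.time() deltas taken without torch.cuda.synchronize() can attribute the communication cost to whichever call happens to block, often the optimizer step or the next loss.item(). A minimal sketch for more reliable per-phase timing, assuming the same variable names as the script above:

import time
import torch

def timed(fn, *args, **kwargs):
    # synchronize before and after so the delta covers finished GPU work,
    # not just kernel launches
    torch.cuda.synchronize()
    t0 = time.time()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, time.time() - t0

# inside the training loop:
#   output, t_fwd  = timed(model, data)
#   loss = LossFunc(output, target)
#   _, t_bwd  = timed(loss.backward)   # DDP's gradient all-reduce overlaps with this
#   _, t_step = timed(optimizer.step)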

@allendred (Author)

How about the network in the cluster?

qperf test:

tcp_bw:
    bw  =  1.14 GB/sec
tcp_lat:
    latency  =  63.2 us

Looks OK.
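For a rough sanity check (my own estimate, not from the thread): torchvision's resnet152 has about 60M parameters, so each step all-reduces roughly 240 MB of fp32 gradients; with a global batch of 512 that is about 30 steps per epoch, so even at the measured 1.14 GB/s a ring all-reduce needs on the order of 10 s per epoch of pure cross-node transfer, and NCCL over plain TCP through the pod network typically achieves well under that line rate, which would explain the gap. A sketch of the arithmetic:

# Back-of-envelope estimate; assumes fp32 gradients, a ring all-reduce,
# and torchvision's resnet152 parameter count.
import torchvision.models as models

params = sum(p.numel() for p in models.resnet152().parameters())  # ~60.2M
grad_bytes = params * 4                                           # fp32
world_size = 4                                                    # 2 pods x 2 GPUs
steps_per_epoch = 15018 // (128 * world_size) + 1                 # ~30
# each rank sends ~2*(N-1)/N of the gradient volume per all-reduce
per_step = 2 * (world_size - 1) / world_size * grad_bytes
bw = 1.14e9                                                       # measured tcp_bw, bytes/s
print(f"gradients per step      : {grad_bytes / 1e6:.0f} MB")
print(f"traffic per epoch       : {per_step * steps_per_epoch / 1e9:.1f} GB")
print(f"lower bound at qperf bw : {per_step * steps_per_epoch / bw:.0f} s/epoch")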

@yihaocs commented Sep 15, 2022

Hello, have you solved the training speed issue?

@allendred (Author)

After testing, I think it is a bandwidth problem.
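One way to check this directly (my suggestion, not from the thread) is to measure the effective all-reduce bandwidth with a small torch.distributed script launched exactly like the training job; the 60M-element tensor is an assumption chosen to roughly match resnet152's gradient size:

# allreduce_bench.py -- launch with the same torch.distributed.launch command
# as the training job; prints the effective NCCL all-reduce throughput.
import argparse
import time

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

numel = 60_000_000                      # ~resnet152 gradient volume in fp32
x = torch.randn(numel, device="cuda")
gb = numel * 4 / 1e9

for i in range(10):
    torch.cuda.synchronize()
    t0 = time.time()
    dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = time.time() - t0
    if dist.get_rank() == 0:
        print(f"iter {i}: {dt:.3f} s  (~{gb / dt:.2f} GB of gradients reduced per second)")

Running it once with both pods on the same node and once across nodes should show whether the all-reduce itself accounts for the slowdown.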

@Crazybean-lwb commented Dec 27, 2022

(quotes the PyTorch version, training code, and PyTorchJob YAML from allendred's comment above)

Putting the bandwidth problem aside, I have a question about DDP with torch.distributed.launch:
I have three nodes (each with 8 GPUs); how can I run multi-node DDP via torch.distributed.launch with the training-operator?

I have tried a YAML with 1 master + 2 workers; the Worker spec is as follows:

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
              command: ["sh","-c",
                                  "python -m torch.distributed.launch 
                                   --nnodes=3 
                                   --nproc_per_node=8
                                   --node_rank=${RANK} 
                                   --master_addr=${MASTER_ADDR} 
                                   --master_port=${MASTER_PORT}  
                                   /home/jovyan/ddp/ddp-mul-gpu.py"
                                   ]
              imagePullPolicy: "Always"
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /home/jovyan
                  name: workspace-kaggle
              resources:
                limits:
                  memory: "10Gi"
                  cpu: "8"
                  nvidia.com/gpu: 8

Log output from the workers:
[screenshot of worker logs: 飞书20221227-190050]

The log seems to show that the shell cannot read the environment variables from the pod.
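Not an answer from the thread, but a quick way to check whether those variables are actually set inside the container is to print them; the names are the ones the launch command above already relies on:

# check_env.py -- run inside a worker pod, e.g. `kubectl exec <pod> -- python check_env.py`
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(key, "=", os.environ.get(key, "<missing>"))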
