
Train with Multi GPU #11

Closed
kevinchow1993 opened this issue May 7, 2021 · 5 comments
Labels: bug, enhancement, in-depth


@kevinchow1993

When training with multiple GPUs, the current code raises a StopIteration error. The immediate cause is that one GPU is assigned less data than the others; the root cause is that in create_X_L_file() and create_X_U_file() in active_datasets.py, multiple GPUs write the same txt file at the same time, so the GPU that finishes writing first reads an incomplete txt when it builds its dataloader.
Solution (a combined sketch of both changes appears after the list):

  1. In these two functions, sleep for a short random interval before writing the file, so that the writes are staggered:
    time.sleep(random.uniform(0, 3))
    if not osp.exists(save_path):
        mmcv.mkdir_or_exist(save_folder)
        np.savetxt(save_path, ann[X_L_single], fmt='%s')
  2. In tools/train.py, synchronize the processes on all GPUs after each create_xx_file call by adding:
          if dist.is_initialized():
              torch.distributed.barrier()
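
A combined sketch of both changes (not verbatim from the repository; the helper name write_X_L_split is hypothetical, while save_folder, ann, X_L_single and year follow the snippets in this thread):

# active_datasets.py -- sketch of the staggered write.
import os.path as osp
import random
import time

import mmcv
import numpy as np

def write_X_L_split(save_folder, ann, X_L_single, year):  # hypothetical helper
    # Sleep a random amount so the processes do not write the file at the same time.
    time.sleep(random.uniform(0, 3))
    save_path = save_folder + '/trainval_X_L_' + year + '.txt'
    if not osp.exists(save_path):
        # Only the process that arrives first actually creates and writes the file.
        mmcv.mkdir_or_exist(save_folder)
        np.savetxt(save_path, ann[X_L_single], fmt='%s')
    return save_path

# tools/train.py -- after each create_X_L_file / create_X_U_file call:
import torch.distributed as dist

if dist.is_initialized():
    dist.barrier()  # wait until every process has finished writing/reading the split files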
yuantn added the bug, enhancement, and in-depth labels on May 7, 2021
yuantn (Owner) commented May 7, 2021

Many thanks! This would be very useful for training on multiple GPUs.

@chufengt

@kevinchow1993
Hello, I modified the code based on your description, but I still run into the StopIteration error. Do you have any suggestions?

  1. Modify the following part in create_X_U_file and create_X_L_file:
time.sleep(random.uniform(0,3))  
save_path = save_folder + '/trainval_X_U_' + year + '.txt'  
if not osp.exists(save_path):  
    mmcv.mkdir_or_exist(save_folder)  
    np.savetxt(save_path, ann[X_U_single], fmt='%s')  
X_U_path.append(save_path)  
  2. In train.py, add the thread synchronization code after every create_xx_file call.

yuantn (Owner) commented Aug 23, 2021

If it does not work, I think you can also try this:
Add a condition before create_X_L_file and create_X_U_file in tools/train.py:

if torch.cuda.current_device() == 0:

Add thread synchronization after create_X_L_file and create_X_U_file:

if dist.is_initialized():
    torch.distributed.barrier()
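
A minimal sketch of how this might look at the call site in tools/train.py (the create_X_L_file signature follows the snippet later in this thread; the analogous create_X_U_file call is an assumption):

import torch
import torch.distributed as dist

# Only rank 0 (re)writes the split files, so there is no concurrent write at all.
if torch.cuda.current_device() == 0:
    cfg = create_X_L_file(cfg, X_L, all_anns, cycle)
    cfg = create_X_U_file(cfg, X_U, all_anns, cycle)  # hypothetical analogous call for the unlabeled split
# The other GPUs wait here until rank 0 has finished writing.
if dist.is_initialized():
    dist.barrier()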


chufengt commented Aug 23, 2021

@yuantn
I made the following modification:

if torch.cuda.current_device() == 0:
    cfg = create_X_L_file(cfg, X_L, all_anns, cycle)
if dist.is_initialized():
    torch.distributed.barrier()

It gets stuck the first time a checkpoint is saved.

yuantn (Owner) commented Aug 23, 2021

Is it also necessary to distribute the returned cfg to each GPU? The code is as follows:

if torch.cuda.current_device() == 0:
    cfg_save = create_X_L_file(cfg, X_L, all_anns, cycle)
    joblib.dump(cfg_save, 'cfg_save.tmp')
if dist.is_initialized():
    torch.distributed.barrier()
cfg = joblib.load("cfg_save.tmp")
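
An alternative sketch (not from the repository): share the updated cfg through torch.distributed.broadcast_object_list instead of a temporary file. This needs PyTorch >= 1.8 and a picklable cfg; the create_X_L_file call follows the snippets above.

import torch
import torch.distributed as dist

if torch.cuda.current_device() == 0:
    cfg = create_X_L_file(cfg, X_L, all_anns, cycle)
if dist.is_initialized():
    # broadcast_object_list pickles the object on rank 0 and fills the list
    # in place on the other ranks, so no temporary file is needed.
    obj = [cfg]
    dist.broadcast_object_list(obj, src=0)
    cfg = obj[0]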
