Training Issue #134

Open
AymanMIbrahim opened this issue Jul 7, 2023 · 11 comments
Comments

@AymanMIbrahim

The training script doesn't read the image path or the camera path:

use image path: /home/paperspace/Get3d_Updated/GET3D/Render_Image/, num images: 0
Traceback (most recent call last):
File "train_3d.py", line 330, in <module>
main() # pylint: disable=no-value-for-parameter
File "/home/paperspace/miniconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/paperspace/miniconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/paperspace/miniconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/paperspace/miniconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "train_3d.py", line 324, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "train_3d.py", line 103, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "train_3d.py", line 49, in subprocess_fn
training_loop_3d.training_loop(rank=rank, **c)
File "/home/paperspace/Get3d_Updated/GET3D/training/training_loop_3d.py", line 134, in training_loop
training_set_sampler = misc.InfiniteSampler(
File "/home/paperspace/Get3d_Updated/GET3D/torch_utils/misc.py", line 120, in __init__
assert len(dataset) > 0
AssertionError

@SteveJunGao
Collaborator

Hi @AymanMIbrahim,

The error seems to mean that the dataset is empty. Can you share your training command so I can check the paths to the images and cameras?

@song-wensong

Hello @SteveJunGao
I came across the same problem. The training command I used is:
srun -p Ai4sci_3D --gres=gpu:1 --ntasks-per-node=1 --job-name=get3d_car python train_3d.py --outdir=PATH_TO_LOG --data=/mnt/petrelfs/songwensong/GET3D/render_shapenet_data/save_image/img --camera_path /mnt/petrelfs/songwensong/GET3D/render_shapenet_data/save_image/camera --gpus=1 --batch=4 --gamma=40 --data_camera_mode shapenet_car --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0

I am running the code on a cluster.

@SteveJunGao
Collaborator

Hi @song-wensong,

I guess you need to append one more directory name to the --data path. A correct command looks something like this:

--data=RENDER/img/03790512 --camera_path=RENDER/camera
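
A quick way to test this before launching training is to count what sits under the --data path. The sketch below is not GET3D code; it assumes a layout of `<data>/<object folder>/<image files>`, which is inferred from the maintainer's example above and from the log lines that report a folder count next to an image count. The function name `summarize_data_dir` and the extension list are made up for illustration.

```python
import os
import tempfile

# Assumed image extensions; adjust if your renders use something else.
IMAGE_EXTS = (".png", ".jpg", ".jpeg")


def summarize_data_dir(data_dir):
    """Return (folders_with_images, total_images) for data_dir/<object>/<image>."""
    folders = 0
    images = 0
    for name in sorted(os.listdir(data_dir)):
        sub = os.path.join(data_dir, name)
        if not os.path.isdir(sub):
            continue
        # Count only image files that sit directly inside the object folder.
        count = sum(1 for f in os.listdir(sub) if f.lower().endswith(IMAGE_EXTS))
        if count:
            folders += 1
            images += count
    return folders, images


if __name__ == "__main__":
    # Build a tiny fake render tree: img/03790512/<object>/<image>.png
    with tempfile.TemporaryDirectory() as root:
        category = os.path.join(root, "img", "03790512")
        for obj in ("obj_a", "obj_b"):
            os.makedirs(os.path.join(category, obj))
            for i in range(3):
                open(os.path.join(category, obj, f"{i:03d}.png"), "w").close()
        # One level too high: the only subfolder holds folders, not images.
        print("img/          ->", summarize_data_dir(os.path.join(root, "img")))
        # The category folder: two object folders with three images each.
        print("img/03790512/ ->", summarize_data_dir(category))
```

If the second number is 0 for the path you pass as --data, the training script will probably hit the same `assert len(dataset) > 0`.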

@EadmondDai

EadmondDai commented Nov 3, 2023

I am having the same issue:
==> start
==> use shapenet dataset
==> use shapenet folder number 0
==> use image path: /mnt/c/AIData/img/04460130, num images: 0
==> launch training

Training options:
{
"G_kwargs": {
"class_name": "training.networks_get3d.GeneratorDMTETMesh",
"z_dim": 512,
"w_dim": 512,
"mapping_kwargs": {
"num_layers": 8
},
"iso_surface": "flexicubes",
"one_3d_generator": true,
"n_implicit_layer": 1,
"deformation_multiplier": 1.0,
"use_style_mixing": true,
"dmtet_scale": 1.0,
"feat_channel": 16,
"mlp_latent_channel": 32,
"tri_plane_resolution": 256,
"n_views": 1,
"render_type": "neural_render",
"use_tri_plane": true,
"tet_res": 90,
"geometry_type": "conv3d",
"data_camera_mode": "shapenet_car",
"channel_base": 32768,
"channel_max": 512,
"fused_modconv_default": "inference_only"
},
"D_kwargs": {
"class_name": "training.networks_get3d.Discriminator",
"block_kwargs": {
"freeze_layers": 0
},
"mapping_kwargs": {},
"epilogue_kwargs": {
"mbstd_group_size": 4
},
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"channel_base": 32768,
"channel_max": 512,
"architecture": "skip"
},
"G_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"D_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"loss_kwargs": {
"class_name": "training.loss.StyleGAN2Loss",
"gamma_mask": 40.0,
"r1_gamma": 40.0,
"lambda_flexicubes_surface_reg": 0.5,
"lambda_flexicubes_weights_reg": 0.1,
"style_mixing_prob": 0.9,
"pl_weight": 0.0
},
"data_loader_kwargs": {
"pin_memory": true,
"prefetch_factor": 2,
"num_workers": 3
},
"inference_vis": false,
"training_set_kwargs": {
"class_name": "training.dataset.ImageFolderDataset",
"path": "/mnt/c/AIData/img/04460130",
"use_labels": false,
"max_size": 0,
"xflip": false,
"resolution": 1024,
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"camera_path": "=/mnt/c/AIData/camera",
"split": "train",
"random_seed": 0
},
"resume_pretrain": null,
"D_reg_interval": 16,
"num_gpus": 1,
"batch_size": 4,
"batch_gpu": 4,
"metrics": [
"fid50k"
],
"total_kimg": 20000,
"kimg_per_tick": 1,
"image_snapshot_ticks": 50,
"network_snapshot_ticks": 200,
"random_seed": 0,
"ema_kimg": 1.25,
"G_reg_interval": 4,
"run_dir": "result/00031-stylegan2-04460130-gpus1-batch4-gamma40"
}

Output directory: result/00031-stylegan2-04460130-gpus1-batch4-gamma40
Number of GPUs: 1
Batch size: 4 images
Training duration: 20000 kimg
Dataset path: /mnt/c/AIData/img/04460130
Dataset size: 0 images
Dataset resolution: 1024
Dataset labels: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
Loading training set...
------------ what in the training set kwargs ----------------- {'class_name': 'training.dataset.ImageFolderDataset', 'path': '/mnt/c/AIData/img/04460130', 'use_labels': False, 'max_size': 0, 'xflip': False, 'resolution': 1024, 'data_camera_mode': 'shapenet_car', 'add_camera_cond': True, 'camera_path': '=/mnt/c/AIData/camera', 'split': 'train', 'random_seed': 0}
==> use shapenet dataset
==> use shapenet folder number 0
==> use image path: /mnt/c/AIData/img/04460130, num images: 0
------------- of course this is null -------------------- <training.dataset.ImageFolderDataset object at 0x7ff0c3c6a2b0>
Traceback (most recent call last):
File "train_3d.py", line 337, in <module>
main() # pylint: disable=no-value-for-parameter
File "/home/ead/anaconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/ead/anaconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ead/anaconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ead/anaconda3/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "train_3d.py", line 331, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "train_3d.py", line 103, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "train_3d.py", line 49, in subprocess_fn
training_loop_3d.training_loop(rank=rank, **c)
File "/home/ead/docker/GET3D/training/training_loop_3d.py", line 136, in training_loop
training_set_sampler = misc.InfiniteSampler(
File "/home/ead/docker/GET3D/torch_utils/misc.py", line 120, in __init__
assert len(dataset) > 0
AssertionError

The command I use is: python train_3d.py --outdir=result --data=/mnt/c/AIData/img/04460130 --camera_path =/mnt/c/AIData/camera --gpus=1 --batch=4 --gamma=40 --data_camera_mode shapenet_car --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0 --iso_surface flexicubes

I have checked the /mnt/c/AIData/img/04460130 folder, and I am pretty sure it contains a bunch of training data.
This problem is very strange to me. I ran the code in WSL (Ubuntu 22.04) inside a conda environment.
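
One detail in the log above may be worth checking: the training_set_kwargs dump records camera_path as "=/mnt/c/AIData/camera", with a leading "=". That matches the command, where "--camera_path" is followed by a space and then "=/mnt/...". The sketch below reproduces the parsing with the standard-library argparse as a stand-in for click (which train_3d.py uses, judging by the traceback); the logged value suggests click treated the argument the same way. Whether this is what makes the dataset come up empty is not established here. The helper names are hypothetical.

```python
import argparse
import shlex


def parse_paths(command_tail):
    """Parse only --data and --camera_path from a command-line fragment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data")
    parser.add_argument("--camera_path")
    args = parser.parse_args(shlex.split(command_tail))
    return args.data, args.camera_path


def has_stray_equals(value):
    """Flag values that begin with '=', which usually means '--opt =value' was typed."""
    return value is not None and value.startswith("=")


if __name__ == "__main__":
    # "--camera_path =/x": the "=" becomes part of the value.
    spaced = parse_paths("--data=/mnt/c/AIData/img/04460130 --camera_path =/mnt/c/AIData/camera")
    # "--camera_path=/x" (or "--camera_path /x"): the value is the bare path.
    joined = parse_paths("--data=/mnt/c/AIData/img/04460130 --camera_path=/mnt/c/AIData/camera")
    for label, (data, camera) in (("spaced", spaced), ("joined", joined)):
        print(label, repr(camera), "stray '=':", has_stray_equals(camera))
```

Writing either `--camera_path=/mnt/c/AIData/camera` or `--camera_path /mnt/c/AIData/camera` avoids the stray character.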

@Bathsheba

In Windows WSL you may want to put the data in the WSL filesystem. If it is in /mnt/*, I don't know whether that will affect GET3D finding the data, but I wouldn't rule it out. Access will certainly be slow.

@EadmondDai

EadmondDai commented Nov 3, 2023

After some trial and error, I think I got training to start on my copy, but another problem emerged. I am posting it here for future reference.
python train_3d.py --outdir=result --data=/home/ead/AIData/img/04460130 --camera_path /home/ead/AIData/camera --gpus=1 --batch=4 --gamma=40 --data_camera_mode shapenet_car --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0 --iso_surface flexicubes
/home/ead/anaconda3/envs/get3d/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 35 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

==> start
==> use shapenet dataset
==> use shapenet folder number 70
==> use image path: /home/ead/AIData/img/04460130, num images: 1680
==> launch training

Training options:
{
"G_kwargs": {
"class_name": "training.networks_get3d.GeneratorDMTETMesh",
"z_dim": 512,
"w_dim": 512,
"mapping_kwargs": {
"num_layers": 8
},
"iso_surface": "flexicubes",
"one_3d_generator": true,
"n_implicit_layer": 1,
"deformation_multiplier": 1.0,
"use_style_mixing": true,
"dmtet_scale": 1.0,
"feat_channel": 16,
"mlp_latent_channel": 32,
"tri_plane_resolution": 256,
"n_views": 1,
"render_type": "neural_render",
"use_tri_plane": true,
"tet_res": 90,
"geometry_type": "conv3d",
"data_camera_mode": "shapenet_car",
"channel_base": 32768,
"channel_max": 512,
"fused_modconv_default": "inference_only"
},
"D_kwargs": {
"class_name": "training.networks_get3d.Discriminator",
"block_kwargs": {
"freeze_layers": 0
},
"mapping_kwargs": {},
"epilogue_kwargs": {
"mbstd_group_size": 4
},
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"channel_base": 32768,
"channel_max": 512,
"architecture": "skip"
},
"G_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"D_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"loss_kwargs": {
"class_name": "training.loss.StyleGAN2Loss",
"gamma_mask": 40.0,
"r1_gamma": 40.0,
"lambda_flexicubes_surface_reg": 0.5,
"lambda_flexicubes_weights_reg": 0.1,
"style_mixing_prob": 0.9,
"pl_weight": 0.0
},
"data_loader_kwargs": {
"pin_memory": true,
"prefetch_factor": 2,
"num_workers": 3
},
"inference_vis": false,
"training_set_kwargs": {
"class_name": "training.dataset.ImageFolderDataset",
"path": "/home/ead/AIData/img/04460130",
"use_labels": false,
"max_size": 1680,
"xflip": false,
"resolution": 1024,
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"camera_path": "/home/ead/AIData/camera",
"split": "train",
"random_seed": 0
},
"resume_pretrain": null,
"D_reg_interval": 16,
"num_gpus": 1,
"batch_size": 4,
"batch_gpu": 4,
"metrics": [
"fid50k"
],
"total_kimg": 20000,
"kimg_per_tick": 1,
"image_snapshot_ticks": 50,
"network_snapshot_ticks": 200,
"random_seed": 0,
"ema_kimg": 1.25,
"G_reg_interval": 4,
"run_dir": "result/00060-stylegan2-04460130-gpus1-batch4-gamma40"
}

Output directory: result/00060-stylegan2-04460130-gpus1-batch4-gamma40
Number of GPUs: 1
Batch size: 4 images
Training duration: 20000 kimg
Dataset path: /home/ead/AIData/img/04460130
Dataset size: 1680 images
Dataset resolution: 1024
Dataset labels: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
Loading training set...
==> use shapenet dataset
==> use shapenet folder number 70
==> use image path: /home/ead/AIData/img/04460130, num images: 1680

Num images: 1680
Image shape: [3, 1024, 1024]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 20000 kimg...

tick 0 kimg 0.0 time 26s sec/tick 16.5 sec/kimg 4134.89 maintenance 9.2
==> start visualization
/home/ead/docker/GET3D/training/networks_get3d.py:467: UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
camera_theta = torch.range(0, n_camera - 1, device=self.device).unsqueeze(dim=-1) / n_camera * math.pi * 2.0
==> saved visualization
Evaluating metrics...
====> use validation set
==> use shapenet dataset
==> use shapenet folder number 29
==> use image path: /home/ead/AIData/img/04460130, num images: 696
==> preparing the cache for fid scores
{'class_name': 'training.dataset.ImageFolderDataset', 'path': '/home/ead/AIData/img/04460130', 'use_labels': False, 'max_size': None, 'xflip': False, 'resolution': 1024, 'data_camera_mode': 'shapenet_car', 'add_camera_cond': True, 'camera_path': '/home/ead/AIData/camera', 'split': 'val', 'random_seed': 0}
0%| | 0/11 [00:00<?, ?it/s]/home/ead/anaconda3/envs/get3d/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
27%|##7 | 3/11 [00:09<00:18, 2.25s/it]Killed
(get3d) ead@BF-Work:~/docker/GET3D$ /home/ead/anaconda3/envs/get3d/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 35 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

@SteveJunGao
Collaborator

Hi @EadmondDai,

From the error message you posted, it seems the process was killed due to a lack of resources. What is your training platform (e.g. the GPU and CPU, their memory sizes, and the operating system)?
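
A bare "Killed" with no Python traceback usually means the kernel stopped the process, and running out of host RAM is a common reason. To answer the memory part of the question, here is a small sketch that reads the Linux /proc/meminfo file (also present under WSL). It only reports host memory, not GPU memory, and the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   12345 kB" (the unit is optional).
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def to_gib(kib):
    """Convert a kB figure from /proc/meminfo to GiB."""
    return kib / (1024 ** 2)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
    except OSError:
        print("/proc/meminfo is not available on this system")
    else:
        print(f"MemTotal:     {to_gib(info.get('MemTotal', 0)):.1f} GiB")
        print(f"MemAvailable: {to_gib(info.get('MemAvailable', 0)):.1f} GiB")
```

Running it just before training, and again in a second terminal while the FID cache is being prepared, shows whether available memory drops toward zero before the process is killed.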

@Remember12344

Could I train a model using only an RTX 4090?

@ben14132

ben14132 commented Jul 1, 2024

Hi, I am having a similar issue.

==> start
==> use shapenet dataset
==> use shapenet folder number 0
==> use image path: /home/jovyan/render_shapenet_data/content/GET3D/render_shapenet_data/save/img/14bb2e591332db56b0be6ed024602be5, num images: 0
==> launch training

Training options:
{
"G_kwargs": {
"class_name": "training.networks_get3d.GeneratorDMTETMesh",
"z_dim": 512,
"w_dim": 512,
"mapping_kwargs": {
"num_layers": 8
},
"iso_surface": "dmtet",
"one_3d_generator": true,
"n_implicit_layer": 1,
"deformation_multiplier": 1.0,
"use_style_mixing": true,
"dmtet_scale": 1.0,
"feat_channel": 16,
"mlp_latent_channel": 32,
"tri_plane_resolution": 256,
"n_views": 1,
"render_type": "neural_render",
"use_tri_plane": true,
"tet_res": 90,
"geometry_type": "conv3d",
"data_camera_mode": "shapenet_car",
"channel_base": 32768,
"channel_max": 512,
"fused_modconv_default": "inference_only"
},
"D_kwargs": {
"class_name": "training.networks_get3d.Discriminator",
"block_kwargs": {
"freeze_layers": 0
},
"mapping_kwargs": {},
"epilogue_kwargs": {
"mbstd_group_size": 4
},
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"channel_base": 32768,
"channel_max": 512,
"architecture": "skip"
},
"G_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"D_opt_kwargs": {
"class_name": "torch.optim.Adam",
"betas": [
0,
0.99
],
"eps": 1e-08,
"lr": 0.002
},
"loss_kwargs": {
"class_name": "training.loss.StyleGAN2Loss",
"gamma_mask": 40.0,
"r1_gamma": 40.0,
"lambda_flexicubes_surface_reg": 0.5,
"lambda_flexicubes_weights_reg": 0.1,
"style_mixing_prob": 0.9,
"pl_weight": 0.0
},
"data_loader_kwargs": {
"pin_memory": true,
"prefetch_factor": 2,
"num_workers": 3
},
"inference_vis": false,
"training_set_kwargs": {
"class_name": "training.dataset.ImageFolderDataset",
"path": "/home/jovyan/render_shapenet_data/content/GET3D/render_shapenet_data/save/img/14bb2e591332db56b0be6ed024602be5",
"use_labels": false,
"max_size": 0,
"xflip": false,
"resolution": 1024,
"data_camera_mode": "shapenet_car",
"add_camera_cond": true,
"camera_path": "/home/jovyan/render_shapenet_data/content/GET3D/render_shapenet_data/save/camera",
"split": "train",
"random_seed": 0
},
"resume_pretrain": null,
"D_reg_interval": 16,
"num_gpus": 1,
"batch_size": 32,
"batch_gpu": 4,
"metrics": [
"fid50k"
],
"total_kimg": 20000,
"kimg_per_tick": 1,
"image_snapshot_ticks": 50,
"network_snapshot_ticks": 200,
"random_seed": 0,
"ema_kimg": 10.0,
"G_reg_interval": 4,
"run_dir": "/home/jovyan/results/00004-stylegan2-14bb2e591332db56b0be6ed024602be5-gpus1-batch32-gamma40"
}

Output directory: /home/jovyan/results/00004-stylegan2-14bb2e591332db56b0be6ed024602be5-gpus1-batch32-gamma40
Number of GPUs: 1
Batch size: 32 images
Training duration: 20000 kimg
Dataset path: /home/jovyan/render_shapenet_data/content/GET3D/render_shapenet_data/save/img/14bb2e591332db56b0be6ed024602be5
Dataset size: 0 images
Dataset resolution: 1024
Dataset labels: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Setting up PyTorch plugin "upfirdn2d_plugin"... /usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Done.
Setting up PyTorch plugin "bias_act_plugin"... /usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... /usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Done.
Loading training set...
==> use shapenet dataset
==> use shapenet folder number 0
==> use image path: /home/jovyan/render_shapenet_data/content/GET3D/render_shapenet_data/save/img/14bb2e591332db56b0be6ed024602be5, num images: 0
Traceback (most recent call last):
File "train_3d.py", line 337, in <module>
main() # pylint: disable=no-value-for-parameter
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "train_3d.py", line 331, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "train_3d.py", line 103, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "train_3d.py", line 49, in subprocess_fn
training_loop_3d.training_loop(rank=rank, **c)
File "/home/jovyan/GET3D/training/training_loop_3d.py", line 134, in training_loop
training_set_sampler = misc.InfiniteSampler(
File "/home/jovyan/GET3D/torch_utils/misc.py", line 120, in __init__
assert len(dataset) > 0
AssertionError

Could I check what the structure of the data folder should look like?

@Remember12344

Remember12344 commented Jul 1, 2024 via email

@ben14132

ben14132 commented Jul 1, 2024

Hi @Remember12344
The training images are in the folder, but I am still having this issue. Could I check whether the data path should point directly to the folder containing the images?
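
One way to probe this question, going by the maintainer's earlier example (--data=RENDER/img/03790512) and the "use shapenet folder number 0" line in the log: the loader appears to count folders under --data, so a path that points straight at a single folder of images may be one level too deep. This is an inference from the thread, not a confirmed description of GET3D's loader. The sketch below classifies a path under that assumption and proposes the parent when the images sit directly inside it; `classify_level` and `suggest_data_path` are hypothetical helpers.

```python
import os
import tempfile

# Assumed image extensions; adjust if your renders use something else.
IMAGE_EXTS = (".png", ".jpg", ".jpeg")


def _has_images(folder):
    """True if the folder directly contains at least one image file."""
    return any(
        name.lower().endswith(IMAGE_EXTS)
        and os.path.isfile(os.path.join(folder, name))
        for name in os.listdir(folder)
    )


def classify_level(path):
    """Return 'category', 'object', or 'unknown' for an assumed img/<category>/<object>/<image> tree."""
    subdirs = [
        os.path.join(path, name)
        for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    ]
    if any(_has_images(d) for d in subdirs):
        return "category"  # holds per-object folders that contain images
    if _has_images(path):
        return "object"    # images sit directly here
    return "unknown"


def suggest_data_path(path):
    """Suggest a --data candidate: the path itself, its parent, or None."""
    level = classify_level(path)
    if level == "category":
        return os.path.abspath(path)
    if level == "object":
        return os.path.dirname(os.path.abspath(path))
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        obj = os.path.join(root, "img", "category_id", "object_id")
        os.makedirs(obj)
        open(os.path.join(obj, "000.png"), "w").close()
        print(classify_level(obj), "->", suggest_data_path(obj))
```

If the folder you pass is classified as "object", trying its parent as --data is a cheap experiment.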
