hszhoushen changed the title from "bgnn problem" to "bgnn training problem during validation processing (images_per_batch can only be one when at validation process)" on Jul 1, 2022
When training on a Tesla V100, training on the VG dataset can be fed 12 images at a time. During validation, however, it seems that one card can only validate one image at a time. Is there any way to validate 12 images at once during validation?
Training .sh:
python tools/relation_train_net.py \
    --config-file "configs/e2e_relBGNN_vg.yaml" \
    DEBUG False \
    EXPERIMENT_NAME "BGNN-PreCls" \
    SOLVER.IMS_PER_BATCH $[3*4] \
    TEST.IMS_PER_BATCH $[4] \
    SOLVER.VAL_PERIOD 3000 \
    SOLVER.CHECKPOINT_PERIOD 3000 \
    MODEL.ROI_RELATION_HEAD.USE_GT_BOX True \
    MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL True \
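As a rough sanity check on the batch sizes, the sketch below relates the overrides in this command to the iteration count that the validation progress bar reports for the 5000-image split (`0/417`). It assumes the progress bar counts batches and that the final partial batch is kept. `num_batches` and `batch_sizes_matching` are hypothetical helper names for illustration, not part of pysgg.

```python
import math


def num_batches(num_images: int, batch_size: int) -> int:
    """Number of iterations when the final partial batch is kept."""
    return math.ceil(num_images / batch_size)


def batch_sizes_matching(num_images: int, iterations: int, max_batch: int = 64) -> list:
    """All batch sizes up to max_batch that yield the observed iteration count."""
    return [
        b for b in range(1, max_batch + 1)
        if num_batches(num_images, b) == iterations
    ]


# Values taken from the command and the log in this report.
train_batch = 3 * 4      # SOLVER.IMS_PER_BATCH $[3*4]
test_batch = 4           # TEST.IMS_PER_BATCH $[4]
val_images = 5000        # "(5000 images)" in the evaluation log line
observed_iters = 417     # "0/417" in the progress bar

print(num_batches(val_images, 1))            # iterations at one image per step
print(num_batches(val_images, test_batch))   # iterations at four images per step
print(num_batches(val_images, train_batch))  # iterations at twelve images per step
print(batch_sizes_matching(val_images, observed_iters))
```

Under those assumptions, only a batch size of 12 reproduces 417 iterations, which would suggest that validation already processes more than one image per step. Whether that figure is per card or summed over all cards cannot be told from the log alone.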
Problem encountered:
```
instance name: sgdet-BGNNPredictor/(2022-07-01_13)BGNN-PreCls(resampling)
elapsed time: 0:06:51
eta: 3 days, 7:48:18
iter: 100/70000
loss: 0.6129 (0.7214)
loss_rel: 0.1183 (0.1323)
pre_rel_classify_loss_iter-0: 0.1641 (0.2069)
pre_rel_classify_loss_iter-1: 0.1628 (0.1891)
pre_rel_classify_loss_iter-2: 0.1618 (0.1932)
time: 3.9448 (4.1101)
data: 0.0559 (0.0689)
lr: 0.026707
max mem: 19994
[07/01 13:31:28 pysgg]: relness module pretraining..
[07/01 13:31:28 pysgg]: Start validating
[07/01 13:31:28 pysgg]: Start evaluation on VG_stanford_filtered_with_attribute_val dataset(5000 images).
0%| | 0/417 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "tools/relation_train_net.py", line 714, in <module>
    main()
  File "tools/relation_train_net.py", line 705, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 496, in train
    val_result = run_val(cfg, model, val_data_loaders, distributed, logger)
  File "tools/relation_train_net.py", line 565, in run_val
    logger=logger,
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/engine/inference.py", line 123, in inference
    timer=inference_timer, logger=logger)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/engine/inference.py", line 41, in compute_on_dataset
    output = model(images.to(device), targets, logger=logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/roi_heads.py", line 69, in forward
    x, detections, loss_relation = self.relation(features, detections, targets, logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/relation_head.py", line 215, in forward
    logger,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 604, in forward
    roi_features, union_features, inst_proposals, rel_pair_idxs, rel_binarys, logger
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/model_bgnn.py", line 796, in forward
    rel_pair_inds,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/model_msg_passing.py", line 261, in forward
    obj_embed_by_pred_dist = self.obj_embed_on_prob_dist(obj_labels.long())
AttributeError: 'NoneType' object has no attribute 'long'
```
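The last frame of the traceback is a method call on `None`. The sketch below reproduces that message without PyTorch and shows one possible way to fail earlier with a more descriptive error. `FakeLabels` and `require_labels` are hypothetical stand-ins for illustration only; they are not pysgg code, and the sketch does not establish why `obj_labels` is `None` during validation.

```python
class FakeLabels:
    """Hypothetical stand-in for a label tensor; it only mimics .long()."""

    def __init__(self, values):
        self.values = list(values)

    def long(self):
        return [int(v) for v in self.values]


def require_labels(obj_labels, context="validation"):
    """Raise a descriptive error instead of letting None reach .long()."""
    if obj_labels is None:
        raise ValueError(
            f"obj_labels is None during {context}; "
            "check that the label source is populated"
        )
    return obj_labels.long()


# Calling .long() on None reproduces the message from the traceback.
obj_labels = None
try:
    obj_labels.long()
except AttributeError as err:
    message = str(err)
print(message)

# The guard turns the same situation into an explicit, earlier failure.
try:
    require_labels(None)
except ValueError as err:
    print(err)

# With labels present, the guard simply forwards to .long().
print(require_labels(FakeLabels([1.0, 2.0])))
```

A guard like this only makes the failure easier to read. The underlying question of which code path leaves the labels unset at validation time still has to be answered in the model code itself.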