This repository was archived by the owner on Nov 21, 2023. It is now read-only.

GPU utilization becomes zero after long-term training #19

Closed
gaopeng-eugene opened this issue Jan 24, 2018 · 5 comments

Comments

@gaopeng-eugene

screen shot 2018-01-24 at 4 06 04 pm

screen shot 2018-01-24 at 4 05 56 pm

@gaopeng-eugene
Author

python2 tools/train_net.py
--cfg configs/getting_started/tutorial_8gpu_e2e_faster_rcnn_R-50-FPN.yaml
OUTPUT_DIR /tmp/detectron-output

Problem: after training for a long time, GPU utilization drops to zero, while the CPU is still using a lot of resources.

@gaopeng-eugene
Author

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:230] error == cudaSuccess. 4 vs 0. Error at: /home/pgao/caffe2/caffe2/caffe2/core/context_gpu.h:230: unspecified launch failure Error from operator:
input: "gpu_0/res5_0_branch2c_w_grad" output: "gpu_2/res5_0_branch2c_w_grad" name: "" type: "Copy" device_option { device_type: 1 cuda_gpu_id: 2 }
terminate called recursively
*** Aborted at 1516782983 (unix time) try "date -d @1516782983" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
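
For readers triaging a similar crash, the operator dump in the log above can be picked apart to see which GPUs were involved. The following is only a minimal sketch: `parse_failed_op` is a hypothetical helper (not part of Caffe2 or Detectron), and it assumes the dump has the same `key: "value"` layout as the line quoted in this comment.

```python
import re


def parse_failed_op(op_text):
    """Extract op type, blobs and device GPU id from an operator dump.

    Hypothetical helper for illustration; assumes the text looks like the
    'Error from operator' line quoted in this issue.
    """
    # Collect every key: "value" pair (input, output, name, type).
    fields = dict(re.findall(r'(\w+): "([^"]*)"', op_text))
    # The device id is an unquoted integer inside device_option { ... }.
    gpu = re.search(r'cuda_gpu_id: (\d+)', op_text)
    return {
        "type": fields.get("type"),
        "input": fields.get("input"),
        "output": fields.get("output"),
        "device_gpu": int(gpu.group(1)) if gpu else None,
    }


# Operator dump copied from the error message in this issue.
op_line = (
    'input: "gpu_0/res5_0_branch2c_w_grad" '
    'output: "gpu_2/res5_0_branch2c_w_grad" name: "" type: "Copy" '
    'device_option { device_type: 1 cuda_gpu_id: 2 }'
)
info = parse_failed_op(op_line)
print(info)
```

Here the failing operator is a `Copy` of a gradient blob from the `gpu_0/` namespace to the `gpu_2/` namespace, i.e. a cross-GPU transfer of the kind used when gradients are combined across devices.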

@terrychenism

same problem...

@rbgirshick
Contributor

rbgirshick commented Jan 29, 2018

@gaopeng-eugene, @terrychenism: can you try switching to the NCCL implementation of AllReduce to see if that resolves the problem? Instructions for building Caffe2 with NCCL support and enabling NCCL in Detectron can be found in #32.
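
As a rough illustration of what enabling NCCL might look like on the command line from the original report: the sketch below assumes Detectron exposes a `USE_NCCL` config key that can be overridden the same way `OUTPUT_DIR` is, and that Caffe2 was already built with NCCL support. `build_train_command` is a hypothetical helper, and #32 remains the authoritative source for the actual steps.

```python
def build_train_command(cfg_path, output_dir, use_nccl=True):
    """Assemble the training command as an argument list.

    Hypothetical helper for illustration. The USE_NCCL override is an
    assumption about Detectron's config; see #32 for the real instructions.
    """
    cmd = ["python2", "tools/train_net.py", "--cfg", cfg_path]
    if use_nccl:
        # Assumed config override, passed as a KEY VALUE pair like OUTPUT_DIR.
        cmd += ["USE_NCCL", "True"]
    cmd += ["OUTPUT_DIR", output_dir]
    return cmd


cmd = build_train_command(
    "configs/getting_started/tutorial_8gpu_e2e_faster_rcnn_R-50-FPN.yaml",
    "/tmp/detectron-output",
)
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd)` from the Detectron root directory; printing it first makes it easy to check the overrides before launching a long run.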

@terrychenism

Fixed it by using NCCL.
