
Error with unet training. Capabilities problem? #4

Closed
mcblache opened this issue Apr 24, 2024 · 2 comments

Comments

@mcblache

Hello,

While using biapy-gui to train a unet network, the training stopped unexpectedly with this error message:

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

However, we do have an NVIDIA GPU that is correctly configured and correctly detected by BiaPy:

nvidia-smi 
Wed Apr 24 16:57:38 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T600 Lap...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8     2W /  35W |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
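nvidia-smi on the host only shows that the driver itself works; the step that fails here is the Docker daemon's GPU passthrough ("could not select device driver ... with capabilities: [[gpu]]"). A quick way to exercise that passthrough independently of BiaPy is through the same Docker Python SDK the GUI uses (a minimal sketch; the CUDA base image tag is only an example):

import docker
from docker.types import DeviceRequest

client = docker.from_env()
# If the Docker daemon cannot reach the NVIDIA Container Toolkit, this raises the
# same APIError: could not select device driver "" with capabilities: [[gpu]]
output = client.containers.run(
    "nvidia/cuda:11.8.0-base-ubuntu22.04",  # example tag; any CUDA base image will do
    command="nvidia-smi",
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(output.decode())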

On another computer with a different card (GeForce RTX 3090), with the same installation and the same NVIDIA driver, the unet training works correctly.

Same installation, same NVIDIA driver, but different compute capabilities!

  • NVIDIA T600 Laptop GPU => compute capability 7.5
  • NVIDIA GeForce RTX 3090 => compute capability 8.6

cf. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications.

We suspect that you use bfloat16, which is unavailable on cards with compute capability 7.x.
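As a quick check, the compute capability and bfloat16 support can be queried from PyTorch inside the container (a minimal sketch, assuming the BiaPy image ships PyTorch with CUDA support):

import torch

# Compute capability of the first GPU, e.g. (7, 5) for the T600 or (8, 6) for the RTX 3090
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# True only when the current GPU/driver/PyTorch combination can run bfloat16 kernels
print("bfloat16 supported:", torch.cuda.is_bf16_supported())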

Thanks

Here is the log:

Using biapyx/biapy:latest-11.8 container
Local GUI version: v1.0.6
Remote last version's hash: ffb24581dc7263a1aebbe076df443de37709ebf5
Remote last version: v1.0.6
Loaded: {'AUGMENTOR': {'ENABLE': False}, 'DATA': {'EXTRACT_RANDOM_PATCH': False, 'FORCE_RGB': True, 'PATCH_SIZE': '(256, 256, 3)', 'REFLECT_TO_COMPLETE_SHAPE': True, 'TEST': {'ARGMAX_TO_OUTPUT': True, 'CHECK_DATA': True, 'IN_MEMORY': True, 'LOAD_GT': False, 'OVERLAP': '(0,0)', 'PADDING': '(64, 64)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images', 'RESOLUTION': '(1,1)'}, 'TRAIN': {'CHECK_DATA': True, 'GT_PATH': '/home/mcblache/prj/pepper/d10/test/masks', 'IN_MEMORY': True, 'MINIMUM_FOREGROUND_PER': 0.05, 'OVERLAP': '(0,0)', 'PADDING': '(0,0)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images'}, 'VAL': {'FROM_TRAIN': True, 'RANDOM': True, 'RESOLUTION': '(1,1)', 'SPLIT_TRAIN': 0.1}}, 'MODEL': {'ARCHITECTURE': 'unet', 'DROPOUT_VALUES': [0.0, 0.0, 0.0, 0.0, 0.0], 'FEATURE_MAPS': [16, 32, 64, 128, 256]}, 'PROBLEM': {'NDIM': '2D', 'SEMANTIC_SEG': {'IGNORE_CLASS_ID': '0'}, 'TYPE': 'SEMANTIC_SEG'}, 'SYSTEM': {'NUM_CPUS': -1, 'NUM_WORKERS': 0, 'SEED': 0}, 'TEST': {'ENABLE': True, 'EVALUATE': True, 'VERBOSE': True}, 'TRAIN': {'ACCUM_ITER': 1, 'BATCH_SIZE': 2, 'ENABLE': True, 'EPOCHS': 10, 'LR': 0.001, 'LR_SCHEDULER': {'NAME': 'onecycle'}, 'OPTIMIZER': 'ADAMW', 'OPT_BETAS': '(0.9, 0.999)', 'PATIENCE': 2, 'W_DECAY': 0.02}}
Setting AUGMENTOR__ENABLE__INPUT : No (ENABLE)
...
Possible expected error during closing spin window: Internal C++ object (load_yaml_to_GUI_engine) already deleted.
Creating YAML file
{'status': 'Pulling from biapyx/biapy', 'id': 'latest-11.8'}
{'status': 'Digest: sha256:5b55f044be436fd00a82dd51b6f89e6411154bb247ab5041fb80179b33a5323d'}
{'status': 'Status: Image is up to date for biapyx/biapy:latest-11.8'}
Creating temporal input YAML file
Command: ['--config', '/BiaPy_files/input.yaml', '--result_dir', '/home/mcblache/prj/pepper/output', '--name', 'my_2d_semantic_segmentation', '--run_id', '1', '--dist_backend', 'nccl', '--gpu', '0']
Volumes:  {'/home/mcblache/prj/pepper/output/my_2d_semantic_segmentation/input_config/input20240424_163742.yaml': {'bind': '/BiaPy_files/input.yaml', 'mode': 'ro'}, '/home/mcblache/prj/pepper/output': {'bind': '/home/mcblache/prj/pepper/output', 'mode': 'rw'}, '/home/mcblache/prj/pepper/d10/test': {'bind': '/home/mcblache/prj/pepper/d10/test', 'mode': 'ro'}}
GPU (IDs): 0
CPUs: 5
GUI version: v1.0.6
Traceback (most recent call last):
  File "docker/api/client.py", line 268, in _raise_for_status
  File "requests/models.py", line 1021, in raise_for_status
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_functions.py", line 434, in run
  File "docker/models/containers.py", line 854, in run
  File "docker/models/containers.py", line 405, in start
  File "docker/utils/decorators.py", line 19, in wrapped
  File "docker/api/container.py", line 1126, in start
  File "docker/api/client.py", line 270, in _raise_for_status
  File "docker/errors.py", line 39, in create_api_error_from_http_exception
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")
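For context, the "[[gpu]]" in the message comes from the GPU device request that the GUI passes to the Docker Python SDK, roughly as sketched below (hypothetical; we have not inspected run_functions.py). The daemon then has to find a device driver advertising the "gpu" capability, i.e. the NVIDIA Container Toolkit, when it starts the container; when it cannot, /containers/<id>/start fails with the 500 error above.

from docker.types import DeviceRequest

# Sketch of the GPU request behind "--gpu 0"; the capabilities field is what the
# daemon echoes back as [[gpu]] in its error message.
gpu_request = DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])

# containers.run(image, command, device_requests=[gpu_request], ...) makes the daemon
# look for a driver that provides the "gpu" capability when starting the container.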
@danifranco
Collaborator

Hello,

Thank you for reporting this error. We will look into it carefully so we can fix it, avoid it, or warn users with such GPUs in future GUI releases.

Cheers,

@danifranco
Collaborator

Hello,

I've just found an interesting discussion on this problem where a few solutions and links to helpful tutorials are provided. Most of the time, it seems that a simple sudo systemctl restart docker does the trick.
