
Error with unet training. Capabilities problem? #4

Closed
mcblache opened this issue Apr 24, 2024 · 2 comments

Comments

@mcblache

Hello,

While using biapy-gui to train a unet network, the training stopped unexpectedly with this error message:

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

However, we do have an NVIDIA GPU that is correctly configured and correctly detected by BiaPy:

nvidia-smi 
Wed Apr 24 16:57:38 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T600 Lap...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8     2W /  35W |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
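nvidia-smi on the host only shows that the driver itself works; the step that fails here is the Docker daemon's GPU passthrough ("could not select device driver ... with capabilities: [[gpu]]"). A quick way to exercise that passthrough independently of BiaPy is through the same Docker Python SDK the GUI uses (a minimal sketch; the CUDA base image tag is only an example):

import docker
from docker.types import DeviceRequest

client = docker.from_env()
# If the Docker daemon cannot reach the NVIDIA Container Toolkit, this raises the
# same APIError: could not select device driver "" with capabilities: [[gpu]]
output = client.containers.run(
    "nvidia/cuda:11.8.0-base-ubuntu22.04",  # example tag; any CUDA base image will do
    command="nvidia-smi",
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(output.decode())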

On another computer with a different card (GeForce RTX 3090), with the same installation and the same NVIDIA driver, the unet training works correctly.

Same installation, same NVIDIA driver, but different compute capabilities!

  • NVIDIA T600 Laptop GPU => compute capability 7.5
  • NVIDIA GeForce RTX 3090 => compute capability 8.6

cf. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications.

We suspect that you use bfloat16, which is unavailable on cards with compute capability 7.x.
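As a quick check, the compute capability and bfloat16 support can be queried from PyTorch inside the container (a minimal sketch, assuming the BiaPy image ships PyTorch with CUDA support):

import torch

# Compute capability of the first GPU, e.g. (7, 5) for the T600 or (8, 6) for the RTX 3090
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# True only when the current GPU/driver/PyTorch combination can run bfloat16 kernels
print("bfloat16 supported:", torch.cuda.is_bf16_supported())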

Thanks

Here is the log:

Using biapyx/biapy:latest-11.8 container
Local GUI version: v1.0.6
Remote last version's hash: ffb24581dc7263a1aebbe076df443de37709ebf5
Remote last version: v1.0.6
Loaded: {'AUGMENTOR': {'ENABLE': False}, 'DATA': {'EXTRACT_RANDOM_PATCH': False, 'FORCE_RGB': True, 'PATCH_SIZE': '(256, 256, 3)', 'REFLECT_TO_COMPLETE_SHAPE': True, 'TEST': {'ARGMAX_TO_OUTPUT': True, 'CHECK_DATA': True, 'IN_MEMORY': True, 'LOAD_GT': False, 'OVERLAP': '(0,0)', 'PADDING': '(64, 64)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images', 'RESOLUTION': '(1,1)'}, 'TRAIN': {'CHECK_DATA': True, 'GT_PATH': '/home/mcblache/prj/pepper/d10/test/masks', 'IN_MEMORY': True, 'MINIMUM_FOREGROUND_PER': 0.05, 'OVERLAP': '(0,0)', 'PADDING': '(0,0)', 'PATH': '/home/mcblache/prj/pepper/d10/test/images'}, 'VAL': {'FROM_TRAIN': True, 'RANDOM': True, 'RESOLUTION': '(1,1)', 'SPLIT_TRAIN': 0.1}}, 'MODEL': {'ARCHITECTURE': 'unet', 'DROPOUT_VALUES': [0.0, 0.0, 0.0, 0.0, 0.0], 'FEATURE_MAPS': [16, 32, 64, 128, 256]}, 'PROBLEM': {'NDIM': '2D', 'SEMANTIC_SEG': {'IGNORE_CLASS_ID': '0'}, 'TYPE': 'SEMANTIC_SEG'}, 'SYSTEM': {'NUM_CPUS': -1, 'NUM_WORKERS': 0, 'SEED': 0}, 'TEST': {'ENABLE': True, 'EVALUATE': True, 'VERBOSE': True}, 'TRAIN': {'ACCUM_ITER': 1, 'BATCH_SIZE': 2, 'ENABLE': True, 'EPOCHS': 10, 'LR': 0.001, 'LR_SCHEDULER': {'NAME': 'onecycle'}, 'OPTIMIZER': 'ADAMW', 'OPT_BETAS': '(0.9, 0.999)', 'PATIENCE': 2, 'W_DECAY': 0.02}}
Setting AUGMENTOR__ENABLE__INPUT : No (ENABLE)
...
Possible expected error during closing spin window: Internal C++ object (load_yaml_to_GUI_engine) already deleted.
Creating YAML file
{'status': 'Pulling from biapyx/biapy', 'id': 'latest-11.8'}
{'status': 'Digest: sha256:5b55f044be436fd00a82dd51b6f89e6411154bb247ab5041fb80179b33a5323d'}
{'status': 'Status: Image is up to date for biapyx/biapy:latest-11.8'}
Creating temporal input YAML file
Command: ['--config', '/BiaPy_files/input.yaml', '--result_dir', '/home/mcblache/prj/pepper/output', '--name', 'my_2d_semantic_segmentation', '--run_id', '1', '--dist_backend', 'nccl', '--gpu', '0']
Volumes:  {'/home/mcblache/prj/pepper/output/my_2d_semantic_segmentation/input_config/input20240424_163742.yaml': {'bind': '/BiaPy_files/input.yaml', 'mode': 'ro'}, '/home/mcblache/prj/pepper/output': {'bind': '/home/mcblache/prj/pepper/output', 'mode': 'rw'}, '/home/mcblache/prj/pepper/d10/test': {'bind': '/home/mcblache/prj/pepper/d10/test', 'mode': 'ro'}}
GPU (IDs): 0
CPUs: 5
GUI version: v1.0.6
Traceback (most recent call last):
  File "docker/api/client.py", line 268, in _raise_for_status
  File "requests/models.py", line 1021, in raise_for_status
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_functions.py", line 434, in run
  File "docker/models/containers.py", line 854, in run
  File "docker/models/containers.py", line 405, in start
  File "docker/utils/decorators.py", line 19, in wrapped
  File "docker/api/container.py", line 1126, in start
  File "docker/api/client.py", line 270, in _raise_for_status
  File "docker/errors.py", line 39, in create_api_error_from_http_exception
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.45/containers/f4db26e13419ef76836e1dcc5f895815ae96a4770f38464ef27d1bf71e31dc20/start: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")
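For context, the "[[gpu]]" in the message comes from the GPU device request that the GUI passes to the Docker Python SDK, roughly as sketched below (hypothetical; we have not inspected run_functions.py). The daemon then has to find a device driver advertising the "gpu" capability, i.e. the NVIDIA Container Toolkit, when it starts the container; when it cannot, /containers/<id>/start fails with the 500 error above.

from docker.types import DeviceRequest

# Sketch of the GPU request behind "--gpu 0"; the capabilities field is what the
# daemon echoes back as [[gpu]] in its error message.
gpu_request = DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])

# containers.run(image, command, device_requests=[gpu_request], ...) makes the daemon
# look for a driver that provides the "gpu" capability when starting the container.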
@danifranco
Collaborator

Hello,

Thank you for reporting this error. We will look into it carefully so we can fix it, avoid it, or warn users with such GPUs in future GUI releases.

Cheers,

@danifranco
Collaborator

Hello,

I've just found an interesting discussion on this problem where a few solutions and links to helpful tutorials are provided. Most of the time, it seems that a simple sudo systemctl restart docker does the trick.
