issue with docker image #707

Closed
jrinck opened this issue Aug 20, 2023 · 11 comments

@jrinck

jrinck commented Aug 20, 2023

I built the Docker image using the instructions here: https://github.com/h2oai/h2ogpt/blob/main/docs/README_DOCKER.md

I use the command to run the image and I am getting this error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Anyone have any ideas on how to solve this?

@arnocandel
Member

make sure that the nvidia container install went through without errors
https://github.com/h2oai/h2ogpt/blob/main/docs/README_DOCKER.md#setup-docker-for-gpu-inference

@achraf-mer
Collaborator

Likely an issue with the NVIDIA Container Toolkit. Could you please try reinstalling it, and make sure to restart dockerd afterwards? This method seemed to work for lots of people here: NVIDIA/nvidia-docker#1034 (comment)
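
A minimal sketch of a check that could come before a reinstall, assuming a Linux host with a POSIX-style shell. The function name `toolkit_cli_present` is made up for illustration; `nvidia-container-cli` is the command-line tool that is installed alongside the NVIDIA Container Toolkit. The restart line in the comment assumes a systemd-managed Docker daemon, which may not match every setup.

```shell
# Hypothetical helper (not from the h2oGPT docs): succeeds if the given
# command is available. Defaults to nvidia-container-cli.
toolkit_cli_present() {
    local cli="${1:-nvidia-container-cli}"
    command -v "$cli" >/dev/null 2>&1
}

# Example use, left commented out so that sourcing this file changes nothing:
# if toolkit_cli_present; then
#     sudo systemctl restart docker   # assumption: Docker is managed by systemd
# else
#     echo "nvidia-container-cli not found; reinstall the NVIDIA Container Toolkit" >&2
# fi
```

If the CLI is missing, the toolkit install probably did not complete, and restarting the daemon alone is unlikely to help.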

@jrinck
Author

jrinck commented Aug 21, 2023

I had followed the instructions to install the NVIDIA container toolkit as mentioned above. I may have missed an error when I did that initially, so I re-ran them to make sure. Now I get a new set of errors.

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Any ideas?

@achraf-mer
Collaborator

achraf-mer commented Aug 21, 2023

@jrinck can you please make sure the Docker server and client are at the same version? I see comments here with a similar issue, and the resolution suggests either reinstalling Docker or making sure the versions are consistent: NVIDIA/nvidia-container-toolkit#250 (see two comments up).
If the issue persists, may I suggest opening an issue in the nvidia-docker repo? We can probably get to a resolution quicker with the GPU install.
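
One way the client/server comparison could be scripted is sketched below. It assumes the `docker` CLI is on the PATH and relies on the `--format` option of `docker version`; the function name `docker_versions_match` is hypothetical.

```shell
# Hypothetical helper: print the Docker client and server versions and
# succeed only if both are known and identical.
docker_versions_match() {
    local client server
    client="$(docker version --format '{{.Client.Version}}' 2>/dev/null)"
    server="$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
    echo "client=${client:-unknown} server=${server:-unknown}"
    [ -n "$client" ] && [ "$client" = "$server" ]
}

# Example use (commented out so sourcing this file has no side effects):
# docker_versions_match || echo "client and server differ, or the daemon is unreachable" >&2
```

An empty server version usually means the daemon could not be reached, for example when `sudo` is needed to talk to it.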

@pseudotensor
Collaborator

@achraf-mer why would one need cudnn installed outside docker image? I don't think that should be required.

@achraf-mer
Collaborator

> @achraf-mer why would one need cudnn installed outside docker image? I don't think that should be required.

It may not be required, but if it is installed it should be the latest version. I can't find the link I read earlier, but it did mention cudnn as a possible culprit (at least things worked after upgrading).

@pseudotensor
Collaborator

I just don't think that's necessarily true, and we shouldn't be having people update cudnn on bare metal when it's not required.

One may need to update the NVIDIA drivers, but that's it.
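
A minimal sketch of a host-side driver check, under the assumption that the missing `libnvidia-ml.so.1` from the error above is the NVML library, which normally ships with the NVIDIA driver rather than with the container toolkit. The function name `host_has_library` is hypothetical, and `ldconfig` may live in `/sbin` and not be on a non-root user's PATH.

```shell
# Hypothetical helper: succeed if the dynamic linker cache on the host
# lists the given library. Defaults to libnvidia-ml.so.1.
host_has_library() {
    local lib="${1:-libnvidia-ml.so.1}"
    ldconfig -p 2>/dev/null | grep -qF "$lib"
}

# Example use (commented out so sourcing this file has no side effects):
# host_has_library || echo "libnvidia-ml.so.1 not visible on the host; check the NVIDIA driver install" >&2
```

Running `nvidia-smi` directly on the host, outside any container, is a complementary check: if it fails there, the problem is likely the driver and not Docker.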

@achraf-mer
Collaborator

achraf-mer commented Aug 21, 2023

> I just don't think that's necessarily true, and we shouldn't be having people update cudnn on bare metal when it's not required.
>
> One may need to update the NVIDIA drivers, but that's it.

OK, I removed my suggestion from the original comment.

@jrinck
Author

jrinck commented Aug 22, 2023

I did notice that the instructions here

https://github.com/h2oai/h2ogpt/blob/08abc0cd6d57ea66995255fc8dfb9e2faae688ff/docs/README_DOCKER.md#setup-docker-for-gpu-inference

use a different distribution repository than these instructions

NVIDIA/nvidia-docker#1034 (comment)

Basically

https://nvidia.github.io/libnvidia-container

vs

https://nvidia.github.io/nvidia-docker

I have tried both and still end up with the same error

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
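
To see which of the two repositories mentioned above is actually configured on a machine, something like the following could be used. This is only a sketch and assumes a Debian/Ubuntu-style host that keeps apt source files under `/etc/apt/sources.list.d` (the distro was never stated in this thread); the function name `list_nvidia_apt_sources` is hypothetical.

```shell
# Hypothetical helper: print apt source lines that point at either NVIDIA
# repository. The directory argument defaults to the usual apt location.
list_nvidia_apt_sources() {
    local dir="${1:-/etc/apt/sources.list.d}"
    grep -rhE 'nvidia\.github\.io/(libnvidia-container|nvidia-docker)' "$dir" 2>/dev/null
}

# Example use (commented out so sourcing this file has no side effects):
# list_nvidia_apt_sources || echo "no NVIDIA apt repository configured" >&2
```

Seeing both repositories listed at once may be worth cleaning up before reinstalling, since it makes it harder to tell which packages were installed from where.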

I have a POC tomorrow and would love to get this solved.

I have h2oGPT running, but only on CPUs, and it is slow.

@achraf-mer
Collaborator

achraf-mer commented Aug 22, 2023

@jrinck can you please post your Docker version and which Linux distro you are using?

Also, did you attempt to run the container with sudo? Perhaps you initially installed Docker and the NVIDIA toolkit with sudo. Example to run:

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Also, about the instructions: the definitive install steps are at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#setting-up-nvidia-container-toolkit, which is the source we used for README_DOCKER.md.
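
The details requested above could be gathered in one go with a small script like the sketch below. It assumes the distro provides an `/etc/os-release` file in the usual shell-sourceable format and that the `docker` CLI is on the PATH; the function names `distro_name` and `bug_report` are hypothetical.

```shell
# Hypothetical helper: print the distro name from an os-release style file.
distro_name() {
    local file="${1:-/etc/os-release}"
    # Subshell so the sourced variables do not leak into the caller.
    ( . "$file" 2>/dev/null && printf '%s\n' "${PRETTY_NAME:-unknown}" )
}

# Hypothetical helper: print the two facts asked for in this thread.
bug_report() {
    echo "distro: $(distro_name "$1" || echo unknown)"
    echo "docker: $(docker --version 2>/dev/null || echo unavailable)"
}

# Example use (commented out so sourcing this file has no side effects):
# bug_report
```

Pasting this output together with the result of the `nvidia-smi` smoke test above would give maintainers most of what they asked for.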

@jrinck
Author

jrinck commented Aug 24, 2023

Sorry for the delayed response. My demo/POC was the other day. I had to use a VM that only had CPUs. It was pretty slow. I deleted the other VM that had GPUs so I do not have the docker version.

I am going to rebuild next week. Let's close this and we can pick it up after I rebuild.
