GPU is not detected in R, but appears in python. #1456

Closed
evanliu3594 opened this issue Jun 11, 2024 · 9 comments

Comments

@evanliu3594

Hi there,

I recently started moving my training environment to WSL2 to keep pace with keras3.

After following the installation guide, I successfully installed TensorFlow into my conda environment with the command:

keras3::install_keras(envname = "~/pyEnv/keras", backend = "tensorflow",  gpu = T)

However, when I checked tf$config in R, I found that the GPU was not detected.

> tf$config$list_physical_devices()
2024-06-12 02:16:24.128849: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-12 02:16:24.668747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-12 02:16:25.456112: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[[1]]
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')

I tested some code and Keras worked just fine on the CPU.

Then I turned to Python to get more details. Surprisingly, the GPU just showed up there.

evan@DESKTOP-KGBNUBC:~$ conda activate keras
(/home/evan/pyEnv/keras) evan@DESKTOP-KGBNUBC:~$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-06-12 02:21:15.036500: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-12 02:21:15.538230: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-12 02:21:16.242746: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-12 02:21:16.271831: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-12 02:21:16.271904: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

I googled for a while and found nothing similar to this. Is it that I shouldn't install TF into a conda environment?

Thanks in advance for any advice.

Session info is here:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tensorflow_2.16.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12       lattice_0.20-45   png_0.1-8         withr_3.0.0       zeallot_0.1.0     rappdirs_0.3.3   
 [7] R6_2.5.1          grid_4.1.2        lifecycle_1.0.4   jsonlite_1.8.8    magrittr_2.0.3    tfruns_1.5.3     
[13] rlang_1.1.4       cli_3.6.2         fs_1.6.4          rstudioapi_0.16.0 whisker_0.4.1     keras3_1.0.0     
[19] Matrix_1.4-0      reticulate_1.37.0 generics_0.1.3    keras_2.15.0      tools_4.1.2       glue_1.7.0       
[25] compiler_4.1.2    base64enc_0.1-3  
@t-kalinowski
Member

Can you confirm that the R session is indeed finding the correct python env? What is the output of reticulate::py_config()?

@evanliu3594
Author

> Can you confirm that the R session is indeed finding the correct python env? What is the output of reticulate::py_config()?

Yes, I only created one conda env, called keras.

> reticulate::py_config()
python:         /home/evan/pyEnv/keras/bin/python
libpython:      /home/evan/pyEnv/keras/lib/libpython3.11.so
pythonhome:     /home/evan/pyEnv/keras:/home/evan/pyEnv/keras
version:        3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
numpy:          /home/evan/pyEnv/keras/lib/python3.11/site-packages/numpy
numpy_version:  1.26.4
keras:          /home/evan/pyEnv/keras/lib/python3.11/site-packages/keras

NOTE: Python version was forced by use_python() function
> tf$config$list_physical_devices()
[[1]]
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')

@t-kalinowski
Member

What a curious bug, thanks for reporting.

Just to rule some things out:

  • Do you have any startup code in .Rprofile or .Renviron that might be interfering with GPU visibility? What is the output from Sys.getenv("CUDA_VISIBLE_DEVICES") in R?
  • Does the same happen outside conda? Can you try with a venv and see if things work that way?
R -q -e 'keras3::install_keras()'
R -q -e 'library(reticulate); use_virtualenv("r-keras"); import("tensorflow")$config$list_physical_devices()'

@evanliu3594
Author

evanliu3594 commented Jun 11, 2024

> What a curious bug, thanks for reporting.
>
> Just to rule some things out:
>
>   • Do you have any startup code in .Rprofile or .Renviron that might be interfering with GPU visibility? What is the output from Sys.getenv("CUDA_VISIBLE_DEVICES") in R?
>   • Does the same happen outside conda? Can you try with a venv and see if things work that way?
> R -q -e 'keras3::install_keras()'
> R -q -e 'library(reticulate); use_virtualenv("r-keras"); import("tensorflow")$config$list_physical_devices()'

Thanks for the reply.
I only use .Rprofile to set the CRAN repo to a nearer mirror to speed up downloads, so it is quite clean.

> Sys.getenv("CUDA_VISIBLE_DEVICES")
[1] ""

I tried the shell commands to install keras, and it ended up the same.

evan@DESKTOP-KGBNUBC:~$ R -q -e 'library(reticulate); use_virtualenv("r-keras"); import("tensorflow")$config$list_physical_devices()'
> library(reticulate); use_virtualenv("r-keras"); import("tensorflow")$config$list_physical_devices()
2024-06-12 03:25:20.236712: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-12 03:25:20.782594: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-12 03:25:21.546573: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[[1]]
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')

One difference is that in this run, install_keras() with default params did not install the CUDA packages or the GPU build of TensorFlow (tensorflow-cpu was installed instead).
It seems R truly did not detect my GPU.

evan@DESKTOP-KGBNUBC:~$ source .virtualenvs/r-keras/bin/activate
(r-keras) evan@DESKTOP-KGBNUBC:~$ pip list | grep tensor
tensorboard                  2.16.2
tensorboard-data-server      0.7.2
tensorflow-cpu               2.16.1
tensorflow-datasets          4.9.6
tensorflow-io-gcs-filesystem 0.37.0
tensorflow-metadata          1.15.0
(r-keras) evan@DESKTOP-KGBNUBC:~$ pip list | grep cuda
(r-keras) evan@DESKTOP-KGBNUBC:~$
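
(As an aside, not something run in this thread: a GPU-enabled pip install of TF 2.16 is normally requested with the "and-cuda" extra, which pulls in the nvidia-* wheels; shown here only to contrast with the tensorflow-cpu install above.)

pip install 'tensorflow[and-cuda]'
pip list | grep -E 'tensorflow|nvidia'   # would then list nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.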

I dug a little and found that lspci can't see the GPU in the WSL2 Ubuntu 22.04, but nvidia-smi works.

evan@DESKTOP-KGBNUBC:~$ lspci
4d66:00:00.0 SCSI storage controller: Red Hat, Inc. Virtio console (rev 01)
6e30:00:00.0 System peripheral: Red Hat, Inc. Virtio file system (rev 01)
d98b:00:00.0 3D controller: Microsoft Corporation Device 008e
evan@DESKTOP-KGBNUBC:~$ nvidia-smi
Wed Jun 12 03:43:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 39%   37C    P8             10W /  160W |    1657MiB /   8188MiB |     20%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        66      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

Then I installed keras again with gpu = TRUE, which unsurprisingly resulted in the same problem as before: the GPU is missing in R, but appears in Python. 🤦‍♂️

@t-kalinowski
Member

I'll try to get on a Windows machine tomorrow and see if I can reproduce.

@evanliu3594
Author

evanliu3594 commented Jun 12, 2024

Just an update on what I've tried.

After a whole system reinstall (including the WSL Ubuntu), I found that I couldn't see the GPU in Python either.
Sorry for leaving this out earlier, but only then did I recall that before using the R function install_keras(), I had used pip to install the tensorflow package and had added some lines to the conda activate.d bash script to add the NVIDIA libraries to the environment.

NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")))

for dir in $NVIDIA_DIR/*; do
    if [ -d "$dir/lib" ]; then
        export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
    fi
done
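
(For context: conda sources every *.sh file under the env's etc/conda/activate.d/ directory on activation, so these lines presumably live in a file like the one below; the exact path and file name are assumptions based on the env shown earlier, not something stated above.)

# Hypothetical location for the snippet above; conda runs it on "conda activate keras"
mkdir -p ~/pyEnv/keras/etc/conda/activate.d
nano ~/pyEnv/keras/etc/conda/activate.d/nvidia_paths.sh   # paste the lines above into this file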

I'm not sure whether this is what makes Python able to see the GPU, but it apparently does not affect R.

@t-kalinowski
Member

Thanks, I can reproduce. This seems to be specific to TF 2.16; the GPU is visible with an identical setup using TF 2.15.

It seems that we need to do some more work on WSL to help TensorFlow discover the NVIDIA shared libraries. (Note: we already work around some deficiencies by creating symlinks to the NVIDIA shared libraries in the TensorFlow virtualenv. This works on Linux, but is apparently not sufficient on WSL.)

For now, you can fix it by running this in WSL before starting the R session (or by setting the env vars in the R session before reticulate has initialized Python).

#!/bin/sh

# Store original LD_LIBRARY_PATH 
export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" 

# Get the CUDNN directory 
CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))

# Set LD_LIBRARY_PATH to include CUDNN directory
export LD_LIBRARY_PATH=$(find ${CUDNN_DIR}/*/lib/ -type d -printf "%p:")${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Get the ptxas directory  
PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))

# Set PATH to include the directory containing ptxas
export PATH=$(find ${PTXAS_DIR}/*/bin/ -type d -printf "%p:")${PATH:+:${PATH}}

from: https://discuss.tensorflow.org/t/what-versions-of-cuda-and-cudnn-are-required-for-tensorflow-2-16/24711/3
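
For example (a minimal sketch, assuming the script above is saved as ~/tf-gpu-env.sh; the nvidia.* imports in it need a Python on PATH that has those wheels installed, hence activating the virtualenv first):

source ~/.virtualenvs/r-keras/bin/activate
source ~/tf-gpu-env.sh
R -q -e 'library(reticulate); use_virtualenv("r-keras"); import("tensorflow")$config$list_physical_devices()'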

Note, there is nothing specific to conda here. We still recommend using a virtualenv if possible.

I'll push an update soon making sure that the R package does this work so users don't have to.

@evanliu3594
Author

> #!/bin/sh
>
> # Store original LD_LIBRARY_PATH
> export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
>
> # Get the CUDNN directory
> CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))
>
> # Set LD_LIBRARY_PATH to include CUDNN directory
> export LD_LIBRARY_PATH=$(find ${CUDNN_DIR}/*/lib/ -type d -printf "%p:")${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
>
> # Get the ptxas directory
> PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))
>
> # Set PATH to include the directory containing ptxas
> export PATH=$(find ${PTXAS_DIR}/*/bin/ -type d -printf "%p:")${PATH:+:${PATH}}

Thanks a lot! That saves me from learning python again...😂

@t-kalinowski
Member

This is fixed on main now; the workaround should no longer be necessary. Please install the development version and reinstall keras + tensorflow to test it out.

remotes::install_github("rstudio/keras3")
keras3::install_keras()
# new R session
library(keras3) # load hook hints to reticulate to use_virtualenv("r-keras")
tensorflow::tf$config$list_physical_devices()
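
After the reinstall, the same one-liner pattern used earlier in this thread should now list a GPU device in addition to the CPU, for example:

R -q -e 'library(keras3); tensorflow::tf$config$list_physical_devices()'
# expected to include, by analogy with the Python output above:
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')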
