Update GPU operator #44

Merged
ktsakalozos merged 1 commit into main from MK-553/update-gpu-addon
Jun 8, 2022

neoaggelos
Contributor

Summary

Update GPU operator to version 1.10.1

Closes #34
Closes canonical/microk8s#3218

Testing

Installed and tested on an AWS g3s.xlarge instance. Without this change, the GPG key error occurs. With this change, the driver is built and loaded properly:

root@ip-172-31-12-67:/home/ubuntu# microk8s.kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-2thc7
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-510.47.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 510.47.03 for Linux kernel version 5.13.0-1025-aws

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.13.0-1025-aws
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
/usr/src/nvidia-510.47.03/kernel/nvidia-uvm/uvm.c: In function 'uvm_mmap':
/usr/src/nvidia-510.47.03/kernel/nvidia-uvm/uvm.c:887:1: warning: label 'out_va_space_unlock' defined but not used [-Wunused-label]
  887 | out_va_space_unlock:
      | ^~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia/nv-dma.c:986: warning: "IMPORT_SGT_STUBS_NEEDED" redefined
  986 | #define IMPORT_SGT_STUBS_NEEDED 0
      | 
/usr/src/nvidia-510.47.03/kernel/nvidia/nv-dma.c:980: note: this is the location of the previous definition
  980 | #define IMPORT_SGT_STUBS_NEEDED 1
      | 
/usr/src/nvidia-510.47.03/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
/usr/src/nvidia-510.47.03/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  445 |     int status = 0;
      |     ^~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
   98 |     bool overlay_event = false;
      |          ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
   97 |     bool primary_event = false;
      |          ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
   96 |     struct drm_plane *primary_plane = crtc->primary;
      |                       ^~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c:81:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
   81 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c:80:27: warning: unused variable 'nv_dev' [-Wunused-variable]
   80 |     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
      |                           ^~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c:182:9: warning: unused variable 'ret' [-Wunused-variable]
  182 |     int ret = 0;
      |         ^~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c:497:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
  497 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
/usr/src/nvidia-510.47.03/kernel/nvidia-drm/nvidia-drm-crtc.c:1141:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
 1141 |             struct drm_plane *overlay_plane =
      |             ^~~~~~
Skipping BTF generation for /usr/src/nvidia-510.47.03/kernel/nvidia-peermem.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-510.47.03/kernel/nvidia-modeset.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-510.47.03/kernel/nvidia-drm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-510.47.03/kernel/nvidia.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-510.47.03/kernel/nvidia-uvm.ko due to unavailability of vmlinux
Relinking NVIDIA driver kernel modules...
Building NVIDIA driver package nvidia-modules-5.13.0-1025...
Installing NVIDIA driver kernel modules...

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.


ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)


Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 4 CPUs online; setting concurrency level to 4.
Installing NVIDIA driver version 510.47.03.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.13.0-1025-aws/build'

Kernel output path: '/lib/modules/5.13.0-1025-aws/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules
  : [##############################] 100%
Kernel module compilation complete.
Unable to determine if Secure Boot is enabled: No such file or directory
Kernel messages:
[  615.835270] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved.
[  617.564751] Initializing XFRM netlink socket
[  620.917924] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  620.917958] IPv6: ADDRCONF(NETDEV_CHANGE): cali2f22ed99be4: link becomes ready
[  887.124147] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  887.124178] IPv6: ADDRCONF(NETDEV_CHANGE): cali1356407a1bc: link becomes ready
[  962.524126] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  962.524179] IPv6: ADDRCONF(NETDEV_CHANGE): cali86fa53a1dfb: link becomes ready
[  962.691155] IPv6: ADDRCONF(NETDEV_CHANGE): cali749c4d623d1: link becomes ready
[  962.789012] IPv6: ADDRCONF(NETDEV_CHANGE): cali6b6334de78b: link becomes ready
[  973.792945] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  973.793048] IPv6: ADDRCONF(NETDEV_CHANGE): cali8f8a6776d17: link becomes ready
[  973.902383] IPv6: ADDRCONF(NETDEV_CHANGE): cali5f1d0dcf998: link becomes ready
[ 1211.143277] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1211.159008] nvidia-nvlink: Nvlink Core is being initialized, major device number 511

[ 1211.161808] xen: --> pirq=32 -> irq=43 (gsi=43)
[ 1211.161934] nvidia 0000:00:1e.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1211.162637] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.47.03  Mon Jan 24 22:58:54 UTC 2022
[ 1211.201555] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 1211.205996] nvidia-uvm: Loaded the UVM driver, major device number 509.
[ 1211.213235] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  510.47.03  Mon Jan 24 22:51:43 UTC 2022
[ 1211.218669] nvidia-modeset: Unloading
[ 1211.238459] nvidia-uvm: Unloaded the UVM driver.
[ 1211.259234] nvidia-nvlink: Unregistered the Nvlink Core, major device number 511
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (510.47.03):
  Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
  Checking: [##############################] 100%
Post-install sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 510.47.03) is now complete.

Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
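
To check a driver container log like the one above without reading it line by line, a small filter along these lines might help. This is only a sketch: the function names `driver_log_status` and `count_installer_errors` are made up for illustration, and treating the final "Done, now waiting for signal" line as a readiness marker is an assumption based on this one log, not documented operator behaviour. Note that the log above contains an `ERROR:` line (the missing `kernel/dkms.conf`) even though the driver loaded, so an error count alone does not indicate failure.

```shell
#!/bin/sh
# Classify a driver container log read from stdin.
# Prints "ready" if the completion marker seen in the log above is present,
# "pending" otherwise. Using this line as a marker is an assumption.
driver_log_status() {
    if grep -q 'Done, now waiting for signal'; then
        echo ready
    else
        echo pending
    fi
}

# Count installer lines that start with "ERROR:" (read from stdin).
# grep -c exits non-zero when the count is 0, so the exit status is masked.
count_installer_errors() {
    grep -c '^ERROR:' || true
}

# Demo on three lines copied from the log above.
driver_log_status <<'EOF'
ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
EOF
```

On a live cluster, the output of the `microk8s.kubectl logs -n gpu-operator-resources <driver pod>` command shown above could be piped into `driver_log_status` instead of the here-document.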


neoaggelos requested a review from ktsakalozos June 7, 2022 09:37
neoaggelos self-assigned this Jun 7, 2022
Contributor

nobuto-m commented Jun 7, 2022

+1

I've manually tested this and confirmed that it now works in the same scenario as canonical/microk8s#3218.

$ microk8s kubectl logs job.batch/nvidia-smi
Tue Jun  7 10:38:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
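
One way to confirm that the driver reported by `nvidia-smi` matches the version the operator built (510.47.03 in the log from the pull request description) is to pull the version out of the table header. The sketch below assumes the header line keeps the `Driver Version: X.Y.Z` layout shown above; the helper names `smi_driver_version` and `driver_matches` are hypothetical.

```shell
#!/bin/sh
# Print the driver version found in nvidia-smi table output on stdin.
# Relies on the "Driver Version: <digits and dots>" header format above.
smi_driver_version() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -n 1
}

# Succeed only if the version in the nvidia-smi output on stdin equals $1.
driver_matches() {
    expected=$1
    actual=$(smi_driver_version)
    [ "$actual" = "$expected" ]
}

# Demo on the header line copied from the output above.
if driver_matches 510.47.03 <<'EOF'
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
EOF
then
    echo "driver version matches"
fi
```

In practice the input would come from `microk8s kubectl logs job.batch/nvidia-smi`, as in the command above.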
Addon installation log
$ bash -eux ./enable 'noop'
+ set -e
+ source /snap/microk8s/3272/actions/common/utils.sh
++ [[ ./enable == \/\s\n\a\p\/\m\i\c\r\o\k\8\s\/\3\2\7\2\/\a\c\t\i\o\n\s\/\c\o\m\m\o\n\/\u\t\i\l\s\.\s\h ]]
+ readonly CONFIG=/var/snap/microk8s/3272/args/containerd-template.toml
+ CONFIG=/var/snap/microk8s/3272/args/containerd-template.toml
+ readonly SOCKET=/var/snap/microk8s/common/run/containerd.sock
+ SOCKET=/var/snap/microk8s/common/run/containerd.sock
+ echo 'Enabling NVIDIA GPU'
Enabling NVIDIA GPU
+ read -ra ARGUMENTS
+ [[ noop == \f\o\r\c\e\-\s\y\s\t\e\m\-\d\r\i\v\e\r ]]
+ [[ noop == \f\o\r\c\e\-\o\p\e\r\a\t\o\r\-\d\r\i\v\e\r ]]
+ lsmod
+ grep nvidia
+ echo 'Using operator driver'
Using operator driver
+ readonly ENABLE_INTERNAL_DRIVER=true
+ ENABLE_INTERNAL_DRIVER=true
+ sudo mkdir -p /var/snap/microk8s/3272/var/lock
+ sudo touch /var/snap/microk8s/3272/var/lock/gpu
+ /snap/microk8s/3272/microk8s-enable.wrapper dns
Infer repository core for addon dns
Enabling DNS
Applying manifest
serviceaccount/coredns created
configmap/coredns created
deployment.apps/coredns created
service/kube-dns created
clusterrole.rbac.authorization.k8s.io/coredns created
clusterrolebinding.rbac.authorization.k8s.io/coredns created
Restarting kubelet
DNS is enabled
+ /snap/microk8s/3272/microk8s-enable.wrapper helm3
Infer repository core for addon helm3
Enabling Helm 3
Fetching helm version v3.8.0.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.9M  100 12.9M    0     0  35.1M      0 --:--:-- --:--:-- --:--:-- 35.1M
Helm 3 is enabled
+ echo 'Installing NVIDIA Operator'
Installing NVIDIA Operator
+ /snap/microk8s/3272/microk8s-helm3.wrapper repo add nvidia https://nvidia.github.io/gpu-operator
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/3272/credentials/client.config
"nvidia" has been added to your repositories
+ /snap/microk8s/3272/microk8s-helm3.wrapper repo update
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/3272/credentials/client.config
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
+ /snap/microk8s/3272/microk8s-helm3.wrapper install gpu-operator nvidia/gpu-operator --create-namespace --namespace gpu-operator-resources --version=v1.10.1 --set operator.defaultRuntime=containerd --set driver.enabled=true --set 'toolkit.env[0].name=CONTAINERD_CONFIG' --set 'toolkit.env[0].value=/var/snap/microk8s/3272/args/containerd-template.toml' --set 'toolkit.env[1].name=CONTAINERD_SOCKET' --set 'toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock'
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/3272/credentials/client.config
NAME: gpu-operator
LAST DEPLOYED: Tue Jun  7 10:29:19 2022
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
+ echo 'NVIDIA is enabled'
NVIDIA is enabled
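
Read as a whole, the trace above suggests a two-step flow: pick a driver mode from the addon argument and `lsmod`, then pass the result to the Helm install. The sketch below is a reconstruction from the trace only, not the actual `enable` script. The trace shows just the path where the argument is `noop` and `lsmod | grep nvidia` finds nothing, so the other branches are assumptions, as is the link between `ENABLE_INTERNAL_DRIVER` and `driver.enabled`. The function names and the two path parameters are hypothetical; the Helm flags are copied from the install command in the trace.

```shell
#!/bin/sh
# Decide whether the operator should build and load the driver.
# $1: addon argument, $2: lsmod output (passed in so the logic can be tested).
# Only the final "else" path is visible in the trace; the rest is assumed.
use_operator_driver() {
    arg=$1
    modules=$2
    case "$arg" in
        force-system-driver)   echo false ;;
        force-operator-driver) echo true ;;
        *)
            if printf '%s\n' "$modules" | grep -q nvidia; then
                echo false   # assumed: a host driver is already loaded
            else
                echo true    # matches 'Using operator driver' in the trace
            fi
            ;;
    esac
}

# Print the Helm arguments from the trace, one per line, with the
# revision-specific paths turned into parameters.
# $1: snap data dir (e.g. /var/snap/microk8s/3272)
# $2: snap common dir (e.g. /var/snap/microk8s/common)
# $3: value for driver.enabled
gpu_operator_helm_args() {
    snap_data=$1
    snap_common=$2
    driver_enabled=$3
    printf '%s\n' \
        install gpu-operator nvidia/gpu-operator \
        --create-namespace --namespace gpu-operator-resources \
        --version=v1.10.1 \
        --set operator.defaultRuntime=containerd \
        --set "driver.enabled=${driver_enabled}" \
        --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
        --set "toolkit.env[0].value=${snap_data}/args/containerd-template.toml" \
        --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
        --set "toolkit.env[1].value=${snap_common}/run/containerd.sock"
}

# Demo: the situation in the trace (argument "noop", no nvidia module loaded).
mode=$(use_operator_driver noop "")
gpu_operator_helm_args /var/snap/microk8s/3272 /var/snap/microk8s/common "$mode"
```

Printing the arguments instead of invoking Helm keeps the sketch free of side effects; the list can be compared against the command in the trace before anything is installed.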

Member

ktsakalozos left a comment

LGTM +1 thank you @neoaggelos

ktsakalozos merged commit 6ad6655 into main Jun 8, 2022
ktsakalozos deleted the MK-553/update-gpu-addon branch June 8, 2022 05:26
@ktsakalozos
Member

@neoaggelos could we please backport this fix to the rest of the supported microk8s versions? Thank you.

Contributor

nobuto-m commented Jun 8, 2022

Also, can you take a look at canonical/microk8s#3226? It looks like 1.21 has a different problem than a straightforward backport.

nobuto-m added a commit to nobuto-m/microk8s that referenced this pull request Jun 9, 2022
nobuto-m pushed a commit to nobuto-m/microk8s that referenced this pull request Jun 9, 2022
(cherry picked from commit b32babfc30dba2075ae4b05d72bde1da57da919d)
neoaggelos added a commit that referenced this pull request Jun 9, 2022
nobuto-m pushed a commit to nobuto-m/microk8s that referenced this pull request Jun 9, 2022
(cherry picked from commit b32babfc30dba2075ae4b05d72bde1da57da919d)
neoaggelos mentioned this pull request Jun 20, 2022
@neoaggelos
Contributor Author

> Also, can you take a look at canonical/microk8s#3226? It looks like 1.21 has a different problem than a straightforward backport.

For 1.21, see canonical/microk8s#3226 (comment)

4 participants