NodeStageVolume fails due to value mismatch in v3.0.1 #591

Closed
kael-k opened this issue May 20, 2022 · 4 comments

kael-k commented May 20, 2022

/kind bug

What happened

A few days ago I updated directpv to 3.0.0, which caused some high-load problems.
After noticing it, I shut down the whole cluster in order to try to fix it the next day (n.b.: it is a self-hosted home cluster, not a production environment at work).
So yesterday I updated to 3.0.1 and I also updated all the nodes hosting the infrastructure (in particular, I updated Kubernetes from 1.22 to 1.23).


After all these updates and reboots, the load seemed to have returned to normal, but then I noticed some pods weren't starting: apparently no volume on node kaelk-t495-vm-k8s-worker-1 can be mounted.

However, volumes on other nodes are mounted correctly.

Example:

❯ k get pods -n  postgres
NAME                                            READY   STATUS              RESTARTS        AGE
postgres-postgresql-ha-pgpool-b7cf668b7-xlv5c   1/1     Running             2 (2m55s ago)   33m
postgres-postgresql-ha-postgresql-0             1/1     Running             0               33m    # directpv-provisioned PVC mounted on node kaelk-fujitsu-futro-s920-k8s-worker-storage-1, works
postgres-postgresql-ha-postgresql-1             1/1     Running             2 (33m ago)     33m # directpv-provisioned PVC mounted on node kaelk-fujitsu-futro-s920-k8s-worker-storage-2, works
postgres-postgresql-ha-postgresql-2             0/1     ContainerCreating   0               44s # directpv-provisioned PVC mounted on node kaelk-t495-vm-k8s-worker-1, does not work
❯ k describe pod -n postgres postgres-postgresql-ha-postgresql-2
# [...]
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    2m55s                default-scheduler  Successfully assigned postgres/postgres-postgresql-ha-postgresql-2 to kaelk-t495-vm-k8s-worker-1
  Warning  FailedMount  2m54s                kubelet            MountVolume.SetUp failed for volume "hooks-scripts" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  52s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-stnc5 data hooks-scripts]: timed out waiting for the condition
  Warning  FailedMount  47s (x9 over 2m55s)  kubelet            MountVolume.MountDevice failed for volume "pvc-42e77e3c-a5b8-42a3-be7d-1da30b73976a" : rpc error: code = Internal desc = drive value mismatch

Sometimes the events also contained the error 'driver name direct-csi-min-io not found in the list of registered CSI drivers'.

These are the logs of the pod created by the DaemonSet on the kaelk-t495-vm-k8s-worker-1 node:

❯ k logs -n direct-csi-min-io direct-csi-min-io-6bfrq --all-containers
I0520 06:27:27.449453    8480 main.go:113] Version: v2.2.0-go1.18
I0520 06:27:27.449949    8480 main.go:137] Attempting to open a gRPC connection with: "unix:///csi/csi.sock"
I0520 06:27:28.451703    8480 main.go:144] Calling CSI driver to discover driver name
I0520 06:27:28.457848    8480 main.go:154] CSI driver name: "direct-csi-min-io"
I0520 06:27:28.458024    8480 node_register.go:52] Starting Registration Server at: /registration/direct-csi-min-io-reg.sock
I0520 06:27:28.458725    8480 node_register.go:61] Registration Server started at: /registration/direct-csi-min-io-reg.sock
I0520 06:27:28.459306    8480 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0520 06:27:28.630559    8480 main.go:80] Received GetInfo call: &InfoRequest{}
I0520 06:27:28.723334    8480 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I0520 06:27:27.673977    8514 init.go:108] obtained client config successfully
I0520 06:27:27.763821    8514 webhook.go:72] Starting conversion webhook server in port: 30443
I0520 06:27:27.781133    8514 run.go:169] The conversion webhook is live!
I0520 06:27:27.781148    8514 run.go:175] identity server started
I0520 06:27:27.913475    8514 run.go:183] enabled reflink while formatting
I0520 06:27:27.915614    8514 leaderelection.go:243] attempting to acquire leader lease direct-csi-min-io/kaelk-t495-vm-k8s-worker-1-drive-controller...
I0520 06:27:27.918728    8514 leaderelection.go:243] attempting to acquire leader lease direct-csi-min-io/kaelk-t495-vm-k8s-worker-1-volume-controller...
I0520 06:27:27.924462    8514 run.go:222] node server started
I0520 06:27:27.932071    8514 metrics.go:41] Starting metrics exporter in port: 10443
I0520 06:27:27.940001    8514 leaderelection.go:253] successfully acquired lease direct-csi-min-io/kaelk-t495-vm-k8s-worker-1-volume-controller
I0520 06:27:27.940123    8514 listener.go:383] new leader detected, current leader: kaelk-t495-vm-k8s-worker-1
I0520 06:27:27.940674    8514 listener.go:376] became leader, starting
I0520 06:27:27.940907    8514 listener.go:315] Starting volume controller
I0520 06:27:27.941085    8514 reflector.go:219] Starting reflector *v1beta4.DirectCSIVolume (1m0s) from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:27.941196    8514 reflector.go:255] Listing and watching *v1beta4.DirectCSIVolume from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:27.944077    8514 leaderelection.go:253] successfully acquired lease direct-csi-min-io/kaelk-t495-vm-k8s-worker-1-drive-controller
I0520 06:27:27.944671    8514 listener.go:383] new leader detected, current leader: kaelk-t495-vm-k8s-worker-1
I0520 06:27:27.944700    8514 listener.go:376] became leader, starting
I0520 06:27:27.944882    8514 listener.go:315] Starting drive controller
I0520 06:27:27.945014    8514 reflector.go:219] Starting reflector *v1beta4.DirectCSIDrive (1m0s) from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:27.945030    8514 reflector.go:255] Listing and watching *v1beta4.DirectCSIDrive from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:28.046268    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:28.046491    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:28.046529    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.046654    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.046812    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:28.046655    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:28.146891    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:28.147012    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:28.147172    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.147257    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:28.147707    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.147974    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:28.347946    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:28.348087    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:28.348270    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.348354    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.348990    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:28.348992    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:28.748768    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:28.749039    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.749134    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:28.749396    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:28.749611    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:28.749740    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:29.549817    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:29.549884    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:29.549981    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:29.550067    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:29.550075    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:29.550438    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:31.151134    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:31.151186    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:31.151453    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:31.151490    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:31.151368    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:31.151642    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:34.352435    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:34.352658    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:34.352667    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:34.352761    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:34.352899    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:34.352915    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:40.754549    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:27:40.754741    8514 utils.go:80] [sda] wwid msmatch: t10.ATA     QEMU HARDDISK                           QM00001 -> 
I0520 06:27:40.754554    8514 utils.go:96] [vdb] vendor mismatch: 0x1af4 -> 
I0520 06:27:40.754823    8514 utils.go:112] [sr0] PartTableUUID mismatch: bff8ee56 -> 
I0520 06:27:40.754833    8514 utils.go:84] [sda] modelnumber mismatch:  -> QEMU_HARDDISK
I0520 06:27:40.754904    8514 utils.go:80] [sr0] wwid msmatch: t10.ATA     QEMU DVD-ROM                            QM00002 -> 
I0520 06:27:27.841305    8555 connection.go:153] Connecting to unix:///csi/csi.sock
I0520 06:27:28.842737    8555 main.go:149] calling CSI driver to discover driver name
I0520 06:27:28.845290    8555 main.go:155] CSI driver name: "direct-csi-min-io"
I0520 06:27:28.845324    8555 main.go:183] ServeMux listening at "0.0.0.0:9898"
I0520 06:27:28.104943    8623 init.go:108] obtained client config successfully
I0520 06:27:28.180095    8623 reflector.go:219] Starting reflector *v1beta4.DirectCSIDrive (30s) from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:28.180140    8623 reflector.go:255] Listing and watching *v1beta4.DirectCSIDrive from k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167
I0520 06:27:28.280500    8623 indexer.go:53] indexer successfully synced
I0520 06:27:28.289948    8623 utils.go:174] [sda] mismatch serial QM00001          - 
I0520 06:27:28.345858    8623 utils.go:64] [/dev/vda] path mismatch: /dev/vdb -> %!v(MISSING)
I0520 06:27:28.371953    8623 utils.go:64] [/dev/vdb] path mismatch: /dev/vda -> %!v(MISSING)

[...]

I0520 06:28:27.973880    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
I0520 06:28:32.279673    8514 stage_unstage.go:39] "NodeStageVolumeRequest" volumeID="pvc-42e77e3c-a5b8-42a3-be7d-1da30b73976a" StagingTargetPath="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-42e77e3c-a5b8-42a3-be7d-1da30b73976a/globalmount"
I0520 06:28:32.342118    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
E0520 06:28:32.342171    8514 server.go:92] GRPC error: rpc error: code = Internal desc = drive value mismatch
I0520 06:28:33.588767    8514 stage_unstage.go:39] "NodeStageVolumeRequest" volumeID="pvc-8d67ba5d-9943-4f5a-b6e8-5515c10928a3" StagingTargetPath="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8d67ba5d-9943-4f5a-b6e8-5515c10928a3/globalmount"
I0520 06:28:33.625306    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
E0520 06:28:33.625635    8514 server.go:92] GRPC error: rpc error: code = Internal desc = drive value mismatch
I0520 06:28:35.102031    8514 stage_unstage.go:39] "NodeStageVolumeRequest" volumeID="pvc-5616ffd8-6465-4120-9f79-98e2cd847712" StagingTargetPath="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5616ffd8-6465-4120-9f79-98e2cd847712/globalmount"
I0520 06:28:35.140995    8514 utils.go:96] [vda] vendor mismatch: 0x1af4 -> 
E0520 06:28:35.141046    8514 server.go:92] GRPC error: rpc error: code = Internal desc = drive value mismatch
I0520 06:27:27.841305    8555 connection.go:153] Connecting to unix:///csi/csi.sock
I0520 06:27:28.842737    8555 main.go:149] calling CSI driver to discover driver name
I0520 06:27:28.845290    8555 main.go:155] CSI driver name: "direct-csi-min-io"
I0520 06:27:28.845324    8555 main.go:183] ServeMux listening at "0.0.0.0:9898"
I0520 06:27:28.104943    8623 init.go:108] obtained client config successfully

Other (possibly) useful info

On the machine there is no XFS drive mounted:

[root@kaelk-t495-vm-k8s-worker-1 ~]# mount | grep xfs
[root@kaelk-t495-vm-k8s-worker-1 ~]#

kael-k commented May 20, 2022

I also tried:

  • reinstalling directpv: same problem
  • manually mounting /dev/vda on kaelk-t495-vm-k8s-worker-1 (to check whether the mount process gave me any error), but it mounted without any error (see the sketch below)
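
For reference, the manual check was roughly like this (a sketch; /mnt/test is just a temporary mount point I picked, nothing required by directpv):

[root@kaelk-t495-vm-k8s-worker-1 ~]# mkdir -p /mnt/test
[root@kaelk-t495-vm-k8s-worker-1 ~]# mount /dev/vda /mnt/test    # completed without any error
[root@kaelk-t495-vm-k8s-worker-1 ~]# umount /mnt/test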

Praveenrajmani (Collaborator) commented:

Sometimes the events also contained the error 'driver name direct-csi-min-io not found in the list of registered CSI drivers'

These errors are expected when there are node/kubelet restarts.

I0520 06:28:33.588767 8514 stage_unstage.go:39] "NodeStageVolumeRequest" volumeID="pvc-8d67ba5d-9943-4f5a-b6e8-5515c10928a3" StagingTargetPath="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8d67ba5d-9943-4f5a-b6e8-5515c10928a3/globalmount"
I0520 06:28:33.625306 8514 utils.go:96] [vda] vendor mismatch: 0x1af4 ->
E0520 06:28:33.625635 8514 server.go:92] GRPC error: rpc error: code = Internal desc = drive value mismatch
I0520 06:28:35.102031 8514 stage_unstage.go:39] "NodeStageVolumeRequest" volumeID="pvc-5616ffd8-6465-4120-9f79-98e2cd847712" StagingTargetPath="/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5616ffd8-6465-4120-9f79-98e2cd847712/globalmount"
I0520 06:28:35.140995 8514 utils.go:96] [vda] vendor mismatch: 0x1af4 ->

It looks like these are virtual disks. DirectPV recognizes disks by their hardware properties such as WWID, serial number, model, vendor, etc. So, generally, it is not recommended for virtual disks, where these persistent properties are not reliable.

While staging a volume, directpv checks whether the drive on which the volume is scheduled is the right one by comparing its hardware properties and looking for a mismatch. Here, it looks like the vendor property isn't matching: earlier the vendor ID used to be 0x1af4, and it is unset now. So directpv refuses to mount the volume on that drive. You need to make sure these hardware properties are set properly and remain unchanged.
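
As a quick sanity check (a minimal sketch, not part of directpv itself; adjust the device and resource names to your install), you can compare what the node currently reports for the disk with what directpv recorded when the drive was formatted:

❯ lsblk -o NAME,WWN,SERIAL,VENDOR,MODEL /dev/vda
❯ udevadm info --query=property /dev/vda | grep -Ei 'ID_VENDOR|ID_SERIAL|ID_WWN|ID_MODEL'
❯ kubectl get directcsidrives -o wide    # assuming the v1beta4 DirectCSIDrive CRD is installed as directcsidrives

If the values reported by the node differ from the ones stored in the DirectCSIDrive object (as in the "vendor mismatch: 0x1af4 ->" lines above), staging is expected to fail with the "drive value mismatch" error.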

Some of the details on virtual disk support are explained in #580.


Praveenrajmani commented May 23, 2022

You can refer to https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_administration_guide/sect-managing_guest_virtual_machines_with_virsh-attaching_and_updating_a_device_with_virsh to configure the disks with a vendor ID. Still, I am not sure whether these properties will be wiped out on restarts.

You can also check the --persistent and --current flags while attaching the device; a sketch is shown below.
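
For example (a rough, untested sketch: the source path and serial are placeholders, and whether a fixed serial alone is enough for directpv's checks would need to be verified), the disk definition can carry a stable identifier and be attached persistently:

<!-- disk.xml: virtio disk with a fixed serial so the identifier survives restarts -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/your-backing-device'/>
  <target dev='vda' bus='virtio'/>
  <serial>directpv-disk-001</serial>
</disk>

❯ virsh attach-device kaelk-t495-vm-k8s-worker-1 disk.xml --persistent

As far as I know, libvirt's <vendor>/<product> elements apply to SCSI disks only, so for virtio-blk disks the vendor the guest sees (0x1af4) comes from the virtio driver itself.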

Praveenrajmani changed the title from "Directpv mount stopped working on a single node after update to 3.0.1" to "NodeStageVolume fails due to value mismatch in v3.0.1" on May 23, 2022

kael-k commented May 24, 2022

Hi @Praveenrajmani, these days I am unable to check whether your suggestion fixes the bug (because in the meantime I manually migrated the PVCs to the other nodes). However, it is most probably what you said: the other nodes are bare metal with a physical partition, and the only one that has a problem is the only one running in a VM with a virtual disk.
