Workload unbind VF driver #443

Open
zeeke opened this issue May 22, 2023 · 0 comments

zeeke commented May 22, 2023

This issue is about discussing the scenario where the user workload (or any actor other than the sriov-config-daemon) unbinds the Virtual Function driver while a VF is assigned to a Pod. When this happens, the VF remains in an unusable state, and subsequent pods using that device raise errors like:

Warning  FailedCreatePodSandBox  148m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = 
    failed to create pod network sandbox k8s_test-deployment-66b745fc5c-c64r8_ocpubgs-13574_833b6235-4d08-436f-8bb7-3f20a141748a_0(a3f026aa4229536eb5ebadd839500e44945407fd588f6e0ab202c554c2bd3088): 
    error adding pod ocpubgs-13574_test-deployment-66b745fc5c-c64r8 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" 
    failed (add): [ocpubgs-13574/test-deployment-66b745fc5c-c64r8/833b6235-4d08-436f-8bb7-3f20a141748a:network-ocpubgs-13574]: 
    error adding container to network "network-ocpubgs-13574": 
    SRIOV-CNI failed to load netconf: 
    LoadConf(): failed to detect if VF 0000:19:02.2 has dpdk driver "lstat /sys/devices/pci0000:17/0000:17:02.0/0000:19:02.2/driver: no such file or directory"

Unbinding the driver is not a supported way of using the operator, so this problem can be addressed by covering it in the user documentation.
It is also tricky to detect: the pod that raises the error is innocent (correctly configured), and tracing the actual culprit can be hard.

a. Does it make sense to increase the sriov-config-daemon's resilience by rebinding the VF driver whenever it diverges from the expected one?
b. If so, would it be simpler to implement this behavior in sriov-cni, e.g. by adding a driver check before running a Pod?
