
Add Smart NIC OVN offload enhancement #732

Merged

Conversation

@zshi-redhat (Contributor)

No description provided.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dcbw after the PR has been reviewed.
You can assign the PR to them by writing /assign @dcbw in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zshi-redhat (Contributor, Author)

@zshi-redhat force-pushed the smart-nic-ovn-offload branch 2 times, most recently from 5623627 to fbc7e0a on April 27, 2021 08:42
@zshi-redhat (Contributor, Author)

Added a paragraph in the #### SR-IOV Network Operator section to explain why VF 0 is not advertised to the kubernetes node status as an extended resource.
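For illustration, one way an SR-IOV policy could avoid advertising VF 0 is the VF-partitioning syntax in `pfNames`, which exposes only a range of VF indices to the device plugin. The policy below is a sketch with placeholder names and counts, not text from the enhancement:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: smartnic-vfs                 # placeholder name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: smartnic_vf          # placeholder resource name
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8                          # illustrative VF count
  nicSelector:
    pfNames: ["ens1f0#1-7"]          # expose only VFs 1-7; VF 0 stays reserved
  deviceType: netdevice
```

With a range like this, the device plugin never reports VF 0 in the node's allocatable resources, which matches the behavior described above.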

### Non-Goals

- Provisioning and managing the OVN-Kubernetes components on SmartNIC ARM cores
- Support multiple SmartNICs on one OpenShift baremetal worker node
Contributor

These are SmartNIC cards, correct? Do we need to distinguish cards vs. the two high-speed interfaces on each card? Are we going to support both interfaces?

Contributor Author

Yes, these are the SmartNIC cards.

Good point; we only used one port in the PoC. I think it's possible to use bonded interfaces, which we haven't tried yet.
From yesterday's discussion: if we go with single-cluster mode (deploy RHCOS and run a stand-alone kubelet on the SmartNIC), we may or may not use the second interface on the BF, depending on whether the ovn-k8s-managed OVS bridge can be updated to arbitrate traffic from the two networks (x86 host cluster network, BlueField2 host network). If the answer to the OVS change is yes, then we don't have to use the second interface unless a bond is required; if the answer is no, then we have to use the second interface (to isolate traffic between the x86 host cluster network and the BlueField2 host network).

Given the above, I think we may not want to exclude the use of the second BlueField interface at this point, wdyt?

SmartNIC OVN offload will be enabled with OVN-Kubernetes CNI only.
It is expected that the following changes will be made:

#### OVN-Kubernetes
Contributor

I don't see anywhere that we mention that OvS is running on the ARM as a system service and OVN is running as a container. Should we describe that here or below in Design Details? Also, should we add an open question about running Kubelet in stand-alone mode to manage the OVN container, or is that more in the other enhancement (#738)?

Contributor Author

Added three bullets in #### OVN-Kubernetes to describe how the OVN components and OVS run on the SmartNIC, SmartNIC node, and non-SmartNIC node.
Added a new section #### Container lifecycle management on SmartNIC in ### Open Questions about RHCOS provisioning and Kubelet in stand-alone mode on the SmartNIC.

@zshi-redhat force-pushed the smart-nic-ovn-offload branch 2 times, most recently from 265988b to 5643033 on April 28, 2021 02:19
isolation of PCIe resources
- SR-IOV Pods: SR-IOV Pods are pods bound to a VF which is used to send and
receive packets on.
- PF: Physical Function, the physical ethernet controller that supports SR-IOV
Contributor

For purposes of this enhancement, I think you could remove all the references to PFs and only talk about VFs instead. (ie, everything you might say about PFs here is already covered by the generic SR-IOV work, and talking about PFs doesn't add anything useful in the context of OVN offload)

(Having said that, now I'm realizing that you talk about "VF 0" a bunch below, which I assume means "VF 0 on PF 0", so maybe you do need to talk about PFs?)

Contributor Author

I was trying to explain where the VFs come from (VFs are created from the PF).
Since we only use one PF (one uplink on the SmartNIC node) in the PoC, VF 0 could be from either PF 0 (uplink 0) or PF 1 (uplink 1). VF 0 here is a restriction imposed by us, not the hardware; in theory any VF can be used as the ovn-k8s management port. Having VF 0 as the ovn-k8s management port is easy to orchestrate and to keep persistent across node reboots or kernel/system upgrades.
I didn't mention bonds (which would require two PFs) since they're not in this initial delivery.

- SmartNIC: A combined Network Interface Card (NIC) and CPU hardware which
offloads processing tasks that the system CPU would normally handle
- SmartNIC node(s): Baremetal servers where SmartNICs are plugged to
- SmartNIC ARM cores: The embedded aarch64 system running on SmartNIC CPU(s)
Contributor

Maybe we need a meta-enhancement laying out the common ground between this one and #738... (In particular I'm thinking that now because I mention in #738 that the term "Smart NIC" is used ambiguously by vendors, and not everything that is being called a "Smart NIC" is something we could actually support for offload...)

(I think I'm going to work on this...)

Contributor Author

Will this meta-enhancement be a separate one?


- SmartNIC: A combined Network Interface Card (NIC) and CPU hardware which
offloads processing tasks that the system CPU would normally handle
- SmartNIC node(s): Baremetal servers where SmartNICs are plugged to
Contributor

I've been calling that the "host" (or "Smart NIC host"). Especially because, once we're running OCP on the NICs, then "Smart NIC node" would sound like it means "an OCP node corresponding to a Smart NIC ARM core".

Contributor Author

Added SmartNIC host(s), which refers to the x86 server the SmartNIC is plugged into.
Updated SmartNIC node(s) to refer to the OpenShift node provisioned on the SmartNIC ARM cores.

There are several possible ways to avoid the failure:

- Modify cluster pod manifests to run with `hostNetwork: true` which doesn't
require OVN-Kubernetes CNI to create pod network
Contributor

This is in general not a good solution because being hostNetwork gives pods a lot of additional privilege...

Contributor Author

Explicitly mentioned the drawback of using hostNetwork: true.
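For reference, a minimal sketch of the hostNetwork approach being discussed; the pod name and image are placeholders, not taken from the enhancement:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-cluster-pod                  # placeholder name
spec:
  hostNetwork: true                          # pod uses the node's network namespace,
                                             # so the OVN-Kubernetes CNI is never invoked
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```

This avoids the CNI failure on SmartNIC nodes, but, as noted above, it grants the pod host-network privileges, which is the drawback called out in the enhancement.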

in cluster pod manifests, since the cluster pods are usually daemonsets or
deployments, adding SR-IOV resource request in daemonsets or deployments will
result in their pods be scheduled only on nodes that advertising the SR-IOV
resources, this doesn't work for non-SmartNIC node
Contributor

Should probably explicitly talk about the idea of using a mutating admission webhook.

The big problem with this was that the webhook runs before the pod is scheduled, so it can't know whether the pod is going to end up on a "smart" or "dumb" node.

But I realized that for the case of pods created by a DaemonSet, the DaemonSet controller allocates the pods to nodes itself and creates them with nodeName already set. So maybe we could make this work automatically for DaemonSet-created pods (eg, dns, network-check-target), and just ensure that Deployment-created pods don't have the right tolerations/labelSelectors to end up on the smart nodes? (Implying that at least some dumb workers are needed, for stuff like the router.) (Although also, really, customers are probably eventually going to want to have the router on a smart node.)

(Long term, I still think the fix is that we shouldn't be using pod resources here; the CNI plugin should just allocate the VF itself automatically, and if it runs out of VFs, it should be able to indicate that to kubelet. That's really no different from when the CNI plugin tries to create a pod but discovers that it has used up all the pod IPs in the node's subnet [which the CNI plugin doesn't currently have a good way to report back to kubelet, but it should].)

Contributor Author

Should probably explicitly talk about the idea of using a mutating admission webhook.

The big problem with this was that the webhook runs before the pod is scheduled, so it can't know whether the pod is going to end up on a "smart" or "dumb" node.

Added a new bullet about using an admission controller.

But I realized that for the case of pods created by a DaemonSet, the DaemonSet controller allocates the pods to nodes itself and creates them with nodeName already set. So maybe we could make this work automatically for DaemonSet-created pods (eg, dns, network-check-target), and just ensure that Deployment-created pods don't have the right tolerations/labelSelectors to end up on the smart nodes? (Implying that at least some dumb workers are needed, for stuff like the router.) (Although also, really, customers are probably eventually going to want to have the router on a smart node.)

This sounds like a good approach for daemonsets if the admission controller can get the nodeName from the pod spec before the pod is actually scheduled to the node's kubelet. It also enables the cluster pods to use SR-IOV devices, but I guess this doesn't matter much as we have 128x2 VFs from one BF2 card.

For deployments, maybe we can reuse the labels introduced in CNO (network.operator.openshift.io/ovn-offload-host) and make other cluster pods aware of this well-known label in their manifests. The label means ovn-k8s is running in a special mode, and I feel there are reasons for cluster pods to accommodate or recognize the label for the network mode change.
The label would also be useful in the future if there is a need to run cluster pods on/off the SmartNIC host.
Further to the above label, we may want another label for pods that need to run on the SmartNIC node (a label that distinguishes it from the full-mode host and the SmartNIC host).

(Long term, I still think the fix is that we shouldn't be using pod resources here; the CNI plugin should just allocate the VF itself automatically, and if it runs out of VFs, it should be able to indicate that to kubelet. That's really no different from when the CNI plugin tries to create a pod but discovers that it has used up all the pod IPs in the node's subnet [which the CNI plugin doesn't currently have a good way to report back to kubelet, but it should].)

Several considerations (not fully thought through yet):

  1. SR-IOV may not be the only driver feature that allows use of SmartNIC offload; there is another feature (e.g. the sub-function device) which would unlock the SR-IOV device number limitation (128 per port). How would ovn-k8s distinguish whether the user wants a SR-IOV pod or a pod with a sub-function device? It may require a new config in the ovn-k8s CNI to select the device type (SR-IOV or sub-function) and be made configurable in the net-attach-def. This, however, will eventually hit the same issue: cluster pods need to tell or express which CNI config they want to use.
    Also, with multiple SmartNICs installed on one x86 node (if that becomes a use case), ovn-k8s cannot tell from which SmartNIC the SR-IOV VF should be taken to create the pod.
  2. With kubelet (the device plugin framework) allocating the VF device, it allows passing special device specs and mounts when creating containers; this cannot simply be done with the CNI interface, since it's too late for the CNI to do mounts during pod creation.
  3. The device plugin framework can prevent a pod from being scheduled on a node that doesn't have sufficient resources, so kubelet doesn't need to handle resource shortage. It would also differ in how the user sees the failure: with the device plugin, the pod stays pending until resources are available; with CNI, it would likely be a failure (then the pod may be re-scheduled to another node if kubelet handles the resource shortage nicely).

In short, we would lose (and need to rebuild in ovn-k8s) the capabilities provided by the k8s device-plugin framework or any features built on top of it, and there are new challenges in managing limited devices in ovn-k8s (which may be out of CNI scope).
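For context, the resource-request approach discussed in this thread would look roughly like the following in a DaemonSet pod template; the resource name here is a placeholder, since the actual name comes from the SR-IOV device plugin configuration:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-cluster-ds                       # placeholder name
spec:
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest   # placeholder image
        resources:
          requests:
            openshift.io/smartnic_vf: "1"        # placeholder SR-IOV resource name
          limits:
            openshift.io/smartnic_vf: "1"
```

Because the request is part of the pod template, every pod of the DaemonSet would only schedule onto nodes advertising that resource, which is exactly why this approach breaks for non-SmartNIC nodes.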

Contributor Author

This sounds like a good approach for daemonsets if the admission controller can get the nodeName from the pod spec before the pod is actually scheduled to the node's kubelet.

I did an experiment with the sriov webhook: for a daemonset pod, the pod.Spec.NodeName received in the webhook is empty.

resources, this doesn't work for non-SmartNIC node
- Find a way to enable Open vSwitch for non-SR-IOV pods on SmartNIC node
- Taint the SmartNIC nodes so that pods without corresponding toleration won't
be scheduled to SmartNIC nodes
Contributor

I think we definitely need to taint the nodes; we need to prevent customer workload pods from being scheduled there unintentionally as well
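A sketch of the tainting idea; the taint key and pod names below are placeholders, as the excerpt here does not name them:

```yaml
# Taint applied to SmartNIC nodes (placeholder key), e.g.:
#   kubectl taint nodes <smartnic-node> smartnic.example.com/offload="":NoSchedule
#
# Pods that must run on SmartNIC nodes then carry a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: example-smartnic-pod                 # placeholder name
spec:
  tolerations:
  - key: smartnic.example.com/offload        # placeholder taint key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```

Pods without the toleration (including customer workloads) would never land on the tainted SmartNIC nodes unintentionally.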

the same or a second OpenShift Cluster (vs SmartNIC node cluster). Initial
agreement from the virtual team working on the BlueField PoC is that we will
pursue the following approach for a short-term solution on SmartNIC
provisioning and deployment:
Contributor

(Not sure this is really needed as part of this enhancement... Especially since it's still in flux.)

Contributor Author

Removed #### Container lifecycle management on SmartNIC.


There are three modes introduced in OVN-Kubernetes for running ovnkube-node:
- **full**: No change to current ovnkube-node logic, runs on Regular OpenShift nodes
- **smart-nic**: ovnkube-node on SmartNIC ARM cores, plumbs VF representor device


nit: it also initializes, configures, and manages ovn-kubernetes gateway and management port functionality related to OVS running on the SmartNIC ARM cores,

e.g. configures the VF rep port for the management port, manages openflow rules on br-ex.

There are three modes introduced in OVN-Kubernetes for running ovnkube-node:
- **full**: No change to current ovnkube-node logic, runs on Regular OpenShift nodes
- **smart-nic**: ovnkube-node on SmartNIC ARM cores, plumbs VF representor device
- **smart-nic-host**: ovnkube-node on SmartNIC host, runs cniserver and plumbs VF device


nit: it also configures the host-side management port, sets up appropriate routes for the management port, and manages ovn-kubernetes gateway host-side functionality, e.g. adds routes and iptables rules for node ports and health checks for external load balancers.
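For illustration, a hypothetical sketch of how these modes could be wired into the ovnkube-node deployment; the names, image, and the --ovnkube-node-mode flag are assumptions for this sketch, not stated in the excerpt above:

```yaml
# Hypothetical ovnkube-node DaemonSet for SmartNIC ARM cores.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovnkube-node-smart-nic              # placeholder name
  namespace: openshift-ovn-kubernetes
spec:
  selector:
    matchLabels:
      app: ovnkube-node-smart-nic
  template:
    metadata:
      labels:
        app: ovnkube-node-smart-nic
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64           # SmartNIC ARM cores report arm64
      containers:
      - name: ovnkube-node
        image: registry.example.com/ovn-kubernetes:latest   # placeholder image
        command: ["/usr/bin/ovnkube"]
        args:
        - --ovnkube-node-mode=smart-nic     # assumed flag; "smart-nic-host" on the x86 host, "full" elsewhere
```

The smart-nic-host variant on the x86 side would then handle the host-side management port and gateway functionality described in the nit above.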

purposes, it is possible to restrict the host from performing operations that
can compromise the DPU.

BlueField2 needs to be configured to **Embedded CPU** or **Restricted** mode


I think this operation may require a power cycle rather than a reboot, as the ARM OS needs to reboot as well.
Need to test it out on a system to validate.

In your lab, when moving from separated host to embedded CPU mode, was a reboot sufficient? If so, then disregard my comment.


We also require a recent enough firmware. I'm missing management of that aspect here. Note that after updating firmware, both the host and the Bluefield 2 need to be rebooted (similarly to the remark from Adrian, a power cycle may be required).

This is currently done through ssh or console from SmartNIC host using rshim
[userspace driver](https://bugzilla.redhat.com/show_bug.cgi?id=1744737).
Rshim userspace driver needs to be made availabe in RHCOS image or downloadable
in Universal Base Image (UBI).


note: mlxbf_gige driver has been accepted upstream

torvalds/linux@f92e186

- Find a way to enable Open vSwitch for non-SR-IOV pods on SmartNIC host
- Taint the SmartNIC hosts so that pods without corresponding toleration won't
be scheduled to SmartNIC hosts


Probably not a stellar idea, but if we had a way for kubelet to support "default resources" for (non-host-network) pods, it would solve the issue.

If the cluster is homogeneous (all k8s workers with Smart-NICs), then the resource can be injected by default.
Subfunctions will allow better scale (currently capped at 500).

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 30d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label Sep 21, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label Sep 28, 2021
@zshi-redhat (Contributor, Author)

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten label Oct 13, 2021
@zshi-redhat (Contributor, Author)

/reopen

@openshift-ci openshift-ci bot reopened this Oct 15, 2021
@openshift-ci (Contributor)

openshift-ci bot commented Oct 15, 2021

@zshi-redhat: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@danwinship (Contributor)

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 3, 2021
@danwinship (Contributor)

/approve

@openshift-ci (Contributor)

openshift-ci bot commented Nov 8, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Nov 8, 2021
@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

Signed-off-by: Zenghui Shi <zshi@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Nov 9, 2021
@zshi-redhat (Contributor, Author)

Fixed the markdownlint check.

@danwinship (Contributor)

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 9, 2021
@openshift-merge-robot openshift-merge-robot merged commit abd6f7e into openshift:master Nov 9, 2021