
sandboxed containers enhancement proposal #677

Merged: 1 commit, Mar 24, 2021

Conversation

@fidencio commented Mar 4, 2021

This supersedes #366, as it provides fixes for the lint.

A new PR has been opened because Jens Freimann, the person handling #366, recently became a father and is out on paternity leave.

@fidencio (Author) commented Mar 4, 2021

/test markdownlint

@fidencio (Author) commented Mar 4, 2021

Closing this one as we were able to unblock #366.

@fidencio (Author) commented Mar 4, 2021

/test markdownlint

1 similar comment
@fidencio (Author) commented Mar 4, 2021

/test markdownlint

@fidencio (Author) commented Mar 5, 2021

/test markdownlint

@cgwalters (Member) left a comment

Overall seems sane to me!

### KataContainers Operator development
The goal is to develop a kata operator that can be used to manage the entire lifecycle of kata components in an OpenShift cluster. We will have a controller that watches for a kata **custom resource (CR)** and a daemon-set that acts upon changes to the CR and does the required work on the worker nodes (install, uninstall, update, ...).

Via the CR it will be possible to select a subset of worker nodes.
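As a hypothetical sketch, a CR selecting a subset of worker nodes could look like the following; the API group, kind, and field names are illustrative, not necessarily the operator's actual API:

```yaml
# Hypothetical sketch of the kata CR; names are illustrative.
apiVersion: kataconfiguration.openshift.io/v1
kind: KataConfig
metadata:
  name: example-kataconfig
spec:
  # Selects the subset of worker nodes the daemon-set acts on;
  # omitting it would target all worker nodes (the default behaviour).
  kataConfigPoolSelector:
    matchLabels:
      custom-kata-pool: "true"
```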
Member:

Why is this out of curiosity? Why not e.g. enable the feature on all available nodes but use regular nodeSelector for steering?

Is it because it fully takes over the crio backend and so hence all containers on that node are kata? That seems unfortunate if true.

@fidencio (Author) Mar 5, 2021

Hey @cgwalters!

Why is this out of curiosity?

The default behaviour is to deploy on all available worker nodes. However, it's been decided to leave the option for the admin to select the nodes they want, if they don't want all the worker nodes using Kata Containers.

Why not e.g. enable the feature on all available nodes but use regular nodeSelector for steering?

That would simplify our code, indeed. Shall we consider removing the option from the operator?

Is it because it fully takes over the crio backend and so hence all containers on that node are kata? That seems unfortunate if true.

No, that's not the case at all. A node capable of running Kata Containers should not, and will not, lose its ability to run "vanilla" containers.

@cgwalters, have I addressed all your concerns?

Member:

The default behaviour is to deploy on all available worker nodes.

Sure, seems reasonable. In a compact cluster the control plane nodes are workers too, so this will apply there.

That would simplify our code, indeed. Shall we consider removing the option from the operator?

I don't have a really strong opinion on this but if having admins steer the pods with regular selectors works then why not use that? Less API surface for you to maintain, and feels more native to admins, etc.

Member:

There is scheduling support for runtime classes that we can make use of. The enhancement should mention that.
https://v1-16.docs.kubernetes.io/docs/concepts/containers/runtime-class/#
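For reference, the RuntimeClass scheduling support mentioned above looks roughly like this (a sketch; the handler name, node label, and pod image are illustrative):

```yaml
apiVersion: node.k8s.io/v1beta1    # node.k8s.io/v1 from Kubernetes 1.20 onwards
kind: RuntimeClass
metadata:
  name: kata
handler: kata                      # must match the CRI-O runtime handler name
scheduling:
  nodeSelector:
    kata-enabled: "true"           # illustrative label on kata-capable nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  runtimeClassName: kata           # only this pod runs under kata; others stay on runc
  containers:
  - name: app
    image: quay.io/example/app:latest   # illustrative image
```

With the `scheduling.nodeSelector` set, pods requesting the `kata` RuntimeClass are automatically steered to nodes carrying the label, without the admin adding a nodeSelector to every pod.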

It should be noted that Kata Containers differ from Kubevirt:

- Kubevirt aims to run **VM images** on OpenShift providing the VM with the look and feel of a virtual machine running as a legacy VM (CRD defining a Virtual Machine in Kubernetes)
- Kata Containers aim to run **container images** on OpenShift in an isolated manner using virtualization tools (uses Runtime Class resource to choose Kata only for specific workloads which require kernel level isolation)
Member:

Semi random but there's actually a 3rd class of thing in here, which is "containers that use virt" - what we do with https://github.com/coreos/coreos-assembler/ overlaps a lot with kata; we want to control qemu, not be inside qemu.

Author:

That's interesting, and I wonder whether the minimalist QEMU we plan to use for the extensions could also benefit CoreOS Assembler.

Do you mind if we switch this conversation to Slack at some point soon, to try to sync on the usages and then, maybe, come up with a full document on how we plan to use QEMU as part of RHCOS, and hopefully agree on a minimal QEMU that could meet both your (CoreOS Assembler) and our (Kata Containers) needs?

Member:

Note that the qemu we use to build (and test) RHCOS (and FCOS) is included in the coreos-assembler container; it has nothing to do with the host, which is the opposite of kata. It's closer to KubeVirt in a way except this usage is special in that we're not trying to run production VMs that e.g. expose networking. We don't require any privileges, just KVM available.

I have a blurb on this here https://hackmd.io/8YCMjP_JTzu-vp7PrOjT0g that I need to stick somewhere.

Member:

We don't need a minimal qemu really - our container image is already enormous anyways (2.5G+).

What may be good to share though is e.g. Go bindings for interacting with qemu - I touched on that in coreos/coreos-assembler#1336 (comment)


The Operator will:
- Create a crio drop-in config file via machine config objects;
- Automatically select the payload image that contains the correct version of the kata binaries for the given OpenShift version;
Member:

In some private discussion it turns out there's a lot more detail behind this bit that we'll want to elaborate on. From what I can see I lean towards having the MCO install this extra content as an extension (RPM) too.

This also relates to coreos/fedora-coreos-tracker#354 etc.

Author:

@cgwalters, I've rewritten this PR almost entirely, to cover what we currently do, what we could do, etc.

Please, whenever you have some cycles, take a look at the changes.
Also, please help me loop in all the needed parties who can help decide whether we could go for having the Kata Containers RPMs also as part of the RHCOS extensions (even with those being ~100MB, installed size).

@fidencio (Author) commented Mar 5, 2021

In what started as an unrelated conversation with @cgwalters (https://coreos.slack.com/archives/CK1AE4ZCK/p1614974860056400), we agreed that more discussions have to happen, mainly with regard to how Kata Containers will be delivered to OpenShift customers (as in, whether it'll be part of the Extensions, whether it'll be binaries copied by the Kata Operator, etc.).

Based on that, this PR will be updated accordingly, probably by the beginning of the next week.

@fidencio (Author) commented Mar 5, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2021
- Create a crio drop-in config file via machine config objects;
- Automatically select the payload image that contains the correct version of the kata binaries for the given OpenShift version;
- Configure a `RuntimeClass` called `kata` which can be used to deploy workloads that use the kata runtime;
- Create a machine-config object to enable the `qemu-kiwi` RHCOS extension.
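As a sketch, the machine-config enabling the extension could look like the following; the object name and extension name are illustrative assumptions, not taken from the proposal:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-enable-sandboxed-containers-extension
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker pool
spec:
  extensions:
  - sandboxed-containers   # illustrative extension name
```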
Member:

I'd say instead have single MC object 50-kata-containers-worker or so that does both this and the crio config to make it more atomic. Otherwise there's a race where the MCO might try to roll out just the crio change but not the extension, etc.


#### KataContainers Operator Goals
- Create an API which supports installation of Kata Runtime on all or selected worker nodes
- Configure CRI-O to use Kata Runtime on those worker nodes
Member:

Is there more to this than just the MC?

Author:

@cgwalters, I don't fully understand the question.
"Configure CRI-O to use Kata Runtime on those worker nodes" is just a drop-in file, performed by: https://github.com/openshift/sandboxed-containers-operator/blob/fbbc74e21e2fc4f3d0874dd201513a801a1a333c/controllers/openshift_controller.go#L281-L362

As I'm not sure whether this is what you asked, would you mind rephrasing the question, with more details if possible, so I can try to answer?

(By the way, sorry if I end up missing some details; the person who mostly worked on the Operator side is on paternity leave and I'm taking this over from him.)
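The CRI-O drop-in written by the linked code is conceptually something like the following (a sketch; the file path, binary path, and values are illustrative, not taken from the linked source):

```toml
# /etc/crio/crio.conf.d/50-kata.conf (illustrative path)
[crio.runtime.runtimes.kata]
runtime_path = "/usr/bin/containerd-shim-kata-v2"  # illustrative binary path
runtime_type = "vm"
runtime_root = "/run/vc"
```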

Author:

Oh, wait, I'm wrong; I was thinking only about the RPM side.
"Configure CRI-O to use Kata Runtime" also involves:

- applying RBAC policies;
- defining the CRDs;
- creating the "kata" runtime class;

- Configure CRI-O to use Kata Runtime on those worker nodes
- Install the runtimeClass on the cluster, as well as the required components for the runtime to be controlled by the orchestration layer.
- Update the Kata runtime
- Uninstall Kata Runtime and reconfigure CRI-O to not use it.
Member:

If done via extensions, this would all be handled by MCO reconciliation. See also https://blog.verbum.org/2020/08/22/immutable-%E2%86%92-reprovisionable-anti-hysteresis/

I think more broadly, if everything goes via MCO functionality then e.g. node changes will be fully coordinated. Having more host daemonsets etc. can create synchronization issues.

@fidencio (Author) Mar 5, 2021

When you say "everything goes via MCO functionality", do you mean ... just forget the operator and teach the MCO to deal with Kata Containers' needs entirely?

While I think that's a really good idea, I don't think it's doable in the 4.8 timeframe.

Member:

Ah no, I just mean that the operator installs machineconfigs, rather than reaching out to each node individually. This is a pattern followed by for example the cluster-node-tuning-operator.

Author:

Oh, that we do. :-)

TotalNodesCount int `json:"totalNodesCount"`

// CompletedNodesCounts is the number of nodes that have successfully completed the given operation
CompletedNodesCount int `json:"CompletedNodesCount"`
Member:

So if we fully use the MCO, then all this stuff is covered by the MCO objects, specifically machineconfigpool.

@fidencio (Author) commented Mar 5, 2021

/cc @zanetworker, @bpradipt, @jensfr for visibility

@fidencio (Author) commented Mar 6, 2021

@bpradipt, would you mind taking a careful look at the bits describing the operator? That's one part of the project I barely touched, and I'm not confident the text matches reality.

@fidencio (Author) commented Mar 6, 2021

@zanetworker, I will need your help to ensure this aligns with your vision.

@fidencio (Author) commented Mar 6, 2021

@mrunalp, I would really love your input, mainly as things have changed a lot since the original PR was created.

@jackevans43 commented:
@fidencio I'd love to use this feature to isolate less trustworthy / higher risk (e.g. Internet-facing) workloads - do you think this is worth adding? For these workloads Firecracker would be great; could this be added in the future?

I presume NetworkPolicy is applied on the host, outside of the VM (i.e. it provides a higher level of assurance)?

@fidencio (Author) commented Mar 6, 2021

@fidencio I'd love to use this feature to isolate less trustworthy / higher risk (e.g. Internet-facing) workloads - do you think this is worth adding?

I think this is covered by:

- any other administrative privileges above and beyond what is secure in a
  shared kernel environment (regular runc).

Or would you prefer to have it explicitly stated there?

For these workloads Firecracker would be great; could this be added in the future?

Right now we prefer to focus on the VMM we have a huge team working on and maintaining, which is QEMU.
I'd be really interested in learning about possible use cases where you see Firecracker being usable and QEMU not. That would help the QEMU community improve, help us (as Sandboxed Containers) see whether we can expand what we cover, and help end users make a decision.

I don't think this proposal is the right place to discuss this, but maybe https://github.com/openshift/sandboxed-containers-operator/ could be.

I'd suggest opening an issue there; then we can start the discussion and move it to another (more appropriate?) place if needed.

Does this sound reasonable?

I presume NetworkPolicy is applied on the host, outside of the VM (i.e. it provides a higher level of assurance)?

Please open an issue and we can discuss it there. :-)
By the way, I hope I don't come across as "rude"; I just want to be sure that such an important conversation gets properly tracked and properly replied to (and here it may be confusing for all the parties involved).

@fidencio (Author):
@mrunalp, the last requests have been addressed. Thanks for the review.

Signed-off-by: Ariel Adam <aadam@redhat.com>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
@fidencio fidencio changed the title kata containers enhancement proposal sandboxed containers enhancement proposal Mar 19, 2021
@mrunalp (Member) commented Mar 22, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 22, 2021
@mrunalp mrunalp added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2021
@openshift-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 4beba14 into openshift:master Mar 24, 2021