
sandboxed containers enhancement proposal #677

Merged: 1 commit, Mar 24, 2021

Conversation

@fidencio commented Mar 4, 2021

This supersedes #366, as it provides fixes for the lint.

A new PR has been opened because Jens Freimann, the person handling #366, recently became a father and is out on paternity leave.

@fidencio (Author) commented Mar 4, 2021

/test markdownlint

@fidencio (Author) commented Mar 4, 2021

Closing this one as we were able to unblock #366.

@fidencio (Author) commented Mar 4, 2021

/test markdownlint

1 similar comment
@fidencio (Author) commented Mar 4, 2021

/test markdownlint

@fidencio (Author) commented Mar 5, 2021

/test markdownlint

@cgwalters (Member) left a comment

Overall seems sane to me!

### KataContainers Operator development
The goal is to develop a kata operator that can be used to manage the entire lifecycle of kata components in an OpenShift cluster. We will have a controller that watches for a kata **custom resource (CR)** and a daemon-set that acts upon changes to the CR and does the required work on the worker nodes (install, uninstall, update, ...).

Via the CR it will be possible to select a subset of worker nodes.
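As a hypothetical sketch, a CR selecting a subset of worker nodes could look like the following; the API group, kind, and field names are illustrative, not necessarily the operator's actual API:

```yaml
# Hypothetical sketch of the kata CR; names are illustrative.
apiVersion: kataconfiguration.openshift.io/v1
kind: KataConfig
metadata:
  name: example-kataconfig
spec:
  # Selects the subset of worker nodes the daemon-set acts on;
  # omitting it would target all worker nodes (the default behaviour).
  kataConfigPoolSelector:
    matchLabels:
      custom-kata-pool: "true"
```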
Member:

Why is this out of curiosity? Why not e.g. enable the feature on all available nodes but use regular nodeSelector for steering?

Is it because it fully takes over the crio backend and so hence all containers on that node are kata? That seems unfortunate if true.

@fidencio (Author) Mar 5, 2021

Hey @cgwalters!

Why is this out of curiosity?

The default behaviour is to deploy on all available worker nodes. However, it's been decided to leave the option for the admin to select the nodes they want, if they don't want all the worker nodes using Kata Containers.

Why not e.g. enable the feature on all available nodes but use regular nodeSelector for steering?

That would simplify our code, indeed. Shall we consider removing the option from the operator?

Is it because it fully takes over the crio backend and so hence all containers on that node are kata? That seems unfortunate if true.

No, that's not the case at all. A node capable of running Kata Containers should not, and will not, lose its ability to run "vanilla" containers.

@cgwalters, have I addressed all your concerns?

Member:

The default behaviour is to deploy on all available worker nodes.

Sure, seems reasonable. In a compact cluster the control plane nodes are workers too, so this will apply there.

That would simplify our code, indeed. Shall we consider removing the option from the operator?

I don't have a really strong opinion on this but if having admins steer the pods with regular selectors works then why not use that? Less API surface for you to maintain, and feels more native to admins, etc.

Member:

There is scheduling support for runtime classes that we can make use of. The enhancement should mention that.
https://v1-16.docs.kubernetes.io/docs/concepts/containers/runtime-class/#
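For reference, the RuntimeClass scheduling support mentioned above looks roughly like this (a sketch; the handler name, node label, and pod image are illustrative):

```yaml
apiVersion: node.k8s.io/v1beta1    # node.k8s.io/v1 from Kubernetes 1.20 onwards
kind: RuntimeClass
metadata:
  name: kata
handler: kata                      # must match the CRI-O runtime handler name
scheduling:
  nodeSelector:
    kata-enabled: "true"           # illustrative label on kata-capable nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  runtimeClassName: kata           # only this pod runs under kata; others stay on runc
  containers:
  - name: app
    image: quay.io/example/app:latest   # illustrative image
```

With the `scheduling.nodeSelector` set, pods requesting the `kata` RuntimeClass are automatically steered to nodes carrying the label, without the admin adding a nodeSelector to every pod.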

It should be noted that Kata Containers differ from Kubevirt:

- Kubevirt aims to run **VM images** on OpenShift providing the VM with the look and feel of a virtual machine running as a legacy VM (CRD defining a Virtual Machine in Kubernetes)
- Kata Containers aim to run **container images** on OpenShift in an isolated manner using virtualization tools (uses Runtime Class resource to choose Kata only for specific workloads which require kernel level isolation)
Member:

Semi random but there's actually a 3rd class of thing in here, which is "containers that use virt" - what we do with https://github.com/coreos/coreos-assembler/ overlaps a lot with kata; we want to control qemu, not be inside qemu.

Author:

That's interesting, and I wonder whether the minimalist QEMU we plan to use for the extensions could also benefit CoreOS Assembler.

Do you mind if we switch this conversation to Slack at some point soon, to try to sync on the usages and then, maybe, come up with a full document on how we plan to use QEMU as part of RHCOS, and hopefully agree on a minimal QEMU that could meet both your (CoreOS Assembler) and our (Kata Containers) needs?

Member:

Note that the qemu we use to build (and test) RHCOS (and FCOS) is included in the coreos-assembler container; it has nothing to do with the host, which is the opposite of kata. It's closer to KubeVirt in a way except this usage is special in that we're not trying to run production VMs that e.g. expose networking. We don't require any privileges, just KVM available.

I have a blurb on this here https://hackmd.io/8YCMjP_JTzu-vp7PrOjT0g that I need to stick somewhere.

Member:

We don't need a minimal qemu really - our container image is already enormous anyways (2.5G+).

What may be good to share though is e.g. Go bindings for interacting with qemu - I touched on that in coreos/coreos-assembler#1336 (comment)


The Operator will:
- Create a crio drop-in config file via machine config objects;
- Automatically select the payload image that contains the correct version of the kata binaries for the given OpenShift version;
Member:

In some private discussion it turns out there's a lot more detail behind this bit that we'll want to elaborate on. From what I can see I lean towards having the MCO install this extra content as an extension (RPM) too.

This also relates to coreos/fedora-coreos-tracker#354 etc.

Author:

@cgwalters, I've rewritten this PR almost entirely, to cover what we currently do, what we could do, etc.

Please, whenever you have some cycles, take a look at the changes.
Also, please help me loop in all the needed parties who can help decide whether we could go for having the Kata Containers RPMs also as part of the RHCOS extensions (even with those being ~100MB, installed size).

@fidencio (Author) commented Mar 5, 2021

In what started as an unrelated conversation with @cgwalters (https://coreos.slack.com/archives/CK1AE4ZCK/p1614974860056400), we agreed that more discussions have to happen, mainly with regard to how Kata Containers will be delivered to OpenShift customers (as in, whether it'll be part of the Extensions, whether it'll be binaries copied by the Kata Operator, etc.).

Based on that, this PR will be updated accordingly, probably by the beginning of the next week.

@fidencio (Author) commented Mar 5, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2021
- Create a crio drop-in config file via machine config objects;
- Automatically select the payload image that contains the correct version of the kata binaries for the given OpenShift version;
- Configure a `RuntimeClass` called `kata` which can be used to deploy workloads that use the kata runtime;
- Create a machine-config object to enable the `qemu-kiwi` RHCOS extension.
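As a sketch, the machine-config enabling the extension could look like the following; the object name and extension name are illustrative assumptions, not taken from the proposal:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-enable-sandboxed-containers-extension
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker pool
spec:
  extensions:
  - sandboxed-containers   # illustrative extension name
```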
Member:

I'd say instead have single MC object 50-kata-containers-worker or so that does both this and the crio config to make it more atomic. Otherwise there's a race where the MCO might try to roll out just the crio change but not the extension, etc.


#### KataContainers Operator Goals
- Create an API which supports installation of Kata Runtime on all or selected worker nodes
- Configure CRI-O to use Kata Runtime on those worker nodes
Member:

Is there more to this than just the MC?

Author:

@cgwalters, I don't fully understand the question.
"Configure CRI-O to use Kata Runtime on those worker nodes" is just a drop-in file, performed by: https://github.com/openshift/sandboxed-containers-operator/blob/fbbc74e21e2fc4f3d0874dd201513a801a1a333c/controllers/openshift_controller.go#L281-L362

As I'm not sure whether this is what you asked, would you mind rephrasing the question, with more details if possible, so I can try to answer?

(By the way, sorry if I end up missing some details; the person who mostly worked on the Operator side is on paternity leave and I'm taking this over from him.)
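The CRI-O drop-in written by the linked code is conceptually something like the following (a sketch; the file path, binary path, and values are illustrative, not taken from the linked source):

```toml
# /etc/crio/crio.conf.d/50-kata.conf (illustrative path)
[crio.runtime.runtimes.kata]
runtime_path = "/usr/bin/containerd-shim-kata-v2"  # illustrative binary path
runtime_type = "vm"
runtime_root = "/run/vc"
```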

Author:

Oh, wait, I'm wrong; I was thinking only about the RPM side.
"Configure CRI-O to use Kata Runtime" also involves:

- applying RBAC policies;
- defining the CRDs;
- creating the "kata" runtime class;

- Configure CRI-O to use Kata Runtime on those worker nodes
- Install the runtimeClass on the cluster, as well as the required components for the runtime to be controlled by the orchestration layer.
- Update the Kata runtime
- Uninstall Kata Runtime and reconfigure CRI-O to not use it.
Member:

If done via extensions, this would all be handled by MCO reconciliation. See also https://blog.verbum.org/2020/08/22/immutable-%E2%86%92-reprovisionable-anti-hysteresis/

I think more broadly, if everything goes via MCO functionality then e.g. node changes will be fully coordinated. Having more host daemonsets etc. can create synchronization issues.

@fidencio (Author) Mar 5, 2021

When you say "everything goes via MCO functionality", do you mean ... just forget the operator and teach the MCO to deal with Kata Containers' needs entirely?

While I think that's a really good idea, I don't think it's doable in the 4.8 timeframe.

Member:

Ah no, I just mean that the operator installs machineconfigs, rather than reaching out to each node individually. This is a pattern followed by for example the cluster-node-tuning-operator.

Author:

Oh, that we do. :-)

TotalNodesCount int `json:"totalNodesCount"`

// CompletedNodesCounts is the number of nodes that have successfully completed the given operation
CompletedNodesCount int `json:"CompletedNodesCount"`
Member:

So if we fully use the MCO, then all this stuff is covered by the MCO objects, specifically machineconfigpool.

@fidencio (Author) commented Mar 5, 2021

/cc @zanetworker, @bpradipt, @jensfr for visibility

@fidencio (Author) commented Mar 6, 2021

@bpradipt, would you mind taking a careful look at the bits describing the operator? That's one part of the project I barely touched, and I'm not confident the text matches reality.

@fidencio (Author) commented Mar 6, 2021

@zanetworker, I will need your help to ensure this aligns with your vision.

@fidencio (Author) commented Mar 6, 2021

@mrunalp, I would really love your input, mainly as things have changed a lot since the original PR was created.

@jackevans43 commented:
@fidencio I'd love to use this feature to isolate less trustworthy / higher risk (e.g. Internet-facing) workloads - do you think this is worth adding? For these workloads Firecracker would be great; could this be added in the future?

I presume NetworkPolicy is applied on the host, outside of the VM (i.e. it provides a higher level of assurance)?

@fidencio (Author) commented Mar 6, 2021

@fidencio I'd love to use this feature to isolate less trustworthy / higher risk (e.g. Internet-facing) workloads - do you think this is worth adding?

I think this is covered by:

- any other administrative privileges above and beyond what is secure in a
  shared kernel environment (regular runc).

Or would you prefer to have it explicitly stated there?

For these workloads Firecracker would be great; could this be added in the future?

Right now we prefer to focus on the VMM we have a huge team working on and maintaining, which is QEMU.
I'd be really interested in learning about possible use cases where you see Firecracker being usable and QEMU not. That would help the QEMU community improve, help us (as Sandboxed Containers) see whether we can expand what we cover, and help end users make a decision.

I don't think this proposal is the right place to discuss this, but maybe https://github.com/openshift/sandboxed-containers-operator/ could be.

I'd suggest opening an issue there; then we can start the discussion and move it to another (more appropriate?) place if needed.

Does this sound reasonable?

I presume NetworkPolicy is applied on the host, outside of the VM (i.e. it provides a higher level of assurance)?

Please open an issue and we can discuss it there. :-)
By the way, I hope I don't come across as "rude"; I just want to be sure that such an important conversation gets properly tracked and properly replied to (and here it may be confusing for all the parties involved).

@fidencio (Author):
@mrunalp, the last requests have been addressed. Thanks for the review.

Signed-off-by: Ariel Adam <aadam@redhat.com>
Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
@fidencio fidencio changed the title kata containers enhancement proposal sandboxed containers enhancement proposal Mar 19, 2021
@mrunalp (Member) commented Mar 22, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 22, 2021
@mrunalp mrunalp added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2021
@openshift-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 4beba14 into openshift:master Mar 24, 2021