diff --git a/keps/prod-readiness/sig-storage/2857.yaml b/keps/prod-readiness/sig-storage/2857.yaml new file mode 100644 index 000000000000..6858fd9dfd78 --- /dev/null +++ b/keps/prod-readiness/sig-storage/2857.yaml @@ -0,0 +1,3 @@ +kep-number: 2857 +alpha: + approver: "" diff --git a/keps/sig-storage/2857-runtime-assisted-pv-mounts/README.md b/keps/sig-storage/2857-runtime-assisted-pv-mounts/README.md new file mode 100644 index 000000000000..a5138ea4bc11 --- /dev/null +++ b/keps/sig-storage/2857-runtime-assisted-pv-mounts/README.md @@ -0,0 +1,1319 @@ + +# KEP-2857: Runtime Assisted Mounting of Persistent Volumes + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

Certain container runtimes (e.g. hypervisor-based runtimes like Kata) may prefer
to manage the file system mount process associated with persistent volumes
consumed by containers in a pod. Deferring the file system mount to a container
runtime - in coordination with a CSI plugin - is desirable in scenarios where
strong isolation between pods is critical but using Raw Block mode and changing
existing workloads to perform the filesystem mount is burdensome. This KEP
proposes a set of enhancements to enable coordination around mounting and
management of the file system on persistent volumes between a CSI plugin and a
container runtime.

## Motivation

This KEP is inspired by [design proposals](https://github.com/kata-containers/kata-containers/pull/1568/files#diff-84b39825c8b74cc8c274098d9d8274ac82380041b0830e5c2c2ddc41b76f9cd3) in the Kata community to avoid mounting
the file system of a PV in the host while bringing up a pod (in a guest sandbox)
that mounts the PV. Initial prototypes focused on enabling runtime assisted
mounting within the guest sandbox by modifying CSI plugins (to be enlightened to
skip the filesystem mount at the host) and transferring the mount details to the
Kata runtime through a metadata file in the CSI publish path. However, the
authenticity of the metadata presented by a CSI plugin [could not be established](https://github.com/kata-containers/kata-containers/pull/1568/files#r649326008)
without further context. This led to the need for a more authoritative source
like the Kubelet providing the mount details through a general API path that can
be utilized by all CRI/OCI container runtimes if they wish to control and manage
the file system mount of volumes within the pod sandbox.

### Goals

- Enable a CSI plugin and a container runtime to coordinate publishing of
persistent storage to workloads by deferring the file system mount to the
runtime.
- Assist a CSI plugin and a container runtime with coordination of management
operations that involve the file system (e.g. collection of file system stats
and expanding the file system).
- Allow a CSI plugin to operate in normal mode (without deferring mounting and
management of the file system to a container runtime) when publishing volumes to
pods backed by container runtimes that do not support deferred mounts.

### Non-Goals

- Validate and enforce compatibility between CSI plugin versions and runtime
versions around support for runtime assisted mounts.
- Enable a CSI node plugin to be launched within a special runtime class
(e.g. Kata). It is expected that CSI node plugin pods launch using a
“default” runtime (like runc) and are not restricted from performing privileged
operations on the host OS.

### Existing Solutions and Gaps

- A pod using a microvm runtime (like Kata) can mount PVs backed by block
devices as a raw block device to ensure the file system mount is managed within
the pod sandbox environment. However, this approach has the following shortcomings:
  - Every workload pod that mounts PVs backed by block devices needs to mount
  the file system (on the raw block PV). This requires modifications to the
  workload container logic. Therefore, this is not a general-purpose solution.
  - File system management operations like online expansion of the file system
  and reporting of FS stats cannot be performed at an infrastructural level if
  the file system mounts are managed by workload pods.
  - Pods that mount PVs backed by a shared FS like NFS or CIFS cannot use raw
  block devices.
- A pod using a microvm runtime (like Kata) may use a filesystem (e.g. virtio-fs)
to project the file-system on-disk (and mounted on the host) to the guest
environment. While this solves several use-cases, it is not desired due to the
following factors:
  - Native file-system features: The workload may wish to use specific features
  and system calls supported by an on-disk file system (e.g. ext4/xfs) mounted
  on the host. However, plumbing some of these features/system calls/file
  controls through the intermediate “projection” file system (e.g. virtio-fs)
  and its dependent framework (e.g. FUSE) may require extra work and is therefore
  not supported until implemented and tested. Some examples of this
  incompatibility are: [open_by_handle_at](https://gitlab.com/virtio-fs/qemu/-/issues/10), [xattr](https://gitlab.com/virtio-fs/qemu/-/issues/15), [F_SETLKW](https://gitlab.com/virtio-fs/qemu/-/issues/9), fscrypt.
  - Safety and isolation: A compromised or malicious workload may try to exploit
  vulnerabilities in the file system of persistent volumes. Due to the inherently
  complex nature of file system modules running in kernel mode, they present a
  larger attack surface relative to simpler/lower level modules in the
  disk/block stack. Therefore, it is safer to mount the file system within an
  isolated sandbox environment (like a guest VM) rather than in the host
  environment, as the blast radius of a file system kernel mode panic will be
  isolated to the sandbox rather than affecting the entire host.
  - Performance: As pointed out [here](https://archive.fosdem.org/2020/schedule/event/vai_virtio_fs/attachments/slides/3666/export/events/attachments/vai_virtio_fs/slides/3666/virtio_fs_A_Shared_File_System_for_Virtual_Machines_FOSDEM.pdf) (slide 6), in a microvm environment, a block
  interface (such as virtio-blk) provides a faster path to sharing data between a
  guest and host relative to a file-system interface.

## Proposal

At a high level, coordination between a CSI plugin and a container runtime
around mounting and management of the file system on persistent volumes can be
accomplished in various ways. This section provides an overview of the
primary enhancements detailed in this KEP. Various alternatives are listed in
the Alternatives section further below. Details of the new APIs outlined below
are specified in the Design Details section.

### Support for Runtime Assisted Mounts of File System on Persistent Volumes

The ability of a CSI plugin to defer execution of the file system mount on persistent
volumes to the container runtime is a basic requirement in the context
of this KEP. The enhancements necessary to support this are:
- A new field in RuntimeClass to indicate a handler/runtime can support mount
deferral.
- A new field in CSI `NodePublishVolumeRequest` for the Kubelet to indicate
to the CSI plugin that the pod (that a PV is published to) is associated with a
runtime that supports mounting of the filesystem associated with the PV.
- New fields in CSI `NodePublishVolumeResponse` for a CSI plugin to specify file
system mount parameters that the Kubelet should pass to the container runtime.
- Enhancements to the `Mount` message in CRI to allow specification of a source
that may be a block device or path to a shared file system (with no host path)
along with the file system type and options to use for mounting the source in the
runtime's sandbox environment.
- Enhancements to the Kubelet to populate the new field in CSI `NodePublishVolumeRequest`,
extract mount details from CSI `NodePublishVolumeResponse` and populate them in
the new fields in the CRI `Mount` message.
- Enhancements to CRI runtimes (like containerd/CRI-O) to populate the OCI spec
passed to the OCI runtime based on file system mount details in the CRI `Mount`
message.
- Enhancements to OCI runtimes to support mounting of the file system with the
specified mount options based on the OCI spec (if not already present).
- Enhancements to CSI plugins to inspect the new field in `NodePublishVolumeRequest`
and defer file system mounts to a container runtime by populating the new
fields in CSI `NodePublishVolumeResponse`. If the CSI plugin supports CSI
`NodeStageVolume`, it will need to unmount the filesystem if it had mounted it at
a global staging path on the host.

### Support for Runtime Assisted Management of File System on Persistent Volumes

The ability of a CSI plugin to defer execution of file system management
operations on persistent volumes to the container runtime handler is an optional
requirement in the context of this KEP. Based on the current CSI spec, file
system management operations like querying stats and condition of the file system
and expanding the file system (as part of online expansion) are optional
capabilities of the CSI plugin.
Similarly, the ability to handle deferred file
system management operations will be an optional capability of a container runtime
handler that supports deferred mounting of file systems on a persistent volume.
If a container runtime handler does not support deferred mounting of file systems
on a persistent volume, it is not expected to support any deferred file system
management operations since the runtime will not own a file system mount to manage.

A new CSI service, `Runtime`, is proposed to coordinate the file system management
operations with a container runtime handler. The `Runtime` APIs are expected to be
invoked by the Kubelet over a unix domain socket surfaced by the container runtime
handler and specified as a new field in `RuntimeClass` (detailed below). In a
non-Kubernetes context, it is up to the container orchestrator to determine how to
interact with the container runtime (or container runtime handler). Relevant
parameters for the APIs will be provided to the Kubelet by CSI plugins through
CSI API responses unless the Kubelet is already aware of them through other APIs
like CRI. A detailed specification of the `Runtime` service APIs is provided in
the Design Details section below. Generally, the `Runtime` APIs are expected to be
closely aligned with the CSI `Node` APIs. Therefore, the `Runtime` APIs are proposed
as an extension to the CSI spec rather than published and maintained as an
independent API.

Direct invocation of the `Runtime` service APIs by CSI plugins (without involving
the Kubelet) is not considered as it introduces certain practical limitations:
the domain socket path for invoking file system management operations surfaced
by a container runtime handler will be different across:
[1] container runtime handlers
[2] clusters with the same runtime handler, depending on installation and
configuration options supported by the runtime handler.
As a result, bind mounts for the domain socket paths within the CSI node plugin
pod configurations cannot be pre-determined. Therefore, it is preferable to have
the Kubelet (running on the host) invoke filesystem management operations on a
container runtime handler over a unix domain socket path specified in the
RuntimeClass (as outlined below and described in more detail in the Design Details
section).

The enhancements necessary to support execution of file system management
operations by a container runtime are:
- New field in `RuntimeClass` to specify a unix domain socket path over which the
Kubelet can invoke FileSystem Management APIs that will be handled by the
container runtime.
- New field in CSI `NodeGetVolumeStatsRequest` and `NodeExpandVolumeRequest` for the
Kubelet to indicate to a CSI plugin that the pod (that a PV is published to) is
associated with a runtime that supports management of the filesystem associated
with the PV.
- New field in CSI `NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse` for
a CSI plugin to indicate to the Kubelet that it should pass the request to the
container runtime by specifying a `source` field corresponding to the volume
(e.g. the host path for a block device or the server export path for a shared
file system).
- A new CSI service, `Runtime`, initially with two RPCs: `RuntimeGetFileSystemStats`
and `RuntimeExpandVolume` (corresponding to CSI `NodeGetVolumeStats` and
`NodeExpandVolume`).
- Enhancements to the Kubelet to populate the new field in CSI `NodeGetVolumeStatsRequest`
and `NodeExpandVolumeRequest` (based on the `RuntimeClass` field), extract the new
`source` field from `NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse` and
invoke the corresponding new `Runtime` APIs on the container runtime handler.
- Enhancements to OCI runtimes to support the CSI `Runtime` API and surface a unix
domain socket over which the API will be invoked.
- Enhancements to CSI plugins to inspect the new field in CSI
`NodeGetVolumeStatsRequest` and `NodeExpandVolumeRequest` and specify deferral of
stats and online file system expansion by populating the new `source` field
corresponding to the volume in CSI `NodeGetVolumeStatsResponse` and
`NodeExpandVolumeResponse`. The desired size of the file system will also need to
be populated by the CSI plugin in `NodeExpandVolumeResponse` (if it doesn't already
populate it).

### User Stories (Optional)

#### Story 1

A pod specifying a microvm runtime (that is enabled to handle mounting and management
of file system operations) and persistent volumes hosted on a block
device managed by a CSI plugin (that can coordinate file system operations with
the runtime handler) is scheduled on a node. Here is the sequence of steps that
takes place in the context of PVs:
- The node CSI plugin receives a CSI `NodeStageVolume` call. The CSI plugin will
stage the volume (ensuring it is formatted) and mount the filesystem
(associated with the PV) in the node host OS environment. This staging happens
irrespective of the runtime's capabilities indicated in the subsequent CSI
`NodePublishVolume` call.
- The node CSI plugin receives a CSI `NodePublishVolume` call. In response, the
CSI plugin will unmount the file system (if mounted as part of staging) and pass
the block device path (rather than a file system mount point)
along with the file-system type and mount options to the Kubelet.
- The Kubelet passes the block device, file system type and mount options (from
the CSI plugin) to the microvm runtime through the CRI `Mount` and OCI `mount` fields.
- The microvm runtime attaches the block device to the sandbox environment.
- The microvm runtime mounts the filesystem on the block device (using the
specified file system and mount options passed from the CSI plugin via the
Kubelet) within the sandbox environment.
- The microvm runtime makes the filesystem mount point available to a
container at the path specified in the pod spec.

While the pod runs:
- The node CSI plugin receives a CSI `NodeGetVolumeStats` call. In response, the CSI
plugin passes the block device path on the host (corresponding to the volume) to
the Kubelet. Based on the response, the Kubelet invokes `RuntimeGetFileSystemStats`
on the microvm runtime's unix domain socket for handling CSI Runtime APIs and
processes the response (a sketch of this interaction follows this list).
- The node CSI plugin receives a CSI `NodeExpandVolume` call (after the disk backing
the volume is expanded). In response, the CSI plugin passes the block device
path on the host (corresponding to the volume) to the Kubelet. Based on the
response, the Kubelet invokes `RuntimeExpandVolume` on the microvm runtime's unix
domain socket for handling CSI Runtime APIs.
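
The following is a minimal, hypothetical sketch (in Go) of the stats interaction
described in the list above: the Kubelet issues `NodeGetVolumeStats` and, if the
CSI plugin returns a `source` instead of usage data, it forwards the request to
the runtime handler's `RuntimeGetFileSystemStats`. The type and function names
(`nodeGetVolumeStatsResponse`, `collectVolumeStats`, `fakeRuntime`) and the
example values are illustrative stand-ins, not existing Kubelet or CSI code.

```
// Hypothetical sketch only: stand-in types for the proposed CSI and Runtime
// messages, used to illustrate the deferral decision made by the Kubelet.
package main

import "fmt"

// Stand-in for the proposed CSI NodeGetVolumeStatsResponse fields.
type nodeGetVolumeStatsResponse struct {
	UsageBytes int64  // populated by the SP when it handles stats itself
	Source     string // populated by the SP when stats are deferred to the runtime
}

// Stand-in for the proposed Runtime service exposed over the RuntimeClass
// FSMgmtSocket (RuntimeGetFileSystemStats in this KEP).
type runtimeFSClient interface {
	RuntimeGetFileSystemStats(volumeSource, sandboxID string) (int64, error)
}

// collectVolumeStats mirrors the flow in Story 1: if the SP returned a source,
// the Kubelet forwards the request to the runtime handler owning the mount.
func collectVolumeStats(resp nodeGetVolumeStatsResponse, rt runtimeFSClient, sandboxID string) (int64, error) {
	if resp.Source == "" {
		// The SP handled stats itself; use its answer directly.
		return resp.UsageBytes, nil
	}
	// Stats were deferred: ask the runtime handler for the pod sandbox.
	return rt.RuntimeGetFileSystemStats(resp.Source, sandboxID)
}

type fakeRuntime struct{}

func (fakeRuntime) RuntimeGetFileSystemStats(src, sandbox string) (int64, error) {
	return 1 << 30, nil // pretend 1 GiB is used inside the sandbox
}

func main() {
	deferred := nodeGetVolumeStatsResponse{Source: "/dev/sdf"}
	used, err := collectVolumeStats(deferred, fakeRuntime{}, "sandbox-1234")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bytes used (reported by runtime):", used)
}
```
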

When the pod terminates:
- The runtime unmounts the filesystem on the block device and detaches the
block device from the sandbox environment.
- The node CSI plugin receives CSI `NodeUnpublishVolume` and `NodeUnstageVolume`
calls and cleans up any state but skips unmounting the file system on the PV.

#### Story 2

A pod specifying a microvm runtime (that is enabled to handle mounting and management
of file system operations) and persistent volumes backed by a shared FS
(e.g. NFS) and managed by a CSI plugin (that can [a] coordinate certain actions
with the microvm runtime and [b] specifies lack of support for Stage CSI calls)
is scheduled on a node. Here is the sequence of steps that takes place in the
context of PVs:
- The node CSI plugin receives a CSI `NodePublishVolume` call. The CSI plugin skips
staging and mounting the shared FS on the host. In the response to CSI
`NodePublishVolume`, the CSI plugin passes the server, exported path and mount
options to the Kubelet.
- The Kubelet passes the server, exported path and mount options (from the CSI
plugin) to the microvm runtime through the CRI `Mount` and OCI `mount` fields.
- The microvm runtime mounts the remote path within the sandbox environment.
- The microvm runtime makes the mountpoint available to a container at the
location specified in the pod spec.

While the pod runs:
- The node CSI plugin receives a CSI `NodeGetVolumeStats` call. In response, the CSI
plugin passes the shared FS export path (corresponding to the volume) to the
Kubelet. Based on the response, the Kubelet invokes `RuntimeGetFileSystemStats`
on the microvm runtime's unix domain socket for handling CSI Runtime APIs.

When the pod terminates:
- The runtime unmounts the filesystem in the sandbox environment.
- The node CSI plugin receives a CSI `NodeUnpublishVolume` call and cleans up any
state but skips unmounting the file system on the PV.

### Notes/Constraints/Caveats (Optional)

Deferring filesystem mounts to the runtime may require restricted PVC access
modes. Specifically, ReadWriteOncePod will need to be specified on PVCs to
prevent multiple pods from attempting to mount the same PVC when scheduled on the
same node if the PVC is bound to a block based PV with an XFS or ext4 file system.
Such restrictions, however, are not necessary if the filesystem to be mounted
can support parallel mounts.

In the context of microvm runtimes, PVCs specifying access modes that allow
multiple writers, like ReadWriteOnce (which allows multiple pods on the same
node to mount a PV read/write) and ReadWriteMany, should be used with
caution depending on the isolation goals. Multiple writer access modes (within
the same or distinct nodes) allow a compromised/malicious pod to write data
into a shared volume that may affect: [1] other pods reading the data through
corrupt/fuzzed data and [2] other writers through denial of service attacks like
inode exhaustion on the volume. As a result, goals around isolation between pods
may be compromised. A cluster admin can configure webhook/OPA policies to
restrict the set of access modes that can be specified on PVCs that are referred
to by pods associated with a microvm runtime.

A pod typically maps to the isolated sandbox environment in the context of
microvm runtimes. Individual containers in the pod within the sandbox are not
expected to be isolated with the same guarantees that exist across pods.
To align with these isolation goals, restrictions around the ability of multiple
containers within a pod mounting the same PV should not be necessary and are
considered beyond the scope of this KEP.

### Risks and Mitigations

## Design Details

### Enhancements to existing APIs

As summarized in the Proposal section above, coordination of mounting and
management of the filesystem between a CSI plugin and a container runtime requires
enhancements to multiple APIs and components of a Kubernetes cluster. This section
delves into the details of each enhancement or addition.

#### Enhancements in RuntimeClass

New fields are necessary in RuntimeClass to: [1] indicate that a handler/runtime
can support mounting and management of file systems on PVs mounted by a pod and
[2] specify a domain socket path which the Kubelet can use to invoke the FileSystem
Management APIs (described later) that will be handled by the container runtime.

```
type RuntimeClass struct {
	metav1.TypeMeta `json:",inline"`
	...
	// SupportsFSMounts indicates whether the Handler supports mounting of
	// the file system for persistent volumes referred to by the pod.
	// Defaults to false.
	SupportsFSMounts bool `json:"supportsFSMounts,omitempty" protobuf:"bytes,5,opt,name=supportsFSMounts"`

	// FSMgmtSocket specifies an absolute path on the host to a UNIX socket surfaced
	// by the Handler over which FileSystem Management APIs can be invoked. Should
	// not be specified if the Handler does not support FileSystem Management APIs.
	// +optional
	FSMgmtSocket *string `json:"fsMgmtSocket,omitempty" protobuf:"bytes,6,opt,name=fsMgmtSocket"`
}
```

During the Alpha phase, the above fields can be introduced as annotations on
RuntimeClass.

The Kubelet will evaluate these fields and indicate to the CSI plugin (as
described in the following sections) that deferral of mounting and management of
the file system on a PersistentVolume to the container runtime handler is
supported.

#### Enhancements in CSI API Requests

New fields are necessary in the CSI node API requests `NodePublishVolumeRequest`,
`NodeGetVolumeStatsRequest` and `NodeExpandVolumeRequest` for the Kubelet (or another
Container Orchestrator from the orchestrator agnostic CSI perspective) to indicate
to a CSI plugin that the pod (that a PV is to be published to) is associated with
a runtime that supports mounting and management of the filesystem associated with
the PV. Based on these, a CSI plugin can decide whether to defer handling of the
API to the runtime and populate the appropriate fields in the corresponding
CSI API responses.

Enhancements for `NodePublishVolumeRequest`: new field `runtime_supports_mount`
```
message NodePublishVolumeRequest {
  // The ID of the volume to publish. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates the Container Runtime supports mounting a file system.
  // This field is OPTIONAL.
  bool runtime_supports_mount = 6 [(alpha_field) = true];
}
```

Enhancements for `NodeGetVolumeStatsRequest`: new field `runtime_supports_stats`
```
message NodeGetVolumeStatsRequest {
  // The ID of the volume. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates the Container Runtime supports reporting stats and
  // the condition of the file system on the volume.
  // This field is OPTIONAL.
  bool runtime_supports_stats = 4 [(alpha_field) = true];
}
```

Enhancements for `NodeExpandVolumeRequest`: new field `runtime_supports_expand`
```
message NodeExpandVolumeRequest {
  // The ID of the volume to expand. This field is REQUIRED.
  string volume_id = 1;
  ...
  // Indicates the Container Runtime supports expanding the file system
  // on the volume.
  // This field is OPTIONAL.
  bool runtime_supports_expand = 7 [(alpha_field) = true];
}
```

#### Enhancements in CSI API Responses

Corresponding to the above enhancements in CSI API Requests, new fields are
necessary in the CSI node API responses `NodePublishVolumeResponse`,
`NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse`. The CSI plugin needs to
indicate to the Kubelet (or another Container Orchestrator from the orchestrator
agnostic CSI perspective) that the runtime (associated with the pod mounting the
PV) should be involved in processing of the API along with relevant parameters
to identify the volume and process the API call.

Enhancements for `NodePublishVolumeResponse`: new optional field `runtime_mount_info`
specifying details of how and what to mount on a block device or network share.
```
message FileSystemMountInfo {
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of executing the mount.
  // This field is REQUIRED.
  string source = 1 [(alpha_field) = true];
  // Type of the filesystem to mount (e.g. xfs, ext4, nfs, ntfs) on the specified
  // source.
  // This field is REQUIRED.
  string type = 2 [(alpha_field) = true];
  // Mount options supported by the filesystem to be used for the specified source.
  // This field is OPTIONAL.
  map<string, string> options = 3 [(alpha_field) = true];
}

message NodePublishVolumeResponse {
  // Specifies details of how to mount a file system on a source device or
  // network share when the SP defers file system mounts to a container runtime.
  // A SP MUST populate this if runtime_supports_mount was set in
  // NodePublishVolumeRequest and the SP is capable of deferring the filesystem
  // mount to the container runtime.
  // This field is OPTIONAL.
  FileSystemMountInfo runtime_mount_info = 1 [(alpha_field) = true];
}
```

In the case of a block device backed PV, fields in NodePublishVolumeResponse would contain:
- `source` set to the name of the block device that the runtime should mount.
Example: /dev/sdf
- `options` set to filesystem specific options to be passed to mount.
Example: nobarrier
- `type` set to the filesystem to use for mounting. Example: xfs or ext4 for
Linux, ntfs for Windows

In the case of an NFS backed PV, fields in NodePublishVolumeResponse would contain:
- `source` set to the NFS server and exported path that the runtime should mount.
Example: srv1.net:/exported/path
- `options` set to the NFS client options that need to be passed to mount.nfs
- `type` set to nfs

Enhancements for `NodeGetVolumeStatsResponse`: new optional field `source` which,
if populated, will be passed by the Kubelet (or another Container Orchestrator) to
the container runtime to retrieve stats associated with the file system and its
condition.
```
message NodeGetVolumeStatsResponse {
  ...
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
  // is deferred to a container runtime that is capable of reporting stats for
  // filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supports_stats was set in
  // NodeGetVolumeStatsRequest and the SP is capable of deferring filesystem mount
  // and stats requests to the container runtime.
  // SP MUST NOT populate the `usage` and `volume_condition` fields if source is
  // specified, indicating deferral of stats to the container runtime.
  string source = 3 [(alpha_field) = true];
}
```

Enhancements for `NodeExpandVolumeResponse`: new optional field `source` which,
if populated, along with the existing field `capacity_bytes`, will be passed by
the Kubelet (or another Container Orchestrator) to the container runtime to expand
the file system.
```
message NodeExpandVolumeResponse {
  ...
  // Source device (e.g. /dev/sdX) whose mount is deferred to a container runtime
  // that is capable of expanding the filesystems mounted by it.
  // This field is OPTIONAL.
  // It SHOULD be populated by a SP if runtime_supports_expand was set in
  // NodeExpandVolumeRequest and the SP is capable of deferring filesystem mount
  // and expand requests to the container runtime.
  // SP MUST populate the `capacity_bytes` field with the desired capacity if source
  // is specified, indicating deferral of expansion to the container runtime.
  string source = 2 [(alpha_field) = true];
}
```

#### Enhancements in CRI Mount message

The CRI `Mount` message (passed as part of `ContainerConfig` in
`CreateContainerRequest`) needs to be enhanced to pass the file system type and
mount options to be used by a container runtime as part of executing the deferred
mount from a CSI plugin.

The name of the existing `host_path` field is changed to a more generic `source`.
This aligns better with the goal of supporting runtime assisted mounting for
block devices from the host as well as remote file systems, which may not have a
path on the host as they can be directly mounted by the container runtime inside
the sandbox. The same reasoning applies for not adding the `type` and `options`
fields to the `Device` message, which is used to pass raw block device information
to the container runtime.

```
// Mount specifies a host volume to mount into a container.
message Mount {
  ...
  // Source can be a path in the host (for regular bind-mount
  // scenarios), a block device, or an NFS or CIFS server with
  // an exported path (for deferred mount scenarios). If the
  // source path on the host doesn't exist, then runtimes should
  // report an error. If the source is a symbolic link, runtimes
  // should follow the symlink and mount the real destination
  // to the container.
  string source = 2;
  ...
  // Type of the filesystem to mount: either bind, or xfs, ext4,
  // nfs, ntfs, etc. for deferred mount scenarios.
  string type = 6;
  // Mount options corresponding to the filesystem to be mounted.
  map<string, string> options = 7;
}
```

The new CRI fields `type` and `options` will be populated by the Kubelet (based on
`NodePublishVolumeResponse` from the CSI plugin). Entries in these fields will be
passed by a CRI runtime to the OCI runtime using the OCI [mounts](https://github.com/opencontainers/runtime-spec/blob/master/config.md#mounts) field.

Note that the OCI spec already allows the `type` and `options` fields to be
specified as part of the `mounts` field. Therefore, no enhancements should be
necessary in the OCI spec. OCI runtimes supporting mount deferral need to be
able to execute the filesystem mount with the details specified in `mounts`.
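
As a rough illustration of the mapping described above, the following hypothetical
Go sketch shows how a Kubelet-side helper could translate the proposed
`FileSystemMountInfo` returned in `NodePublishVolumeResponse` into the enhanced
CRI `Mount` fields. The struct and function names (`fileSystemMountInfo`,
`criMount`, `toCRIMount`) are local stand-ins for illustration only, not actual
generated API types.

```
// Hypothetical sketch only: minimal stand-in types showing how the CSI mount
// details could be carried into the CRI Mount message by the Kubelet.
package main

import "fmt"

// Stand-in for the proposed CSI FileSystemMountInfo message.
type fileSystemMountInfo struct {
	Source  string            // e.g. /dev/sdf or srv1.net:/exported/path
	Type    string            // e.g. xfs, ext4, nfs
	Options map[string]string // filesystem-specific mount options
}

// Stand-in for the enhanced CRI Mount message.
type criMount struct {
	ContainerPath string
	Source        string
	Type          string
	Options       map[string]string
}

// toCRIMount copies the deferred-mount details: the CRI Mount carries the
// device/share as source plus the fs type and options, instead of a host path
// to an already-mounted filesystem.
func toCRIMount(containerPath string, info fileSystemMountInfo) criMount {
	return criMount{
		ContainerPath: containerPath,
		Source:        info.Source,
		Type:          info.Type,
		Options:       info.Options,
	}
}

func main() {
	info := fileSystemMountInfo{
		Source:  "/dev/sdf",
		Type:    "xfs",
		Options: map[string]string{"nobarrier": ""},
	}
	fmt.Printf("CRI mount: %+v\n", toCRIMount("/data", info))
}
```
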

### New API for FileSystem Management Operations

Filesystem management operations like supporting online expansion and reporting
the condition and stats of the file system require coordination between a CSI
plugin and a container runtime handler. CSI plugins indicate deferral of these
operations to the Kubelet and a container runtime handler through the CSI
`NodeGetVolumeStatsResponse` and `NodeExpandVolumeResponse` as described above.
The Kubelet, in response, will use the new CSI Runtime API (described here) to
invoke file system management operations on the container runtime handler over a
unix domain socket specified in the RuntimeClass corresponding to the container
runtime handler.

Supporting the CSI Runtime API and surfacing the unix domain socket for invocation
of the API is expected to be the responsibility of the OCI runtime handler with
no involvement expected from the CRI runtime.

```
service Runtime {
  option (alpha_service) = true;

  rpc RuntimeGetFileSystemStats(RuntimeGetFileSystemStatsRequest) returns
    (RuntimeGetFileSystemStatsResponse) {}

  rpc RuntimeExpandVolume(RuntimeExpandVolumeRequest) returns
    (RuntimeExpandVolumeResponse) {}
}

message RuntimeGetFileSystemStatsRequest {
  // Contains a string to identify a block device surfaced on the host or a
  // shared file system export locator. CO MUST set this to the same value as
  // reported in the source field of the FileSystemMountInfo field of
  // NodePublishVolumeResponse from the SP.
  // This field is REQUIRED.
  string volume_source = 1;

  // Contains a string to identify the sandbox ID assigned by the
  // container runtime and discovered by the CO during sandbox creation.
  // This field is REQUIRED.
  string sandbox_id = 2;
}

message RuntimeGetFileSystemStatsResponse {
  // Contents of this message should be aligned to NodeGetVolumeStatsResponse.
  // This field is OPTIONAL.
  repeated VolumeUsage usage = 1;
  // Information about the current condition of the volume.
  // This field is OPTIONAL.
  // This field MUST be specified if the VOLUME_CONDITION node
  // capability is supported.
  VolumeCondition volume_condition = 2;
}

message RuntimeExpandVolumeRequest {
  // Contains a string to identify a block device surfaced on the host or a
  // shared file system export locator. CO MUST set this to the same value as
  // reported in the source field of the FileSystemMountInfo field of
  // NodePublishVolumeResponse from the SP.
  // This field is REQUIRED.
  string volume_source = 1;

  // Contains a string to identify the sandbox ID assigned by the
  // container runtime and discovered by the CO during sandbox creation.
  // This field is REQUIRED.
  string sandbox_id = 2;

  // Contains the desired size of the file system. CO MUST set this to the same
  // value as reported in the capacity_bytes field of NodeExpandVolumeResponse.
  // This field is REQUIRED.
  int64 capacity_bytes = 3;
}

message RuntimeExpandVolumeResponse {
  // Contents of this message should be aligned to NodeExpandVolumeResponse.
  // The capacity of the volume in bytes. This field is OPTIONAL.
  int64 capacity_bytes = 1;
}
```

### Limitations of the Design

In certain shared FS scenarios (like SMB), secrets associated with
mounting the FS may need to be passed to the OCI runtime to enable it to
authenticate. However, the configuration containing the OCI mount options may be
persisted on the host file system by the CRI runtime (as described
[here](https://github.com/containerd/containerd/issues/2426) in the case of
containerd).
Based on the security posture of the host, it may not be recommended to enable
runtime assisted mounts if persisting secrets on the host file system is
undesirable. Future enhancements in container runtimes to pass the OCI spec in
memory will address this limitation.

### Test Plan

### Graduation Criteria

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [ ] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

###### Does enabling the feature change any default behavior?

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

###### What happens if we reenable the feature if it was previously rolled back?

###### Are there any tests for feature enablement/disablement?

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

###### What specific metrics should inform a rollback?

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

### Scalability

###### Will enabling / using this feature result in any new API calls?

###### Will enabling / using this feature result in introducing new API types?

###### Will enabling / using this feature result in any new calls to the cloud provider?

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?
+ + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +### Alternative 1 - Skipping Kubelet involvement during mount deferral through metadata files +A CSI plugin can skip mounting of the Filesystem and pass the mount info to the +runtime through a specially named metadata file placed in the publish path that +gets passed to the runtime today as mount source. This approach does not require +any changes to Kubelet and CSI spec. However, this approach does not provide a +definitive way for a runtime to determine the authenticity of the metadata file. +A legitimate CSI plugin can be made to surface a malicious metadata file for the +container runtime to consume. Through this, an unauthorized user may get access +to an arbitrary block device on the host or NFS share that the container runtime +can mount based on the information in the malicious metadata file. + +This approach was the starting point in the Kata community: https://github.com/kata-containers/kata-containers/pull/1568 +that pivoted into this KEP to address the security concern mentioned above around +authenticating the validity of the metadata file within the runtime. + + +### Alternative 2 - Using Devices instead of Mounts in CRI and OCI +Today, raw block devices are mapped within a container by using the Device +message/struct in CRI and OCI. That message/struct could be enhanced with FS mounts +and options to enable runtime assisted mounting of block devices. However shared +FS scenarios do not fit well into that model. + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-storage/2857-runtime-assisted-pv-mounts/kep.yaml b/keps/sig-storage/2857-runtime-assisted-pv-mounts/kep.yaml new file mode 100644 index 000000000000..8a29114ac7a5 --- /dev/null +++ b/keps/sig-storage/2857-runtime-assisted-pv-mounts/kep.yaml @@ -0,0 +1,43 @@ +title: Runtime assisted mounting of Persistent Volumes +kep-number: 2857 +authors: + - "@ddebroy" + - "@yibozhuang" + - "@egernst" +owning-sig: sig-storage +participating-sigs: + - sig-node +status: provisional +creation-date: 2021-08-13 +reviewers: + - "@jsafrane" + - "@pohly" +approvers: + - "@jsafrane" + +see-also: +replaces: + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.23" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.24" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: RuntimeAssistedMount + components: + - kube-apiserver + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: