Add spec change for volume mount group #468

Merged

Conversation

@gnufied gnufied commented Jan 20, 2021

Adds an alpha field to NodeStage and NodePublish that allows the CO to supply the group identifier of the workload, so that the volume can have group permissions when published to a node.

Fixes #449

@gnufied gnufied force-pushed the add-gid-publish branch 2 times, most recently from 3c23d41 to e2f95a6 on January 27, 2021 16:19

@bswartz bswartz left a comment


I feel like there should be a clarification near the top or bottom of the spec that explains that when we say "all the files and directories are readable and writable", that is a continuous expectation for the life of the publish, not a point-in-time expectation at publish time. That clarification would rule out the chown/chmod based tricks.

spec.md Outdated
// It is expected that SP SHOULD use provided volume_mount_group
// for mounting the volume and volume should remain readable and
// writable by workloads associated with volume_mount_group until
// corresponding NodeUnstageVolume or NodePublishVolume is called.
Contributor Author

@bswartz how does this sound for indicating the lifetime of group-based permissions?

csi.proto Outdated
// provided volume_mount_group and all files and directories
// within the volume are readable and writable by the provided
// volume_mount_group.
// The value of volume_mount_group should be group_id or group name
Contributor

I think we have to be clear whether COs may specify a name or an ID here. If we can avoid specifically mentioning UNIX and Windows, and use language that applies equally well to both, that would be ideal.

For example, we could say it must be the ID of the group. On UNIX that would imply a decimal-formatted GID string. On Windows that would imply a SID string. If we say that names are also acceptable, then I would expect that to be true on both Windows and UNIX.

I want it to be unambiguous what the CO is required to pass down here, and unambiguous what the SP needs to be prepared to deal with here.

Contributor Author

reworded as we discussed offline. Now this reads:

// The value of volume_mount_group should be group
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.

csi.proto Outdated
// within the volume are readable and writable by the provided
// volume_mount_group.
// If NodeStageVolume was previously called with volume_mount_group
// CO MUST ensure that NodePublishVolume uses the same
Contributor

I thought we said this is a SHOULD, not a MUST. I think the semantics we want are: if the CO passes the same volume_mount_group as it did to stage, then the SP MUST NOT fail for a reason related to the group. If the CO passes a different group, then the operation MAY fail, and that should be communicated in the error code.

@gnufied gnufied Feb 11, 2021

I thought: if NodePublish gets called with a different value than NodeStage, it just isn't going to work for ANY storage provider, because the volume was already staged with a different group identifier, so there is no point in using SHOULD language there.

The place where we wanted to use SHOULD or MAY is: if the SP does not have NodeStage but two or more NodePublish calls were made with different volume_mount_group for the same volume (on the same node) but different target paths. Such a case MAY or MAY NOT be supported by the SP, and hence I covered this case with an error code:

| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified volume_id was already node published using different volume_mount_group on this node and can not be node published. | Caller MUST ensure that NodePublishVolume is called with same volume_mount_group on all target paths. |
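For illustration only, here is a minimal Go sketch of how an SP might enforce that error; the in-memory publishedGroups map and the checkMountGroup helper are assumptions made for the example, not part of any real driver.

    package driver

    import (
        "sync"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    var (
        mu sync.Mutex
        // publishedGroups remembers the volume_mount_group already used to
        // node-publish each volume on this node (volume_id -> group).
        publishedGroups = map[string]string{}
    )

    // checkMountGroup returns 9 FAILED_PRECONDITION when the volume was already
    // node published on this node with a different volume_mount_group.
    func checkMountGroup(volumeID, mountGroup string) error {
        mu.Lock()
        defer mu.Unlock()
        if prev, ok := publishedGroups[volumeID]; ok && prev != mountGroup {
            return status.Errorf(codes.FailedPrecondition,
                "volume %s already published with volume_mount_group %q, got %q",
                volumeID, prev, mountGroup)
        }
        publishedGroups[volumeID] = mountGroup
        return nil
    }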

csi.proto Outdated
// Indicates that Node service supports mounting volumes
// with provided volume group identifier during node stage
// or node publish RPC calls.
// It is expected that SP SHOULD use provided volume_mount_group
Contributor

We have to say here that if the capability is asserted, and the CO passes a non-nil group, then it MUST be respected. I don't think SHOULD is the right language here.

Contributor Author

reworded.
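As a hedged illustration of the capability being discussed in this thread, here is a sketch of a NodeGetCapabilities handler asserting it; the Driver type is hypothetical, and the constant and message names assume the Go bindings generated from csi.proto.

    package driver

    import (
        "context"

        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    type Driver struct{}

    // NodeGetCapabilities advertises VOLUME_MOUNT_GROUP, telling the CO that a
    // non-empty volume_mount_group passed at stage/publish time will be honored.
    func (d *Driver) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error) {
        return &csi.NodeGetCapabilitiesResponse{
            Capabilities: []*csi.NodeServiceCapability{{
                Type: &csi.NodeServiceCapability_Rpc{
                    Rpc: &csi.NodeServiceCapability_RPC{
                        Type: csi.NodeServiceCapability_RPC_VOLUME_MOUNT_GROUP,
                    },
                },
            }},
        }, nil
    }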

gnufied commented Feb 11, 2021

cc @jdef @saad-ali @msau42


@jdef jdef left a comment


Still waiting for morning coffee to kick in, but I've taken a stab at some initial feedback anyway.

spec.md Outdated

// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
Member

I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, and I think it should be removed. In terms of guarantees, the spec should aim no higher than what's provided by mount (and the OS driver implementation). Driver implementations of mount are already a black box, and it feels odd to layer additional requirements on top of an underspecified interface. Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.

Also, removing the recursive chown wording makes this more interesting for block volumes. Perhaps ownership means something for those volumes too? And if so, do we really want to tie this to "mount", or maybe we need a different word if "mount" is only for fs-type volumes (as opposed to block-type volumes).

@gnufied gnufied Feb 12, 2021

Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.

Just to clarify: if a driver has this capability, k8s will NOT chown/chmod files belonging to this volume. Instead the plan is that k8s will assume the SP has done its job and the volume has the right permissions after nodestage/nodepublish. Also, the reason we need this as a mount flag is mainly because in many cases k8s/the CO cannot recursively chown files (for example, a CIFS mount without unix extensions).

I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, thinking that it should be removed.

Yes, the wording does get into implementation details. My reasoning, as originally discussed with @bswartz, was that we do not want SPs to use this field to recursively chown/chmod files at all, because chown/chmod is very much a CO-specific thing and we actively wanted to prevent SPs from doing that. When using chown, for example, newly written files on the volume may not be group-writable and hence would violate the CSI spec. But I can see how this leaks into mount implementation details too.

Member

Still not happy with this wording:

... then SP MUST ensure that volume is mounted with provided volume_mount_group and all files and directories within the volume are readable and writable by the provided volume_mount_group.

(a) what does "mounted with provided volume_mount_group" actually mean?
(b) all files + dirs readable/writable sounds like a recursive chown to me, unless this is a validation check, which doesn't really make sense in this context.

I'd like to take a stab at some alt wording, which attempts to decrease the scope of what CSI is willing to promise:

"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group (primary group, if possible) identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, new file and/or directory entries written to the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are both readable and writable by said mount group identifier."

.. which tries to get away from saying that everything on a preexisting volume will be read/write-able (which gets into recursive chown territory), and also tries to cut to the chase of the requirement that I think I'm hearing: that newly written blobs are the concern here, and they should (by default) be labeled with the group ID presented here.

also, if a workload chowns files on the volume such that it can no longer read them .. well, tough luck. CSI is not a data plane API. It's up to the SP implementation and/or OS to restrict such operations if that behavior is undesirable. CSI should not promise anything with respect to what I/O operations a workload is allowed to execute (because CSI is a control plane API).

@gnufied gnufied Feb 19, 2021

I think the updated wording looks fine. The only question I have is: why "primary group, if possible"? k8s at least will almost never pass the primary group of the workload.

@gnufied gnufied Feb 22, 2021

I was thinking some more about it, and I agree that your updated wording meets the requirement and everything, but one more thing which has been bothering me somewhat is that we don't really need to call out the permissions of new files/directories. The wording does imply that chown/chmod may not be enough, but what we really want SPs to use is mount options, hoping that a mount operation with the provided group identifier is enough to apply the necessary permissions.

So I am thinking of two alternative wordings:

"...then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call,
with the understanding that this would result in volume being readable(if published as readonly) or both readable and writable by said mount group identifier. The set of available mount call parameters and/or mount implementations may vary across operating systems." --> This covers readonly interaction as well.

Or something like:

"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, it is expected file and/or directory entries on the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are either readable(in case of readonly volumes) or both readable and writable by said mount group identifier." --> covers interaction with readonly

Member

My wording says the permission labeling (ala the group option) targets new file(s)/dir(s); your wording doesn't include the "new", which in my mind translates to "recursive chown".

If you did not intend to drop "new" from the wording, please add it back. Because maybe you're only trying to amend my wording to account for (a) readable files in the read-only case, or else (b) read/write-able files otherwise?

Contributor Author

My main concern with accounting for "new" files is: are we being too specific? An SP may violate this and still satisfy what the CO expects. But to be clear, we do not have that problem, at least with Azurefile/CIFS-Samba mounts. It works exactly like how you worded it, and it should be fine. So maybe I am over-thinking this.

Contributor Author

Alright - I have incorporated this wording. I thought again about the readonly case and I do not think it is important to call that out here again, because readonly can be enforced in different ways than by combining it with gid mount options.

spec.md Outdated

// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
Member

ditto re: recursive chown requirement

spec.md Outdated
@@ -2260,6 +2288,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that `NodePublishVolume` is called with same `volume_mount_group` which was used in `NodeStageVolume`. |
| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |
Member

(a) if this case is meant to cover the scenario where node-stage is unimplemented, please say so explicitly.
(b) it is unclear whether this behavior is "optional", meaning that if some implementations support multiple publish calls w/ different groups, are they still required to throw this error? From earlier review comments in the spec it seems like this was to be optionally supported, but the language here does not make that clear.

@gnufied gnufied Mar 3, 2021

So I have gone back a little bit on this one and updated the wording to be the same for both nodestage->nodepublish and nodepublish->nodepublish. Originally the wording said the CO MUST use the same volume_mount_group identifier with NodePublish that was used during NodeStage. I have relaxed it with the following wording:

  // If Plugin does not support NodePublishVolume with different
  // volume_mount_group than the one used during NodeStageVolume
  // then Plugin MAY return FAILED_PRECONDITION error.
  // Similarly if SP does not support NodePublishVolume of same volume
  // on same node but with different volume_mount_group it MAY return
  // FAILED_PRECONDITION error.

The idea is that some plugins may not perform the actual mount on NodeStage; they may only do some housekeeping during staging, and hence, if the underlying driver supports it, multiple NodePublish calls of the same volume can use a different gid than what was specified on NodeStage. It did not make sense to penalize a Plugin for implementing NodeStage.

Contributor Author

cc @bswartz ^
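A minimal Go sketch of that pattern, assuming a hypothetical plugin whose NodeStage only records bookkeeping and whose NodePublish performs the real mount; stageInfo and mountWithGroup are made-up helpers for illustration, not taken from any particular driver.

    package driver

    // stageInfo is the bookkeeping a plugin of this style records at NodeStage;
    // no mount happens yet, so no group identifier is fixed at stage time.
    type stageInfo struct {
        volumeID    string
        stagingPath string
    }

    // publish performs the actual mount at NodePublish time, passing the
    // per-publish volume_mount_group down to the OS-specific mount helper, so
    // each target path can carry its own group if the filesystem allows it.
    func publish(info stageInfo, targetPath, volumeMountGroup string) error {
        return mountWithGroup(info.stagingPath, targetPath, volumeMountGroup)
    }

    // mountWithGroup stands in for the driver's OS-specific mount call that
    // knows how to apply the group identifier (e.g. as a mount option).
    func mountWithGroup(source, target, volumeMountGroup string) error {
        // ... invoke mount(8) / the mount syscall here ...
        return nil
    }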

spec.md Outdated
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.
// This is an OPTIONAL field.
string volume_mount_group = 9;
Member

so the next thing that someone will ask for is "mount user". and then someone else will want "mount SELinux labels". and then "extended ACLs". and so on.

at the risk of over-engineering a solution here, i'm wondering if it might make sense to have a top-level type, e.g. VolumeOwnership to capture the intent:

message VolumeOwnership {
  string primary_group = 1;
}

and then the field in this message becomes:

  VolumeOwnership volume_mount_owner = 9;

Contributor Author

The concern about mount_user is valid; we may need that in the future. SELinux, however, is solvable without any CSI spec change - see https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1710-selinux-relabeling/README.md

Contributor Author

Also, the reason we need this as an explicit field in nodestage/nodepublish is that, unlike SELinux mount options, which have a standard key/value format (like mount -o context=<>), the group mount option has no such standard format. So the CO does not know how to format it.


Did we conclude on this? I think it would be nice to leave room for future options.

@gnufied gnufied Feb 23, 2021

hmm, so I was trying to update the spec to move volume_mount_group to its own message type, but there are some issues to think about.

Currently the spec says that if a volume was staged with one group identifier, it can't be node published with a different group identifier on the same node. This makes sense for the group identifier, but what if in the future it does not make sense for uid or something else?

Using SELinux as an example: if we were to add selinux_context to the VolumeOwnership message, then it would be an error to NodePublish a volume with different SELinux contexts on the same node; it just won't work. But it does work for the group identifier, for many CIFS implementations at least.

So bringing all of the ownership-related options within one message can also complicate such future changes. What if the behaviour of those ownership-related flags is different and their interaction with the CO and workload is different?


Does the restriction have to apply at the higher level VolumeOwnership or can it apply on individual fields?

Contributor Author

I would expect restrictions to apply to individual fields, but error handling and recovery have to be appropriately adjusted so that they can apply to individual fields.

Contributor Author

The trouble is basically appropriately wording a VolumeOwnership type so that it can apply to both NodeStage and NodePublish, then defining errors and recovery for both stage and publish in the same place and type, and then ensuring that error handling and recovery appropriately apply to all the fields and are easy to extend in the future.

This IMO leads to an excessively long definition of the volume_mount_group type (because, among other things, we now have to define both publish- and stage-time behaviour), for example.

It is not impossible but I am not sure if this is worth the tradeoff. @jdef @msau42 what do you think?

Member

we could consider other naming:

message VolumePrincipal {
  string group = 1;
}

Contributor Author

we can certainly do this. Did we want this as part of VolumeCapability field?

bswartz commented Feb 17, 2021

@jdef My suggestion was not to expose implementation detail, but to highlight the fact that the ownership should be a continuous thing over the life of the publish, and not a discrete one time action at publish time. We shouldn't care how the plugin achieves that, but my goal was to rule out things like a chown.

bswartz commented Feb 17, 2021

I am happy with this wording. Any other outstanding changes before we approve?

gnufied commented Feb 17, 2021

@bswartz I added additional language to the NodePublishVolume RPC call that clarifies the concerns @jdef raised here - #468 (comment)



jdef commented Feb 19, 2021 via email


jdef commented Mar 3, 2021 via email

gnufied commented Mar 3, 2021

@jdef thanks. Do you think we can get this in for the next CSI release (i.e. 1.4), or is it better suited for 1.5? I need to make some k8s changes, and if we can't make 1.4 then I would rather avoid the extra work.

gnufied commented Mar 3, 2021

Posting an update since this PR will likely miss the cut - I am waiting for some explanation from @jdef on #468 (comment). We could make the value of volume_mount_group more restricted, but it would prevent use cases where a driver may support node publishing the same volume multiple times with different gids (i.e. if I understood James correctly).

gnufied commented Apr 28, 2021

@jdef so one thing we can do to sacrifice some of the flexibility and reduce two different errors to one is to enforce that, for the same volume_id, node stage, node publish, and multiple instances of node publish on the same node always use the same volume_mount_group. That is, make it entirely unsupported to specify a different volume_mount_group for the same volume if the volume was previously staged or published with a different volume_mount_group.

This will allow us to squash the two different errors into one:

Volume already published or staged with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published or staged using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |

@gnufied gnufied force-pushed the add-gid-publish branch 3 times, most recently from b3f3c6f to 98893c5 on April 29, 2021 18:47
gnufied commented Apr 29, 2021

@jdef @saad-ali alright, the updated scope is narrower now. The spec now says all instances of NodePublishVolume must use the same gid value as previous instances of NodePublish or NodeStage RPC calls for the same volume. This also simplifies error handling. PTAL.

@saad-ali
Member

Thanks @gnufied

Can you provide examples of a couple of CSI Drivers and how they would implement this, both for Linux and for Windows? e.g. "AzureFile will convert this into a mount --foo in Linux and (?) in Windows, and CSI driver ... will do ..."

Sorry if this was mentioned elsewhere, just want to wrap my head around the change.

gnufied commented May 21, 2021

On Linux, the Azurefile CSI driver would convert the supplied volume_mount_group to a gid during the NodeStage call:

mount -o gid=volume_mount_group

On Windows there is currently no implementation of using fsgroups/gids for pods, at least in k8s. But when I spoke with @ddebroy on Slack, he suggested that the supplied group_id can be a SID that can be used to lock down the volume for a specific security identifier (https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/security-identifiers-in-windows).
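A hedged Go sketch of that Linux flow: an SMB/CIFS-style driver appending a gid= option before shelling out to mount(8). The fstype, option formatting, and helper name are illustrative assumptions, not the actual Azurefile driver code.

    package driver

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // stageSMBVolume translates the supplied volume_mount_group into a gid=
    // mount option and mounts the share at the staging path.
    func stageSMBVolume(source, stagingPath, volumeMountGroup string, extraOpts []string) error {
        opts := append([]string{}, extraOpts...)
        if volumeMountGroup != "" {
            opts = append(opts, fmt.Sprintf("gid=%s", volumeMountGroup))
        }
        out, err := exec.Command("mount", "-t", "cifs",
            "-o", strings.Join(opts, ","), source, stagingPath).CombinedOutput()
        if err != nil {
            return fmt.Errorf("mount failed: %v: %s", err, out)
        }
        return nil
    }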

@saad-ali
Member

I spoke with @gnufied offline about this feature. He explained:

The need for fsgroup in the CSI RPC call is that, for certain volume types, the only way to change volume permissions is during mount via a mount option. You can't recursively chown/chmod these files because azurefile, for example, does not support unix extensions.
We could carry a hardcode just for the azurefile CSI driver in kubelet, but it would be super ugly.
And it is not just azurefile; there will be other CIFS/Samba implementations that will need something similar.

In general I worry about adding any permissions logic to CSI for a couple of reasons: 1) I question whether manipulating file permissions actually adds any real security, and 2) ensuring file permissions are added in a way that will work across all OS/filesystem combos will be hard.

For my first concern, @gnufied explained that many apps won't work with world-readable files, so "meeting the user where they are" is one reason we need fsgroup functionality; the other argument is security* (with an asterisk because you only get security benefits if you also limit the ranges pods can have, and/or dynamically assign the ranges).

For my second concern, @gnufied explained this approach should work for both windows and linux at least for azurefile.

Given how important this is for AzureFile, I'm keen to approve this. @jdef @ddebroy any objections?

ddebroy commented May 26, 2021

LGTM. No objections based on relevance of this for Azurefile as well as any other CSI plugin that leverages SMB.

humblec commented May 27, 2021

One thing which does not give me clarity here is the reference to VOLUME_MOUNT_GROUP and what exactly it refers to. When we say the CO passes VOLUME_MOUNT_GROUP, are we referring to the (effective) GID or the supplemental GID here (https://www.openshift.com/blog/a-guide-to-openshift-and-uids)? We have seen many issues in the past due to permission denied errors on the volume from app workloads, especially for RWX volumes. One way to solve this was passing the GID via a PV annotation, which apparently gets added to the SGID group array by the CO. IMO the spec has to make it clear in which GID context (effective/supplemental) VOLUME_MOUNT_GROUP is supposed to operate, at the CO and SP level. @gnufied @saad-ali @bswartz @jdef


@jdef jdef left a comment


thanks for following up on this. i think this proposal needs a bit more work; left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP. i'm wondering if we'd benefit from some kind of thought-experiment-in-the-form-of-docs that walks through a hypothetical CO/SP integration, calling out the requisite configuration and how that manifests at runtime via CSI calls, and the resulting system state.

spec.md Outdated
@@ -2315,6 +2348,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged or published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged or published using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that all `NodePublishVolume` calls specify same value of `volume_mount_group`. |
Member

have we considered how this feature works in light of the new single_node_multi_writer functionality being proposed?

@gnufied gnufied May 27, 2021

Currently the definition of volume_mount_group is stricter and requires that all node-publish RPC calls use the same value of volume_mount_group. So basically, if T1!=T2 and P1!=P2 (where Pn includes volume_mount_group), then for single_node_multi_writer volumes node-publish should fail.

Having said that, we can relax the wording once #476 merges, or if this merges first then we can update #476. The relaxed wording would be: if an SP supports both the SINGLE_NODE_MULTI_WRITER and VOLUME_MOUNT_GROUP capabilities, then it MUST allow NodePublishing the same volume multiple times on the same node with different volume_mount_group. The CIFS implementations that I know of already support that (you can mount the same volume multiple times on the node with different gids).

The language here is, however, stricter, because we wanted to start with something strict and then relax the wording if needed (an earlier version of the design did allow the aforementioned flow).

spec.md Outdated
@@ -2134,6 +2134,19 @@ message NodeStageVolumeRequest {
// This field is OPTIONAL and MUST match the volume_context of the
// volume identified by `volume_id`.
map<string, string> volume_context = 6;

Member

this field is declared both for node-stage, and for node-publish. why not declare it ONCE as part of VolumeCapability ?

Member

or, if this is only for mount .. then perhaps as part of MountVolume?

Member

and what does this field mean in the context of block vols?

Contributor Author

It is not possible to use the mount flags defined in the volume capability (inside MountVolume) because:

  1. The flags defined inside the volume capability are static and part of the CreateVolume response, whereas this mount option depends on the workload that is trying to use the volume and could change.
  2. Even if we pretend that the volume capability can be modified at any point in time and is not tied to the Volume object, how does the CO know it should supply a -o gid=<> mount flag during NodeStage or NodePublish?

Contributor Author

hmm - I think your response meant the flag should not be included in mount_flags inside the MountVolume type, but rather be a separate field in the MountVolume type? I misunderstood it to be the former.

Yes, having a separate volume_mount_group field inside MountVolume could work. It would have the same semantic meaning as when it is defined outside (as in the current PR). The main concern though is that it would perhaps alter the meaning of volume capability. But if that is an acceptable trade-off, I can make that change.

Contributor Author

I have moved this field to VolumeCapability with everything else being the same. I think this also allows us to clean up the error handling: error handling is now simply determined by whether the SP supports publishing a volume with different capabilities on the same node. It also addresses the interaction of this PR with #476.
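For illustration, once the field lives inside VolumeCapability's MountVolume block, an SP would read it roughly as below; the accessor chain assumes the nil-safe getters of the Go bindings generated from csi.proto, so treat it as an assumption about those bindings rather than spec text.

    package driver

    import (
        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    // mountGroupFromPublish pulls volume_mount_group out of a NodePublishVolume
    // request; the generated getters return zero values on nil, so this is safe
    // even when the CO leaves the OPTIONAL field unset.
    func mountGroupFromPublish(req *csi.NodePublishVolumeRequest) string {
        return req.GetVolumeCapability().GetMount().GetVolumeMountGroup()
    }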


gnufied commented May 27, 2021

thanks for following up on this. i think this proposal needs a bit more work, left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP.

Trying to address both @humblec's and @jdef's concerns here - there is a k8s KEP which describes how this will be used: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2317-fsgroup-on-mount/README.md . Beyond that, I have written some code that uses this new field, and integration between the CO and SP is pretty straightforward.

@humblec supplemental groups and fsgroups are the CO's internal implementation details that do not necessarily bubble up to the CSI spec. But for completeness' sake, I will try and clarify here: in k8s, supplemental groups are additional groups that are added to a pod (not the primary GID). They can be used to match the gid of the underlying volume, so that k8s does not have to perform any permission/ownership change if a supplemental group already matches the volume's underlying group ownership/permissions. Supplemental groups do not result in any permission change of the volume and hence are not useful for the problem we are trying to solve.

humblec commented May 27, 2021

thanks for following up on this. i think this proposal needs a bit more work, left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP.

Trying to address both @humblec and @jdef concerns here - there is a k8s KEP which describes how this will be used - https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2317-fsgroup-on-mount/README.md . Beyond that, I have written some code that uses this new field and integration with CO and SP is pretty straightforward.

@humblec the supplemental groups and fsgroups are CO's internal implementation details that do not necessarily bubble up to CSI spec.

Sure, supplemental groups are a general (*nix) concept too, so I brought it up.

But for completeness sake, I will try and clarify here - in k8s supplemental groups are additional groups that are added to a pod (not the primary GID). They can be used to match gid of the underlying volume, so as k8s does not have to perform any permission/ownership change if supplemental group already match volume's underlying group ownership/permissions.

True, I made use of that heavily in the GlusterFS plugin under the kube CO to overcome permission denied errors and such.

Supplemental groups do not result in any permission change of the volume and hence is not useful with problem we are trying to solve.

In short, VOLUME_MOUNT_GROUP is the effective GID which the CO/SP can make use of, and the supplemental group workflow is unaffected/untouched. Thanks for clarifying @gnufied 👍

@andyzhangx
Contributor

@gnufied with this interface change, and also the k/k code change, the following fsGroup value would be passed to NodeStageVolume and NodePublishVolume, right?

spec:
  securityContext:
    fsGroup: 2000

gnufied commented Jun 2, 2021

@andyzhangx yes that is accurate.
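A minimal CO-side sketch of that mapping, assuming Linux where the pod's fsGroup is an int64 GID that becomes the decimal string carried in volume_mount_group; the helper is illustrative, and the real wiring lives in the kubelet rather than here.

    package co

    import "strconv"

    // volumeMountGroupFromFSGroup converts a pod-level fsGroup into the string
    // value the CO places in the volume_mount_group field of NodeStageVolume
    // and NodePublishVolume requests.
    func volumeMountGroupFromFSGroup(fsGroup *int64) string {
        if fsGroup == nil {
            return "" // no fsGroup set: leave the OPTIONAL field empty
        }
        return strconv.FormatInt(*fsGroup, 10)
    }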

gnufied commented Jun 4, 2021

@jdef @saad-ali @bswartz @msau42 please take another look. As discussed in #468 (comment), I have moved the new field to VolumeCapability. This will work for our use case, and it simplifies some of the API choices we had to make.


@bswartz bswartz left a comment


/lgtm


saad-ali commented Jun 4, 2021

@jdef @humblec can you PTAL?

saad-ali commented Jun 4, 2021

PTAL ASAP - trying to get this merged and a new release cut on Monday.

humblec commented Jun 5, 2021

@gnufied with this interface change, and also k/k code change, the following fsGroup value would be passed to NodeStageVolume and NodePublishVolume, right?

spec:
  securityContext:
    fsGroup: 2000

@gnufied just to complete my understanding: if the volume is RWX (multi node multi writer) and 2 different pods are scheduled on 2 different nodes specifying different fsgroup values, is the SP supposed to respect these values at mount time for the same underlying volume?

humblec commented Jun 5, 2021

@jdef @humblec can you PTAL?

Sure @saad-ali, the change looks good to me!! 👍
I have one side query here (#468 (comment)), but it should not be a blocker. Thanks!


@jdef jdef left a comment


LGTM mod nits

In the case of the SINGLE_NODE_MULTI_WRITER flow (another PR looking to land ASAP), is it legal to mount the same volume to multiple workloads on the same node w/ different volume_mount_group's? If not, should it be? I have my doubts as to whether all SPs (or even OSs) can support this. It may be worth calling out somewhere/somehow as an error case that the CO will need to deal with.

gnufied commented Jun 7, 2021

@jdef the SINGLE_NODE_MULTI_WRITER PR has wording that says a CO SHOULD NOT call NodePublishVolume with a different volume_capability for the same volume - https://github.com/container-storage-interface/spec/pull/476/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R2251

I can add more explicit wording - but I thought that language covers this limitation well enough.

@humblec Generally speaking, yes, it is possible that the CO will call node-stage/node-publish with different volume_mount_group on different nodes for the same RWX volume. Plugins that need this feature, as far as I know, support that behaviour. If it isn't supported by the SP, it may return Exceeds capabilities.


@saad-ali saad-ali left a comment


/lgtm

LGTM. No objections based on relevance of this for Azurefile as well as any other CSI plugin that leverages SMB.

LGTM mod nits

Sure @saad-ali, the change looks good to me !! 👍

Thanks folks! Will go ahead and merge this.

Apply suggestions from code review
Co-authored-by: jdef <2348332+jdef@users.noreply.github.com>
saad-ali commented Jun 7, 2021

Thanks for the rebase. Merging now.

@saad-ali saad-ali merged commit f1ae024 into container-storage-interface:master Jun 7, 2021
@saad-ali saad-ali mentioned this pull request Jun 7, 2021