Add spec change for volume mount group #468

Merged

Conversation

@gnufied gnufied commented Jan 20, 2021

Adds an alpha field to NodeStage and NodePublish that allows the CO to supply the group identifier of the workload, so that the volume can have group permissions when published to a node.

Fixes #449

@gnufied gnufied force-pushed the add-gid-publish branch 2 times, most recently from 3c23d41 to e2f95a6 on January 27, 2021 16:19

@bswartz bswartz left a comment


I feel like there should be a clarification near the top or bottom of the spec that explains that when we say "all the files and directories are readable and writable", that is a continuous expectation for the life of the publish, not a point-in-time expectation at publish time. That clarification would rule out the chown/chmod based tricks.

spec.md Outdated
// It is expected that SP SHOULD use provided volume_mount_group
// for mounting the volume and volume should remain readable and
// writable by workloads associated with volume_mount_group until
// corresponding NodeUnstageVolume or NodePublishVolume is called.
Contributor Author

@bswartz how does this sound for indicating the lifetime of group-based permissions?

csi.proto Outdated
// provided volume_mount_group and all files and directories
// within the volume are readable and writable by the provided
// volume_mount_group.
// The value of volume_mount_group should be group_id or group name
Contributor

I think we have to be clear whether COs may specify a name or an ID here. If we can avoid specifically mentioning UNIX and Windows, and use language that applies equally well to both, that would be ideal.

For example, we could say it must be the ID of the group. On UNIX that would imply a decimal-formatted GID string. On Windows that would imply a SID string. If we say that names are also acceptable, then I would expect that to be true on both Windows and UNIX.

I want it to be unambiguous what the CO is required to pass down here, and unambiguous what the SP needs to be prepared to deal with here.

Contributor Author

reworded as we discussed offline. Now this reads:

// The value of volume_mount_group should be group
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.

csi.proto Outdated
// within the volume are readable and writable by the provided
// volume_mount_group.
// If NodeStageVolume was previously called with volume_mount_group
// CO MUST ensure that NodePublishVolume uses the same
Contributor

I thought we said this is a SHOULD, not a MUST. I think the semantics we want are: if the CO passes the same volume_mount_group as it did to stage, then the SP MUST NOT fail for a reason related to the group. If the CO passes a different group, then the operation MAY fail, and that should be communicated in the error code.

@gnufied gnufied Feb 11, 2021

I thought: if NodePublish gets called with a different value than NodeStage, it just isn't going to work for ANY storage provider, because the volume was already staged with a different group identifier, so there is no point in using SHOULD language there.

The place where we wanted to use SHOULD or MAY is: if the SP does not have NodeStage but two or more NodePublish calls were made with different volume_mount_group for the same volume (on the same node) but different target paths. Such a case MAY or MAY NOT be supported by the SP, and hence I covered this case with an error code:

| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified volume_id was already node published using different volume_mount_group on this node and can not be node published. | Caller MUST ensure that NodePublishVolume is called with same volume_mount_group on all target paths. |
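For illustration only, here is a minimal Go sketch of how an SP might enforce that error; the in-memory publishedGroups map and the checkMountGroup helper are assumptions made for the example, not part of any real driver.

    package driver

    import (
        "sync"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    var (
        mu sync.Mutex
        // publishedGroups remembers the volume_mount_group already used to
        // node-publish each volume on this node (volume_id -> group).
        publishedGroups = map[string]string{}
    )

    // checkMountGroup returns 9 FAILED_PRECONDITION when the volume was already
    // node published on this node with a different volume_mount_group.
    func checkMountGroup(volumeID, mountGroup string) error {
        mu.Lock()
        defer mu.Unlock()
        if prev, ok := publishedGroups[volumeID]; ok && prev != mountGroup {
            return status.Errorf(codes.FailedPrecondition,
                "volume %s already published with volume_mount_group %q, got %q",
                volumeID, prev, mountGroup)
        }
        publishedGroups[volumeID] = mountGroup
        return nil
    }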

csi.proto Outdated
// Indicates that Node service supports mounting volumes
// with provided volume group identifier during node stage
// or node publish RPC calls.
// It is expected that SP SHOULD use provided volume_mount_group
Contributor

We have to say here that if the capability is asserted, and the CO passes a non-nil group, then it MUST be respected. I don't think SHOULD is the right language here.

Contributor Author

reworded.
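As a hedged illustration of the capability being discussed in this thread, here is a sketch of a NodeGetCapabilities handler asserting it; the Driver type is hypothetical, and the constant and message names assume the Go bindings generated from csi.proto.

    package driver

    import (
        "context"

        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    type Driver struct{}

    // NodeGetCapabilities advertises VOLUME_MOUNT_GROUP, telling the CO that a
    // non-empty volume_mount_group passed at stage/publish time will be honored.
    func (d *Driver) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error) {
        return &csi.NodeGetCapabilitiesResponse{
            Capabilities: []*csi.NodeServiceCapability{{
                Type: &csi.NodeServiceCapability_Rpc{
                    Rpc: &csi.NodeServiceCapability_RPC{
                        Type: csi.NodeServiceCapability_RPC_VOLUME_MOUNT_GROUP,
                    },
                },
            }},
        }, nil
    }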

gnufied commented Feb 11, 2021

cc @jdef @saad-ali @msau42


@jdef jdef left a comment


Still waiting for morning coffee to kick in, but I've taken a stab at some initial feedback anyway.

spec.md Outdated

// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
Member

I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, and I think it should be removed. In terms of guarantees, the spec should aim no higher than what's provided by mount (and the OS driver implementation). Driver implementations of mount are already a black box, and it feels odd to layer additional requirements on top of an underspecified interface. Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.

Also, removing the recursive chown wording makes this more interesting for block volumes. Perhaps ownership means something for those volumes too? And if so, do we really want to tie this to "mount", or maybe we need a different word if "mount" is only for fs-type volumes (as opposed to block-type volumes).

@gnufied gnufied Feb 12, 2021

Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.

Just to clarify: if a driver has this capability, k8s will NOT chown/chmod files belonging to this volume. Instead the plan is that k8s will assume the SP has done its job and the volume has the right permissions after nodestage/nodepublish. Also, the reason we need this as a mount flag is mainly because in many cases k8s/the CO cannot recursively chown files (for example, a CIFS mount without unix extensions).

I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, thinking that it should be removed.

Yes, the wording does get into implementation details. My reasoning, as originally discussed with @bswartz, was that we do not want SPs to use this field to recursively chown/chmod files at all, because chown/chmod is very much a CO-specific thing and we actively wanted to prevent SPs from doing that. When using chown, for example, newly written files on the volume may not be group-writable and hence would violate the CSI spec. But I can see how this leaks into mount implementation details too.

Member

Still not happy with this wording:

... then SP MUST ensure that volume is mounted with provided volume_mount_group and all files and directories within the volume are readable and writable by the provided volume_mount_group.

(a) what does "mounted with provided volume_mount_group" actually mean?
(b) all files + dirs readable/writable sounds like a recursive chown to me, unless this is a validation check, which doesn't really make sense in this context.

I'd like to take a stab at some alt wording, which attempts to decrease the scope of what CSI is willing to promise:

"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group (primary group, if possible) identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, new file and/or directory entries written to the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are both readable and writable by said mount group identifier."

.. which tries to get away from saying that everything on a preexisting volume will be read/write-able (which gets into recursive chown territory), and also tries to cut to the chase of the requirement that I think I'm hearing: that newly written blobs are the concern here, and they should (by default) be labeled with the group ID presented here.

also, if a workload chowns files on the volume such that it can no longer read them .. well, tough luck. CSI is not a data plane API. It's up to the SP implementation and/or OS to restrict such operations if that behavior is undesirable. CSI should not promise anything with respect to what I/O operations a workload is allowed to execute (because CSI is a control plane API).

@gnufied gnufied Feb 19, 2021

I think the updated wording looks fine. The only question I have is: why "primary group, if possible"? k8s at least will almost never pass the primary group of the workload.

@gnufied gnufied Feb 22, 2021

I was thinking some more about it, and I agree that your updated wording meets the requirement and everything, but one more thing which has been bothering me somewhat is that we don't really need to call out the permissions of new files/directories. The wording does imply that chown/chmod may not be enough, but what we really want SPs to use is mount options, hoping that a mount operation with the provided group identifier is enough to apply the necessary permissions.

So I am thinking of two alternative wordings:

"...then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call,
with the understanding that this would result in volume being readable(if published as readonly) or both readable and writable by said mount group identifier. The set of available mount call parameters and/or mount implementations may vary across operating systems." --> This covers readonly interaction as well.

Or something like:

"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, it is expected file and/or directory entries on the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are either readable(in case of readonly volumes) or both readable and writable by said mount group identifier." --> covers interaction with readonly

Member

My wording says the permission labeling (ala the group option) targets new file(s)/dir(s); your wording doesn't include the "new", which in my mind translates to "recursive chown".

If you did not intend to drop "new" from the wording, please add it back. Because maybe you're only trying to amend my wording to account for (a) readable files in the read-only case, or else (b) read/write-able files otherwise?

Contributor Author

My main concern with accounting for "new" files is: are we being too specific? An SP may violate this and still satisfy what the CO expects. But to be clear, we do not have that problem, at least with Azurefile/CIFS-Samba mounts. It works exactly like how you worded it, and it should be fine. So maybe I am over-thinking this.

Contributor Author

Alright - I have incorporated this wording. I thought again about the readonly case and I do not think it is important to call that out here again, because readonly can be enforced in different ways than by combining it with gid mount options.

spec.md Outdated

// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
Member

ditto re: recursive chown requirement

spec.md Outdated
@@ -2260,6 +2288,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that `NodePublishVolume` is called with same `volume_mount_group` which was used in `NodeStageVolume`. |
| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |
Member

(a) if this case is meant to cover the scenario where node-stage is unimplemented, please say so explicitly.
(b) it is unclear whether this behavior is "optional", meaning that if some implementations support multiple publish calls w/ different groups, are they still required to throw this error? From earlier review comments in the spec it seems like this was to be optionally supported, but the language here does not make that clear.

@gnufied gnufied Mar 3, 2021

So I have gone back a little bit on this one and updated the wording to be the same for both nodestage->nodepublish and nodepublish->nodepublish. Originally the wording said the CO MUST use the same volume_mount_group identifier with NodePublish that was used during NodeStage. I have relaxed it with the following wording:

  // If Plugin does not support NodePublishVolume with different
  // volume_mount_group than the one used during NodeStageVolume
  // then Plugin MAY return FAILED_PRECONDITION error.
  // Similarly if SP does not support NodePublishVolume of same volume
  // on same node but with different volume_mount_group it MAY return
  // FAILED_PRECONDITION error.

The idea is that some plugins may not perform the actual mount on NodeStage; they may only do some housekeeping during staging, and hence, if the underlying driver supports it, multiple NodePublish calls of the same volume can use a different gid than what was specified on NodeStage. It did not make sense to penalize a Plugin for implementing NodeStage.

Contributor Author

cc @bswartz ^
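A minimal Go sketch of that pattern, assuming a hypothetical plugin whose NodeStage only records bookkeeping and whose NodePublish performs the real mount; stageInfo and mountWithGroup are made-up helpers for illustration, not taken from any particular driver.

    package driver

    // stageInfo is the bookkeeping a plugin of this style records at NodeStage;
    // no mount happens yet, so no group identifier is fixed at stage time.
    type stageInfo struct {
        volumeID    string
        stagingPath string
    }

    // publish performs the actual mount at NodePublish time, passing the
    // per-publish volume_mount_group down to the OS-specific mount helper, so
    // each target path can carry its own group if the filesystem allows it.
    func publish(info stageInfo, targetPath, volumeMountGroup string) error {
        return mountWithGroup(info.stagingPath, targetPath, volumeMountGroup)
    }

    // mountWithGroup stands in for the driver's OS-specific mount call that
    // knows how to apply the group identifier (e.g. as a mount option).
    func mountWithGroup(source, target, volumeMountGroup string) error {
        // ... invoke mount(8) / the mount syscall here ...
        return nil
    }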

spec.md Outdated
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.
// This is an OPTIONAL field.
string volume_mount_group = 9;
Member

so the next thing that someone will ask for is "mount user". and then someone else will want "mount SELinux labels". and then "extended ACLs". and so on.

at the risk of over-engineering a solution here, i'm wondering if it might make sense to have a top-level type, e.g. VolumeOwnership to capture the intent:

message VolumeOwnership {
  string primary_group = 1;
}

and then the field in this message becomes:

  VolumeOwnership volume_mount_owner = 9;

Contributor Author

The concern about mount_user is valid; we may need that in the future. SELinux, however, is solvable without any CSI spec change - see https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1710-selinux-relabeling/README.md

Contributor Author

Also, the reason we need this as an explicit field in nodestage/nodepublish is that, unlike SELinux mount options, which have a standard key/value format (like mount -o context=<>), the group mount option has no such standard format. So the CO does not know how to format it.


Did we conclude on this? I think it would be nice to leave room for future options.

@gnufied gnufied Feb 23, 2021

hmm, so I was trying to update the spec to move volume_mount_group to its own message type, but there are some issues to think about.

Currently the spec says that if a volume was staged with one group identifier, it can't be node published with a different group identifier on the same node. This makes sense for the group identifier, but what if in the future it does not make sense for uid or something else?

Using SELinux as an example: if we were to add selinux_context to the VolumeOwnership message, then it would be an error to NodePublish a volume with different SELinux contexts on the same node; it just won't work. But it does work for the group identifier, for many CIFS implementations at least.

So bringing all of the ownership-related options within one message can also complicate such future changes. What if the behaviour of those ownership-related flags is different and their interaction with the CO and workload is different?


Does the restriction have to apply at the higher level VolumeOwnership or can it apply on individual fields?

Contributor Author

I would expect restrictions to apply to individual fields, but error handling and recovery have to be appropriately adjusted so that they can apply to individual fields.

Contributor Author

The trouble is basically appropriately wording a VolumeOwnership type so that it can apply to both NodeStage and NodePublish, then defining errors and recovery for both stage and publish in the same place and type, and then ensuring that error handling and recovery appropriately apply to all the fields and are easy to extend in the future.

This IMO leads to an excessively long definition of the volume_mount_group type (because, among other things, we now have to define both publish- and stage-time behaviour), for example.

It is not impossible but I am not sure if this is worth the tradeoff. @jdef @msau42 what do you think?

Member

we could consider other naming:

message VolumePrincipal {
  string group = 1;
}

Contributor Author

we can certainly do this. Did we want this as part of VolumeCapability field?

bswartz commented Feb 17, 2021

@jdef My suggestion was not to expose implementation detail, but to highlight the fact that the ownership should be a continuous thing over the life of the publish, and not a discrete one time action at publish time. We shouldn't care how the plugin achieves that, but my goal was to rule out things like a chown.

bswartz commented Feb 17, 2021

I am happy with this wording. Any other outstanding changes before we approve?

gnufied commented Feb 17, 2021

@bswartz I added additional language to the NodePublishVolume RPC call that clarifies the concerns @jdef raised here - #468 (comment)



jdef commented Feb 19, 2021 via email


jdef commented Mar 3, 2021 via email

gnufied commented Mar 3, 2021

@jdef thanks. Do you think we can get this in for the next CSI release (i.e. 1.4), or is it better suited for 1.5? I need to make some k8s changes, and if we can't make 1.4 then I would rather avoid the extra work.

gnufied commented Mar 3, 2021

Posting an update since this PR will likely miss the cut - I am waiting for some explanation from @jdef on #468 (comment). We could make the value of volume_mount_group more restricted, but it would prevent use cases where a driver may support node publishing the same volume multiple times with different gids (i.e. if I understood James correctly).

gnufied commented Apr 28, 2021

@jdef so one thing we can do to sacrifice some of the flexibility and reduce two different errors to one is to enforce that, for the same volume_id, node stage, node publish, and multiple instances of node publish on the same node always use the same volume_mount_group. That is, make it entirely unsupported to specify a different volume_mount_group for the same volume if the volume was previously staged or published with a different volume_mount_group.

This will allow us to squash the two different errors into one:

Volume already published or staged with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published or staged using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |

@gnufied gnufied force-pushed the add-gid-publish branch 3 times, most recently from b3f3c6f to 98893c5 on April 29, 2021 18:47
gnufied commented Apr 29, 2021

@jdef @saad-ali alright, the updated scope is narrower now. The spec now says all instances of NodePublishVolume must use the same gid value as previous instances of NodePublish or NodeStage RPC calls for the same volume. This also simplifies error handling. PTAL.

@saad-ali
Member

Thanks @gnufied

Can you provide examples of a couple of CSI Drivers and how they would implement this, both for Linux and for Windows? e.g. "AzureFile will convert this into a mount --foo in Linux and (?) in Windows, and CSI driver ... will do ..."

Sorry if this was mentioned elsewhere, just want to wrap my head around the change.

gnufied commented May 21, 2021

On Linux, the Azurefile CSI driver would convert the supplied volume_mount_group to a gid during the NodeStage call:

mount -o gid=volume_mount_group

On Windows there is currently no implementation of using fsgroups/gids for pods, at least in k8s. But when I spoke with @ddebroy on Slack, he suggested that the supplied group_id can be a SID that can be used to lock down the volume for a specific security identifier (https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/security-identifiers-in-windows).
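A hedged Go sketch of that Linux flow: an SMB/CIFS-style driver appending a gid= option before shelling out to mount(8). The fstype, option formatting, and helper name are illustrative assumptions, not the actual Azurefile driver code.

    package driver

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // stageSMBVolume translates the supplied volume_mount_group into a gid=
    // mount option and mounts the share at the staging path.
    func stageSMBVolume(source, stagingPath, volumeMountGroup string, extraOpts []string) error {
        opts := append([]string{}, extraOpts...)
        if volumeMountGroup != "" {
            opts = append(opts, fmt.Sprintf("gid=%s", volumeMountGroup))
        }
        out, err := exec.Command("mount", "-t", "cifs",
            "-o", strings.Join(opts, ","), source, stagingPath).CombinedOutput()
        if err != nil {
            return fmt.Errorf("mount failed: %v: %s", err, out)
        }
        return nil
    }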

@saad-ali
Member

I spoke with @gnufied offline about this feature. He explained:

The need for fsgroup in the CSI RPC call is that, for certain volume types, the only way to change volume permissions is during mount via a mount option. You can't recursively chown/chmod these files because azurefile, for example, does not support unix extensions.
We could carry a hardcode just for the azurefile CSI driver in kubelet, but it would be super ugly.
And it is not just azurefile; there will be other CIFS/Samba implementations that will need something similar.

In general I worry about adding any permissions logic to CSI for a couple of reasons: 1) I question whether manipulating file permissions actually adds any real security, and 2) ensuring file permissions are added in a way that will work across all OS/filesystem combos will be hard.

For my first concern, @gnufied explained that many apps won't work with world-readable files, so "meeting the user where they are" is one reason we need fsgroup functionality; the other argument is security* (with an asterisk because you only get security benefits if you also limit the ranges pods can have, and/or dynamically assign the ranges).

For my second concern, @gnufied explained this approach should work for both windows and linux at least for azurefile.

Given how important this is for AzureFile, I'm keen to approve this. @jdef @ddebroy any objections?

ddebroy commented May 26, 2021

LGTM. No objections based on relevance of this for Azurefile as well as any other CSI plugin that leverages SMB.

humblec commented May 27, 2021

One thing which does not give me clarity here is the reference to VOLUME_MOUNT_GROUP and what exactly it refers to. When we say the CO passes VOLUME_MOUNT_GROUP, are we referring to the (effective) GID or the supplemental GID here (https://www.openshift.com/blog/a-guide-to-openshift-and-uids)? We have seen many issues in the past due to permission denied errors on the volume from app workloads, especially for RWX volumes. One way to solve this was passing the GID via a PV annotation, which apparently gets added to the SGID group array by the CO. IMO the spec has to make it clear in which GID context (effective/supplemental) VOLUME_MOUNT_GROUP is supposed to operate, at the CO and SP level. @gnufied @saad-ali @bswartz @jdef


@jdef jdef left a comment


thanks for following up on this. i think this proposal needs a bit more work; left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP. i'm wondering if we'd benefit from some kind of thought-experiment-in-the-form-of-docs that walks through a hypothetical CO/SP integration, calling out the requisite configuration and how that manifests at runtime via CSI calls, and the resulting system state.

spec.md Outdated
@@ -2315,6 +2348,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged or published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged or published using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that all `NodePublishVolume` calls specify same value of `volume_mount_group`. |
Member

have we considered how this feature works in light of the new single_node_multi_writer functionality being proposed?

@gnufied gnufied May 27, 2021

Currently the definition of volume_mount_group is stricter and requires that all node-publish RPC calls use the same value of volume_mount_group. So basically, if T1!=T2 and P1!=P2 (where Pn includes volume_mount_group), then for single_node_multi_writer volumes node-publish should fail.

Having said that, we can relax the wording once #476 merges, or if this merges first then we can update #476. The relaxed wording would be: if an SP supports both the SINGLE_NODE_MULTI_WRITER and VOLUME_MOUNT_GROUP capabilities, then it MUST allow NodePublishing the same volume multiple times on the same node with different volume_mount_group. The CIFS implementations that I know of already support that (you can mount the same volume multiple times on the node with different gids).

The language here is, however, stricter, because we wanted to start with something strict and then relax the wording if needed (an earlier version of the design did allow the aforementioned flow).

spec.md Outdated
@@ -2134,6 +2134,19 @@ message NodeStageVolumeRequest {
// This field is OPTIONAL and MUST match the volume_context of the
// volume identified by `volume_id`.
map<string, string> volume_context = 6;

Member

this field is declared both for node-stage, and for node-publish. why not declare it ONCE as part of VolumeCapability ?

Member

or, if this is only for mount .. then perhaps as part of MountVolume?

Member

and what does this field mean in the context of block vols?

Contributor Author

It is not possible to use the mount flags defined in the volume capability (inside MountVolume) because:

  1. The flags defined inside the volume capability are static and part of the CreateVolume response, whereas this mount option depends on the workload that is trying to use the volume and could change.
  2. Even if we pretend that the volume capability can be modified at any point in time and is not tied to the Volume object, how does the CO know it should supply a -o gid=<> mount flag during NodeStage or NodePublish?

Contributor Author

hmm - I think your response meant the flag should not be included in mount_flags inside the MountVolume type, but rather be a separate field in the MountVolume type? I misunderstood it to be the former.

Yes, having a separate volume_mount_group field inside MountVolume could work. It would have the same semantic meaning as when it is defined outside (as in the current PR). The main concern though is that it would perhaps alter the meaning of volume capability. But if that is an acceptable trade-off, I can make that change.

Contributor Author

I have moved this field to VolumeCapability with everything else being the same. I think this also allows us to clean up the error handling: error handling is now simply determined by whether the SP supports publishing a volume with different capabilities on the same node. It also addresses the interaction of this PR with #476.
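For illustration, once the field lives inside VolumeCapability's MountVolume block, an SP would read it roughly as below; the accessor chain assumes the nil-safe getters of the Go bindings generated from csi.proto, so treat it as an assumption about those bindings rather than spec text.

    package driver

    import (
        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    // mountGroupFromPublish pulls volume_mount_group out of a NodePublishVolume
    // request; the generated getters return zero values on nil, so this is safe
    // even when the CO leaves the OPTIONAL field unset.
    func mountGroupFromPublish(req *csi.NodePublishVolumeRequest) string {
        return req.GetVolumeCapability().GetMount().GetVolumeMountGroup()
    }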


gnufied commented May 27, 2021

thanks for following up on this. i think this proposal needs a bit more work, left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP.

Trying to address both @humblec's and @jdef's concerns here - there is a k8s KEP which describes how this will be used: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2317-fsgroup-on-mount/README.md . Beyond that, I have written some code that uses this new field, and integration between the CO and SP is pretty straightforward.

@humblec supplemental groups and fsgroups are the CO's internal implementation details that do not necessarily bubble up to the CSI spec. But for completeness' sake, I will try and clarify here: in k8s, supplemental groups are additional groups that are added to a pod (not the primary GID). They can be used to match the gid of the underlying volume, so that k8s does not have to perform any permission/ownership change if a supplemental group already matches the volume's underlying group ownership/permissions. Supplemental groups do not result in any permission change of the volume and hence are not useful for the problem we are trying to solve.

humblec commented May 27, 2021

thanks for following up on this. i think this proposal needs a bit more work, left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP.

Trying to address both @humblec and @jdef concerns here - there is a k8s KEP which describes how this will be used - https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2317-fsgroup-on-mount/README.md . Beyond that, I have written some code that uses this new field and integration with CO and SP is pretty straightforward.

@humblec the supplemental groups and fsgroups are CO's internal implementation details that do not necessarily bubble up to CSI spec.

Sure, supplemental groups are a general (*nix) concept too, so I brought it up.

But for completeness sake, I will try and clarify here - in k8s supplemental groups are additional groups that are added to a pod (not the primary GID). They can be used to match gid of the underlying volume, so as k8s does not have to perform any permission/ownership change if supplemental group already match volume's underlying group ownership/permissions.

True, I made use of that heavily in the GlusterFS plugin under the kube CO to overcome permission denied errors and such.

Supplemental groups do not result in any permission change of the volume and hence is not useful with problem we are trying to solve.

In short, VOLUME_MOUNT_GROUP is the effective GID which the CO/SP can make use of, and the supplemental group workflow is unaffected/untouched. Thanks for clarifying @gnufied 👍

@andyzhangx
Contributor

@gnufied with this interface change, and also the k/k code change, the following fsGroup value would be passed to NodeStageVolume and NodePublishVolume, right?

spec:
  securityContext:
    fsGroup: 2000

gnufied commented Jun 2, 2021

@andyzhangx yes that is accurate.
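A minimal CO-side sketch of that mapping, assuming Linux where the pod's fsGroup is an int64 GID that becomes the decimal string carried in volume_mount_group; the helper is illustrative, and the real wiring lives in the kubelet rather than here.

    package co

    import "strconv"

    // volumeMountGroupFromFSGroup converts a pod-level fsGroup into the string
    // value the CO places in the volume_mount_group field of NodeStageVolume
    // and NodePublishVolume requests.
    func volumeMountGroupFromFSGroup(fsGroup *int64) string {
        if fsGroup == nil {
            return "" // no fsGroup set: leave the OPTIONAL field empty
        }
        return strconv.FormatInt(*fsGroup, 10)
    }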

gnufied commented Jun 4, 2021

@jdef @saad-ali @bswartz @msau42 please take another look. As discussed in #468 (comment), I have moved the new field to VolumeCapability. This will work for our use case, and it simplifies some of the API choices we had to make.


@bswartz bswartz left a comment


/lgtm


saad-ali commented Jun 4, 2021

@jdef @humblec can you PTAL?

saad-ali commented Jun 4, 2021

PTAL ASAP - trying to get this merged and a new release cut on Monday.

humblec commented Jun 5, 2021

@gnufied with this interface change, and also k/k code change, the following fsGroup value would be passed to NodeStageVolume and NodePublishVolume, right?

spec:
  securityContext:
    fsGroup: 2000

@gnufied just to complete my understanding: if the volume is RWX (multi node multi writer) and 2 different pods are scheduled on 2 different nodes specifying different fsgroup values, is the SP supposed to respect these values at mount time for the same underlying volume?

humblec commented Jun 5, 2021

@jdef @humblec can you PTAL?

Sure @saad-ali, the change looks good to me!! 👍
I have one side query here (#468 (comment)), but it should not be a blocker. Thanks!


@jdef jdef left a comment


LGTM mod nits

In the case of the SINGLE_NODE_MULTI_WRITER flow (another PR looking to land ASAP), is it legal to mount the same volume to multiple workloads on the same node w/ different volume_mount_group's? If not, should it be? I have my doubts as to whether all SPs (or even OSs) can support this. It may be worth calling out somewhere/somehow as an error case that the CO will need to deal with.

gnufied commented Jun 7, 2021

@jdef the SINGLE_NODE_MULTI_WRITER PR has wording that says a CO SHOULD NOT call NodePublishVolume with a different volume_capability for the same volume - https://github.com/container-storage-interface/spec/pull/476/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R2251

I can add more explicit wording - but I thought that language covers this limitation well enough.

@humblec Generally speaking, yes, it is possible that the CO will call node-stage/node-publish with different volume_mount_group on different nodes for the same RWX volume. Plugins that need this feature, as far as I know, support that behaviour. If it isn't supported by the SP, it may return Exceeds capabilities.


@saad-ali saad-ali left a comment


/lgtm

LGTM. No objections based on relevance of this for Azurefile as well as any other CSI plugin that leverages SMB.

LGTM mod nits

Sure @saad-ali, the change looks good to me !! 👍

Thanks folks! Will go ahead and merge this.

Apply suggestions from code review
Co-authored-by: jdef <2348332+jdef@users.noreply.github.com>
saad-ali commented Jun 7, 2021

Thanks for the rebase. Merging now.

@saad-ali saad-ali merged commit f1ae024 into container-storage-interface:master Jun 7, 2021
@saad-ali saad-ali mentioned this pull request Jun 7, 2021