Multiple pool storage classes informed by topologyKeys #559

Closed
mmgaggle opened this issue Aug 16, 2019 · 10 comments
Labels
wontfix This will not be worked on

Comments

@mmgaggle
Member

mmgaggle commented Aug 16, 2019

Describe the feature you'd like to have

Ability to have multiple pools per storage class, and use topologyKeys from the CSINode to inform pool selection.

What is the value to the end user? (why is it a priority?)

This would allow for the construction of a storage class that behaves similarly to what the Amazon EBS, Google Persistent Disk, and Azure Disk drivers provide. Those provisioners do not expose a storage class per failure domain, and we should be able to provide a similar experience. As it currently stands, a pool and storage class would need to be created for each failure domain, and the end user would need to select from them. By extension, the pod would need to follow the PV (VOLUME_ACCESSIBILITY_CONSTRAINTS), instead of the PV being created where the pod is scheduled. This is particularly problematic if node placement rules conflict with the chosen storage class.

How will we know we have a good solution? (acceptance criteria)

Users will be able to request a PVC from a single named storage-class, and have the PV procured from a pool based on the CSINode topologyKeys.
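For reference, the topologyKeys in question are the ones the RBD CSI driver registers per node via the CSINode object. A minimal sketch of what that could look like for one node follows; the node name and the zone label key are illustrative assumptions, and older clusters expose this object as storage.k8s.io/v1beta1 with the failure-domain.beta.kubernetes.io/zone label instead.

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-1-23.us-west-2.compute.internal   # hypothetical node name
spec:
  drivers:
  - name: rook-ceph.rbd.csi.ceph.com
    nodeID: ip-10-0-1-23.us-west-2.compute.internal
    topologyKeys:
    - topology.kubernetes.io/zone                  # assumed zone label key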

Additional context

Proposed storageclass.yaml (for block)

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: standard-rbd-a
  namespace: rook-ceph
spec:
  crushRoot: us-west-2a
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: standard-rbd-b
  namespace: rook-ceph
spec:
  crushRoot: us-west-2b
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: standard-rbd-c
  namespace: rook-ceph
spec:
  crushRoot: us-west-2c
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-rbd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    clusterID: rook-ceph
    imageFormat: "2"
    imageFeatures: layering
    csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    csi.storage.k8s.io/fstype: xfs
    zones:
      us-west-2a:
        pool: standard-rbd-a
      us-west-2b:
        pool: standard-rbd-b
      us-west-2c:
        pool: standard-rbd-c
reclaimPolicy: Delete
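With a class like the one above, the end-user experience would reduce to an ordinary PVC against the single storage class. A sketch, with the claim name and size being arbitrary choices:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-standard            # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard-rbd # the single zonal class from the proposal
  resources:
    requests:
      storage: 10Gi

The provisioner, not the user, would then pick standard-rbd-a/b/c based on where the consuming pod is scheduled.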
@obnoxxx

obnoxxx commented Oct 30, 2019

@mmgaggle writes

These provisioners do not expose a storage-class per failure-domain ...

I am not sure I follow: We (ceph/rook/ceph-csi) do distribute our pools across failure domains and usually don't have failure-domain specific storage classes (unless the admin explicitly sets it up like this). So what is this feature request actually intending to achieve?

@dillaman

@obnoxxx I thought the goal w/ this was topology-aware provisioning. If you want to have a failure domain of zone A (and not spread your IOs across multiple failure-domain zones A+B+C and incur additional inter-zone networking costs), right now you would need multiple storage classes to represent each possible failure domain. Instead, you should be able to have a single storage class where, upon provisioning of a PV, the CSI driver maps the zone to a backing RBD pool whose failure domain matches that zone.
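For illustration, the visible result of such a mapping would be a dynamically provisioned PV carrying a node affinity for the zone whose pool backs it (this is how VOLUME_ACCESSIBILITY_CONSTRAINTS surface on the Kubernetes side). A sketch, with the PV name, volumeHandle, and zone label key all being assumptions rather than driver output:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0123abcd                       # hypothetical generated name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: standard-rbd
  csi:
    driver: rook-ceph.rbd.csi.ceph.com
    volumeHandle: 0001-placeholder-handle  # placeholder, not a real handle
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone # assumed zone label key
          operator: In
          values:
          - us-west-2a                     # the zone whose pool backs this PV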

@mykaul
Contributor

mykaul commented Oct 31, 2019

@dillaman - who created those different RBD pools in the 1st place?

@dillaman

@mykaul Rook would create both the per-AZ pools and the inter-AZ pool, along with the two storage classes -- one for locally redundant storage and the other for zone-redundant storage (to borrow Azure terminology).
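A rough sketch of what the zone-redundant half of that pairing could look like, assuming the per-AZ pools and the zonal class follow the proposal above (names are illustrative and the CSI secret parameters are omitted for brevity):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: regional-rbd             # hypothetical "regional" pool name
  namespace: rook-ceph
spec:
  failureDomain: zone            # replicas spread across zones, default CRUSH root
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-rbd             # zone-redundant storage class
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: regional-rbd
reclaimPolicy: Delete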

@obnoxxx

obnoxxx commented Oct 31, 2019

@dillaman well, to my understanding, we're not usually creating one pool/SC per zone but spreading one pool across failure domains as much as possible for availability. Of course we can work on affinity, but the topology-aware provisioning feature of Kubernetes sounds orthogonal to what we're doing.

Of course you could do the multiple-pools thing (to save costs etc.), doing a 90-degree turn from what we're currently doing. In that case it does make sense. But what I'm never too sure about: what happens if the pod that's using this PVC is rescheduled to a different zone?...

Well, in the end, ceph-csi doesn't set up the pools, so it is fair to say that it should be able to properly deal with any situation.

@mmgaggle
Member Author

mmgaggle commented Nov 2, 2019

The idea is to have a ‘regional’ storage class that is backed by a pool spread across multiple AZs (default CRUSH root, zone failure domain). We currently do this downstream.

The proposed new ability is to also have per-AZ pools that map to a single ‘standard’/‘zonal’ storage class.

Just like with EBS, if a pod is defined with an affinity, the PVC will be satisfied by a dynamically provisioned PV in the affine zone (WaitForFirstConsumer). If the pod moves, that affinity will ensure the pod stays in the same zone as the PVC. If the pod does not have an affinity definition, then the pod will follow the PV (round-robin dynamic provisioning) due to the VOLUME_ACCESSIBILITY_CONSTRAINTS.
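A sketch of the binding side of that behaviour, assuming the single zonal class from the proposal; volumeBindingMode: WaitForFirstConsumer delays provisioning until the pod is scheduled, and the pod, affinity values, and image here are illustrative (storage class parameters omitted for brevity):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-rbd
provisioner: rook-ceph.rbd.csi.ceph.com
volumeBindingMode: WaitForFirstConsumer   # PV is provisioned where the pod lands
reclaimPolicy: Delete
---
apiVersion: v1
kind: Pod
metadata:
  name: zonal-app                         # hypothetical pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone   # assumed zone label key
            operator: In
            values:
            - us-west-2a
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-standard            # PVC against the zonal class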

ShyamsundarR added a commit to ShyamsundarR/ceph-csi that referenced this issue Dec 19, 2019
Proposal document addressing the following issues,
Updates: ceph#440, ceph#559

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@Madhu-1
Collaborator

Madhu-1 commented Mar 13, 2020

Adding here for reference: a comment by @ShyamsundarR on https://github.com/ceph/ceph-csi/pull/760/files#r368081823

The changes to the StorageClass as specified in #559 will not work, as the StorageClass parameters are of type map[string]string. Passing a more complex structure in here will thus not work (thanks to @JohnStrunk for pointing it out).

Instead, a single key:value pair is proposed as below, with a JSON structure in the value that can be parsed and used by the plugin.

The new StorageClass parameter detailing the pools and their topology is as follows:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
parameters:
  topologyConstrainedPools: |
    [{"poolName":"pool0",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone0"}]},
     {"poolName":"pool1",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone1"}]}
    ]

ceph-csi-bot pushed a commit to ShyamsundarR/ceph-csi that referenced this issue Aug 17, 2020
Proposal document addressing the following issues,
Updates: ceph#440, ceph#559

Signed-off-by: ShyamsundarR <srangana@redhat.com>
@humblec
Collaborator

humblec commented Sep 30, 2020

We have topology-aware support with the current code; however, we have to revisit/improve it and move it to beta support. @ShyamsundarR can give more details on this.

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label (This will not be worked on) on Jul 21, 2021
@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
