[Proposal] Topology aware provisioning support for Ceph-CSI #760

ShyamsundarR · 2019-12-19T19:20:08Z

Proposal document addressing the following issues,
Updates: #440, #559

Signed-off-by: ShyamsundarR srangana@redhat.com

ShyamsundarR · 2019-12-19T19:25:46Z

Requesting reviews and discussion from: @travisn @JohnStrunk @mmgaggle @dillaman @batrick @humblec @Madhu-1 @nixpanic

dillaman · 2019-12-19T20:29:28Z

docs/design/proposals/TopologySupport.md

+
+The CSI nodeplugins also do not have any specific domain knowledge to present themselves, and would hence rely on the domain information of the cluster that it runs on.
+
+The proposal is to, feed CSI nodeplugins a list of labels that it can read from the COs to determine its domain and return the same in it's response to a `NodeGetInfo` request.


Rook is already scheduling OSDs by zone (and it will soon need to create new RBD/CephFS pools tied to inter-zone vs intra-zone CRUSH rules), so I would think it could also create stateful sets for at least the node plugins w/ the appropriate zone tag passed in as a parameter to the container. For the controller, it would just need to provide all available zones (which again Rook should know about).

Controller need not know about the domains, it will be a list of domains as advertised by the nodeplugins.

I am not sure how Rook would pass domain labels specific to the node as a parameter, based on where the stateful set instance is started. As these would be different for each node in the cluster. @travisn any ideas?

The current proposal, that states we can pass a list of labels that needs to be read from the node, also can be revised to make it an InitContainer (as propsed by @JohnStrunk). This can then read labels of the node, and/or other labels/domain-data of interest from other sources (say Rook based data in the cluster) and pass them via a shared config map to the nodeplugin container.

It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

Ok, why would we do this? Instead of using the node labels and an Init container? I see Rook also leveraging domain information from kubernetes, so finding it difficult to justify why we need Rook or the admin (in the absence of rook) feed this to CSI.

In the extraneous case of additional domain hierarchy, than what is present as labels on the node is to be used, the questions would be where do these come from and how does Rook know which node belongs to the said additional domain. (and even in this case the Init container can leverage the same methods to determine and pass the data to CSI)

Rook sets up the CRUSH map based on the node labels described here. But Rook isn't in the picture when the pod is provisioning the storage, it's just the CSI driver. The CSI driver should be able to see what node it's on and then look up the topology labels for the node.

dillaman · 2019-12-19T20:34:26Z

docs/design/proposals/TopologySupport.md

+
+Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools. As, we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.
+
+The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.


Are you only allowed to run a single (active) controller or will this new topology aware provisioning allow you to run an active controller per topology domain? If the former, it seems like you could effectively isolate the domains from each other and therefore you wouldn't need a centralized "global" inter-domain pool.

We can run multiple controllers, but routing of requests to these controllers would be at random by the CO and not follow any domain rules. Every controller instance that advertises the same driver name can get the provision request.

The alternate is to split these into their own StorageClass, and hence use the domain pool to maintain the journal, but that would defeat the transparent single StorageClass for users to use.

NOTE: The default pool to store the CSI journal need not necessarily be inter-zone, at least in this proposal, the user can choose one of the domain pools as the journal store, instead of having to create an inter-zone pool for just the journal needs (if the said journal pool is in accessible though, then provisioning would halt, but not stage/publish of other domain volumes, as the image OMap would reside on the same pool as the image).

Under AWZ, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

Under AWZ, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

No, in AWS and Azure a PV is namespaced to a zone, in GCP it can pick 2 zones if replication is enabled.

dillaman · 2019-12-19T20:35:38Z

docs/design/proposals/TopologySupport.md

+
+#### Alternatives
+
+- Use pool labels (if pools can carry user provided labels in Ceph), to filter pools based on labels and their values. Makes adding a pool for a newer domain easier than meddling with the StorageClass, and also keeps the StorageClass leaner.


While you can add a limited amount of application metadata to pools, I would be hesitant to suggest it over just tweaking the StorageClass

It does complicate the operation on the CSI end as well, as we need to search all pools and then determine ones that match the domain of interest (this also needs increased privileges?), on the other end of the spectrum it reduces what an admin needs to know to setup a StorageClass, we can pretty much drop pool requirements from the StorageClass if we could read csi specific labels for all the pools.

I am not initially taking this approach, but wanted to leave this an an alternative for discussion.

Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

The proposal for the StorageClass is as per the above preference. IOW, a pool is selected from the StorageClass based on which AZ the pod needs to run in.

dillaman · 2019-12-19T20:37:06Z

docs/design/proposals/TopologySupport.md

+
+**NOTE:** For `volumeBindingMode` `WaitForFirstConsumer` the first domain specified in the `preferred` list of topologies would be chosen, as this is the domain where the pod would be scheduled.
+
+The proposal is to, choose a domain at random and provision the volume on the pool the domain belongs to. Further, return an empty `accessible_topology` in the response, as the volume can be accessed from any domain.


Perhaps just allocate it via the "default" inter-zone pool if you aren't tying it to a topology domain. Otherwise, we would most likely guess incorrectly.

Yes, that is an easier initial implementation, reduces confusion on where a request went to.

This is not feasible. IOW, defaulting to the default inter-zone pool. The reason is due to lack of information passed to the CSI plugin about the volumeBindingMode.

When a PV is requested using the SC that has topology based pools parameter defined, there is no way to distinguish what the volumeBindingMode is, and the requested topology requirement contains all values as returned by the various nodeplugins.

As a result, if a PV request comes in based on an SC where topology based pools are specified, we will default to the first topology segment that is passed in the preferred section of the request (or the first in the requisite section if preferred is empty).

dillaman · 2019-12-19T20:38:18Z

docs/design/proposals/TopologySupport.md

+
+As, kubernetes PVs are immutable and as discussed in "Topology details stored by kubernetes on the PV [9]" the returned topology is stored in the PV, when a domain becomes unavailable it would be required to maintain availability of the volumes.
+
+The thought around addressing this is, to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to giveback the domain.


They wouldn't be --- if I specified to launch a "cheaper" PVC in us-west-2a to avoid inter-zone traffic costs, my PV wouldn't be reachable from us-west-2b.

PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?

For the backend ceph storage here is what I am thinking,

If we have a read optimized pool, IOW primary OSDs fall into the same domain, then if said primary domain is inaccessible, we can still access the pool from one of the other domains. In such cases we (as in ceph-csi) can support accessing the cheaper PVC from the surviving domains, iff the user wants the same and will provide a takeover rule for action.

If we have a write optimized pool, the assumption is all OSDs belong to the same domain. As a result, domain unavailability cannot be tackled, as the pool also may become inaccessible, so there is not point in taking over such a domain by surviving domains. If even in this case the pool is accessible (for whatever reason that breaks the domain availability constraint between the CO and the ceph cluster), the user can add a takeover rule for the surviving CO domains.

From the CO perspective, as long as the nodeplugins in the surviving domains advertise the currently unavailable domain as supported, we should be fine.

The topology key that ceph-csi (or any CSI driver) would advertise would be like topology.<driver-name>/<domain> and so will not conflict with existing domain labels as understood by the CO.

PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?

* For the backend ceph storage here is what I am thinking,

If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?

* For the backend ceph storage here is what I am thinking,

If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

Agreed, this applies if all the OSDs in the PGs were also local to the same domain and fail with the domain.

When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

Just for clarity, there is no such thing as purely read-optimized. You can maybe create a read/write optimized pool, though, with some careful tweaking [1] -- but I'm not sure anyone has any immediate plans for that(?).

Plus, how would you then differentiate in the topology request for a purely zone X PG vs a read/write optimized pool for zone X? Would that just be a choice that you can either configure this as single zone or zone optimized topologies?

[1] https://www.osris.org/article/2019/03/01/ceph-osd-site-affinity

as of [1], yes we are. the plan is to do the following:
a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625 ) that create pools with local read affinity.
b. then have Rook utilize this, when creating pools (by looking at the SC)
c. and then this, ie - add topology awareness to Ceph-CSI.

as of topology aware SC, I think that:
a. local only (single AZ) - will have a single failure domain semantic. so it is easy to distinct.
b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity". which will also enhance readability of such object as it is more clear to viewer what this SC is about.

I also think a 'normal' (ie not read affined) pool that span 3 AZs can be just the same affined to any of that AZs without any performance degradations (or any other implications). So probably, the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (ie default mode should be "ReadAffinity")

Just one small comment, while different storage class for AZ-local pool and inter-AZ pools is OK, I prefer that for inter-AZ read affined pools (we will have 3 of them) we will have only a single SC and CSI driver will select the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs but I understood there are other reasons to break this 1:1 relation and mak e it 1:n. If this can be done now it is preferred as we don't want to stick with this limitation which was done for the wrong reasons.

as of [1], yes we are. the plan is to do the following:
a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625 ) that create pools with local read affinity.
b. then have Rook utilize this, when creating pools (by looking at the SC)
c. and then this, ie - add topology awareness to Ceph-CSI.

@dillaman The above should clarify why we need the domain takeover and giveback, for local read affinity.

as of topology aware SC, I think that:
a. local only (single AZ) - will have a single failure domain semantic. so it is easy to distinct.
b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity". which will also enhance readability of such object as it is more clear to viewer what this SC is about.

I also think a 'normal' (ie not read affined) pool that span 3 AZs can be just the same affined to any of that AZs without any performance degradations (or any other implications). So probably, the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (ie default mode should be "ReadAffinity")

@yuvalk I did not follow what the conclusion of default mode above means.

From the perspective of the design, the StorageClass (or otherwise) will call out the pools and their domain affinities. So for read affine pools this would be the primary AZ, for write affine pools these will be again a single AZ.

From the perspective of domain takeover on AZ failures, this is applicable for read affine pools only, and (in the absence of Rook) will have to be specified in the takeover map by the admin.

Just one small comment, while different storage class for AZ-local pool and inter-AZ pools is OK, I prefer that for inter-AZ read affined pools (we will have 3 of them) we will have only a single SC and CSI driver will select the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs but I understood there are other reasons to break this 1:1 relation and mak e it 1:n. If this can be done now it is preferred as we don't want to stick with this limitation which was done for the wrong reasons.

The design is as per the above, with the StorageClass calling out the various AZ affined pools and CSI making the choice based on the request (which will contain the pod scheduling AZ information).

I think having distinct storage classes for the inter-zone with read-affinity and intra-zone PVs makes sense. This is basically how Google has approached the problem -

apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: faster provisioner: kubernetes.io/gce-pd parameters: type: pd-ssd

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd

kind: StorageClass apiVersion: storage.k8s.io/v1 metadata: name: regionalpd-storageclass provisioner: kubernetes.io/gce-pd parameters: type: pd-standard replication-type: regional-pd zones: europe-west1-b, europe-west1-c

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/regional-pd

ShyamsundarR · 2019-12-20T14:01:42Z

Adding @yuvalk and @JoshSalomon for attention.

humblec · 2020-01-05T07:35:18Z

@ShyamsundarR couple of observations about this proposal. While its really good to explain more on "topology aware provisioning" in general or CO specific terminologies ( section before Ceph-CSI topology support design) , Isnt it good to put it on a seperate reference/doc then add a referenceor point to the existing docs ?

This feature is really bound to the layers above/below CSI in many aspects , that said, an orchestrator like Rook, Ceph cluster configurtion..etc. There is a small section in this doc about Rook, https://github.com/ceph/ceph-csi/pull/760/files#diff-133c3c07046f779e3502b6b5f3f1236eR189, however IMO it would be better to detail these bullets or populate the issues/design in Rook about the changes required. It would also become handy if we list a general section regardless of what the orchestrator is on top which has effect on topology aware provisioning in CSI.

humblec · 2020-01-05T07:37:21Z

@travisn whenever you get some time, can you please review this proposal? thanks in advance

ShyamsundarR · 2020-01-17T18:45:51Z

docs/design/proposals/TopologySupport.md

+
+As noted earlier, creation of pools that have a domain affinity, is out of scope for the CSI plugins. However, when presented with a set of pools that have the said property, a `CreateVolume` request with `accessibility_requirements` specified, needs to choose the right pool to create the image in (in the case of RBD), or to redirect file's data to (in the case of CephFS).
+
+The proposal is to, add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass is detailed in this issue [15].


The changes to the StorageClass as specified in #559 will not work, as the StorageClass parameters are of type map[string]string. Thus passing a more complex structure in here will not work (thanks to @JohnStrunk for pointing it out).

Instead a single key:value pair is proposed as below, that has a JSON structue in the value that can be parsed and used by the plugin.

New StorageClass parameter to detail pools and their topology is as follows,

apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: csi-rbd-sc parameters: topologyConstrainedPools: | "[{"poolName":"pool0", "domainSegments":[ {"domainLabel":"region","value":"vagrant"}, {"domainLabel":"zone","value":"zone0"}]}, {"poolName":"pool1", "domainSegments":[ {"domainLabel":"region","value":"vagrant"}, {"domainLabel":"zone","value":"zone1"]} ]"

how to handle erasure-coded pools in this case? are we going to have one more field for pool type?

Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

For pools that are fully domain constrained (for read and write traffic to be restricted to within the domain), an erasure coded pool option/opportunity exists. (this is also coded up in the implementation).

For only read affinity requirements, an erasure coded pool will not work as stated, as by definition reads needs to reach out to multiple OSDs, unlike replicated pools.

ShyamsundarR · 2020-01-17T18:49:52Z

docs/design/proposals/TopologySupport.md

+
+The thought around addressing this is, to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to giveback the domain.
+
+The proposal is to add a domain takeover map, that passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain label that should takeover the same. The nodeplugins, in addition to advertizing their domains as read from the domain labels of the running node, will also advertize the domains that they need to takeover.


As we can have a mix of pools and StorageClasses, some which are read-affined (can be served in other domains when primary domain fails), or write-affined (can only be served in their primary domain), when this feature is implemented we would need 2 domain labels registered by the node plugins.

One which can float, for use with read-affined pools

Other that is pinned and cannot be taken over, for use with write-affined pools

The volumes created would leverage the domain labels in the StorageClass topologyConstrainedPools to distinguish the same.

ShyamsundarR · 2020-01-17T18:54:23Z

docs/design/proposals/TopologySupport.md

+
+The proposal is to, add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass is detailed in this issue [15].
+
+Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools. As, we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.


This requires some changes to where the CSI and image OMaps are maintained,

The CSI volumes OMap will belong to the global pool

The per-image OMap will belong in the pool where the image is stored

The OMap further needs to carry the parent/child pool names, for operations that only pass in the VolumeID and not the volume_context (like DeleteVolume and while processing parent images for VolumeSource)

Thus, the proposal is to add one further key to the csi.volumes. journal that stores the poolname the created image is stored in. Also, the csi.volume.UUID OMap will add a poolname key to point back to the pool where the CSI OMap is present.

as this is the hard requirement we need to document it and we have to mention that pool parameter is only used for omap CSI metadata storage

ShyamsundarR · 2020-01-17T18:56:15Z

docs/design/proposals/TopologySupport.md

+
+The proposal is to add a domain takeover map, that passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain label that should takeover the same. The nodeplugins, in addition to advertizing their domains as read from the domain labels of the running node, will also advertize the domains that they need to takeover.
+
+When a failed domain recovers, the corresponding entry in the domain takeover map is deleted, resulting in a dynamic update to all nodeplugins regarding the giveback of the said domain.


One caveat here is that, kubernetes does not remove the CSI nodeplugin labels placed on the node. Thus after a giveback, the node labels need to be manually deleted.

Overall takeover and giveback will possibly be implemented in the next phase of this feature, based on analysis of operations that need to be performed by and admin to make this feature useful.

Madhu-1 · 2020-03-13T06:58:20Z

docs/design/proposals/TopologySupport.md

+
+The proposal is to, add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass is detailed in this issue [15].
+
+Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools. As, we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.


as this is the hard requirement we need to document it and we have to mention that pool parameter is only used for omap CSI metadata storage

Madhu-1 · 2020-03-13T06:59:12Z

docs/design/proposals/TopologySupport.md

+
+As noted earlier, creation of pools that have a domain affinity, is out of scope for the CSI plugins. However, when presented with a set of pools that have the said property, a `CreateVolume` request with `accessibility_requirements` specified, needs to choose the right pool to create the image in (in the case of RBD), or to redirect file's data to (in the case of CephFS).
+
+The proposal is to, add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass is detailed in this issue [15].


how to handle erasure-coded pools in this case? are we going to have one more field for pool type?

Madhu-1 · 2020-03-13T07:01:27Z

docs/design/proposals/TopologySupport.md

+
+The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.
+
+**NOTE:** CephFS does not need a special `pool` parameter as we store the CSI journal on the metadata pool backing the `fsname` that is passed, which is a singleton. Hence, the same will continue to be leveraged, even when ceohfs data needs to be redirected to different pools.


ceohfs/cephfs

nixpanic · 2020-04-21T14:48:20Z

@ShyamsundarR what is the current state of this document? Topology support has been merged with #816...

humblec · 2020-05-29T07:45:12Z

@ShyamsundarR what is the current state of this document? Topology support has been merged with #816...

@ShyamsundarR appreciated if you can revisit this , so that we can get this merged.

ShyamsundarR · 2020-05-29T13:24:15Z

@ShyamsundarR what is the current state of this document? Topology support has been merged with #816...

@ShyamsundarR appreciated if you can revisit this , so that we can get this merged.

It's in my backlog, but currently other priorities and tasks are more overwhelming, so I will revisit this when I get a breather.

Proposal document addressing the following issues, Updates: ceph#440, ceph#559 Signed-off-by: ShyamsundarR <srangana@redhat.com>

stale · 2020-10-17T20:27:24Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in a month if no further activity occurs. Thank you for your contributions.

stale · 2020-11-22T08:13:14Z

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

dimm0 · 2021-03-24T01:38:18Z

Is this still not implemented?

ShyamsundarR · 2021-03-24T13:30:08Z

Is this still not implemented?

This PR implements the initial support for the same. I thought I had added required documentation, but these are only present in the YAML files as comments on how to configure the same, here are the RBD references,

ceph-csi/deploy/rbd/kubernetes/csi-rbdplugin.yaml

Lines 59 to 63 in eea97ca

    
           # If topology based provisioning is desired, configure required 
        
           # node labels representing the nodes topology domain 
        
           # and pass the label names below, for CSI to consume and advertise 
        
           # its equivalent topology domain 
        
           # - "--domainlabels=failure-domain/region,failure-domain/zone"

ceph-csi/examples/rbd/storageclass.yaml

Lines 7 to 10 in eea97ca

    
           # If topology based provisioning is desired, delayed provisioning of 
        
           # PV is required and is enabled using the following attribute 
        
           # For further information read TODO<doc> 
        
           # volumeBindingMode: WaitForFirstConsumer

ceph-csi/examples/rbd/storageclass.yaml

Lines 86 to 105 in eea97ca

    
           # Add topology constrained pools configuration, if topology based pools 
        
           # are setup, and topology constrained provisioning is required. 
        
           # For further information read TODO<doc> 
        
           # topologyConstrainedPools: | 
        
           #   [{"poolName":"pool0", 
        
           #     "dataPool":"ec-pool0" # optional, erasure-coded pool for data 
        
           #     "domainSegments":[ 
        
           #       {"domainLabel":"region","value":"east"}, 
        
           #       {"domainLabel":"zone","value":"zone1"}]}, 
        
           #    {"poolName":"pool1", 
        
           #     "dataPool":"ec-pool1" # optional, erasure-coded pool for data 
        
           #     "domainSegments":[ 
        
           #       {"domainLabel":"region","value":"east"}, 
        
           #       {"domainLabel":"zone","value":"zone2"}]}, 
        
           #    {"poolName":"pool2", 
        
           #     "dataPool":"ec-pool2" # optional, erasure-coded pool for data 
        
           #     "domainSegments":[ 
        
           #       {"domainLabel":"region","value":"west"}, 
        
           #       {"domainLabel":"zone","value":"zone1"}]} 
        
           #   ]

obnoxxx · 2021-03-29T11:08:43Z

@ShyamsundarR - shouldn't this staled PR for the design doc be reopened, and progressed to be merged?

ShyamsundarR · 2021-03-29T12:28:50Z

@ShyamsundarR - shouldn't this staled PR for the design doc be reopened, and progressed to be merged?

Ideally yes, and merged for reference at least in the future.

Currently this is not allowed to be reopened as master branch is renamed to devel so a new PR would be required. I will add it to my backlog.

jayasai470 · 2021-10-05T20:30:45Z

any idea when this will be merged

ShyamsundarR · 2021-10-06T18:14:19Z

any idea when this will be merged

This is the design proposal, which while ideal to be merged is/was not. The implementation is present in the code. Are you looking to address specifics in the design proposal or the implementation?

jayasai470 · 2021-10-06T21:45:41Z

I was looking into the implementation, nvm found the doc https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#osd-topology

travisn · 2021-10-06T22:00:45Z

@jayasai470 That doc is how to setup the CRUSH hierarchy for Ceph to follow the topology layout for the different zones or other failure domains. In addition to setting that up, I understood that you also need read affinity for the volumes from a certain AZ, so you would still need this csi feature.
@ShyamsundarR Where can we find the usage of this csi feature for read affinity?

mykaul · 2021-10-07T13:11:18Z

@travisn - copy-paste of an answer I got from @idryomov that is relevant:
I believe it can be tested today with the help of mapOptions parameter on the storage class (https://github.com/ceph/ceph-csi/blob/devel/docs/deploy-rbd.md#configuration). Stand up a stretch cluster, make sure that the CRUSH map reflects that the cluster is split between e.g. datacenters "dc1" and "dc2" and set mapOptions to e.g. "read_from_replica=localize,crush_location=datacenter:dc1". Any PV created from this storage class would simulate a client that is local to "dc1" and reads from that client would go to OSDs located in "dc1" (whenever possible -- in some cases the client doesn't get to choose and has to direct the read to the primary OSD).

dillaman reviewed Dec 19, 2019

View reviewed changes

ShyamsundarR commented Jan 17, 2020

View reviewed changes

ShyamsundarR mentioned this pull request Jan 17, 2020

Adds per volume encryption with Vault integration #762

Merged

ShyamsundarR mentioned this pull request Feb 4, 2020

Add topology support to ceph-csi #816

Merged

Madhu-1 reviewed Mar 13, 2020

View reviewed changes

nixpanic added component/docs Issues and PRs related to documentation enhancement New feature or request labels Apr 21, 2020

Topology aware provisioning support for Ceph-CSI

31b0766

Proposal document addressing the following issues, Updates: ceph#440, ceph#559 Signed-off-by: ShyamsundarR <srangana@redhat.com>

ceph-csi-bot force-pushed the topology-csi-proposal branch from 8b34371 to 31b0766 Compare August 17, 2020 09:55

stale bot added the wontfix This will not be worked on label Oct 17, 2020

stale bot closed this Nov 22, 2020

humblec mentioned this pull request Aug 10, 2021

csi-rbdplugin - find fastest node? #2385

Closed

travisn mentioned this pull request Oct 7, 2021

CSI driver documentation on AZ read affinity rook/rook#8935

Closed


		The CSI nodeplugins also do not have any specific domain knowledge to present themselves, and would hence rely on the domain information of the cluster that it runs on.

		The proposal is to, feed CSI nodeplugins a list of labels that it can read from the COs to determine its domain and return the same in it's response to a `NodeGetInfo` request.


		Further, the CSI journal that is maintained as an OMap within a single pool needs to continue even when the volumes are allocated from different pools. As, we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.

		The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.


		#### Alternatives

		- Use pool labels (if pools can carry user provided labels in Ceph), to filter pools based on labels and their values. Makes adding a pool for a newer domain easier than meddling with the StorageClass, and also keeps the StorageClass leaner.


		NOTE: For `volumeBindingMode` `WaitForFirstConsumer` the first domain specified in the `preferred` list of topologies would be chosen, as this is the domain where the pod would be scheduled.

		The proposal is to, choose a domain at random and provision the volume on the pool the domain belongs to. Further, return an empty `accessible_topology` in the response, as the volume can be accessed from any domain.


		As, kubernetes PVs are immutable and as discussed in "Topology details stored by kubernetes on the PV [9]" the returned topology is stored in the PV, when a domain becomes unavailable it would be required to maintain availability of the volumes.

		The thought around addressing this is, to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to giveback the domain.


		As noted earlier, creation of pools that have a domain affinity, is out of scope for the CSI plugins. However, when presented with a set of pools that have the said property, a `CreateVolume` request with `accessibility_requirements` specified, needs to choose the right pool to create the image in (in the case of RBD), or to redirect file's data to (in the case of CephFS).

		The proposal is to, add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass is detailed in this issue [15].


		The thought around addressing this is, to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to giveback the domain.

		The proposal is to add a domain takeover map, that passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain label that should takeover the same. The nodeplugins, in addition to advertizing their domains as read from the domain labels of the running node, will also advertize the domains that they need to takeover.


		The proposal is to add a domain takeover map, that passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain label that should takeover the same. The nodeplugins, in addition to advertizing their domains as read from the domain labels of the running node, will also advertize the domains that they need to takeover.

		When a failed domain recovers, the corresponding entry in the domain takeover map is deleted, resulting in a dynamic update to all nodeplugins regarding the giveback of the said domain.


		The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.

		NOTE: CephFS does not need a special `pool` parameter as we store the CSI journal on the metadata pool backing the `fsname` that is passed, which is a singleton. Hence, the same will continue to be leveraged, even when ceohfs data needs to be redirected to different pools.

[Proposal] Topology aware provisioning support for Ceph-CSI #760

[Proposal] Topology aware provisioning support for Ceph-CSI #760

Conversation

ShyamsundarR commented Dec 19, 2019

ShyamsundarR commented Dec 19, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dillaman Dec 20, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dillaman Dec 19, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dillaman Dec 20, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mmgaggle Jan 17, 2020 • edited Loading

Choose a reason for hiding this comment

ShyamsundarR commented Dec 20, 2019

humblec commented Jan 5, 2020 • edited Loading

humblec commented Jan 5, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nixpanic commented Apr 21, 2020

humblec commented May 29, 2020

ShyamsundarR commented May 29, 2020

stale bot commented Oct 17, 2020

stale bot commented Nov 22, 2020

dimm0 commented Mar 24, 2021

ShyamsundarR commented Mar 24, 2021

obnoxxx commented Mar 29, 2021

ShyamsundarR commented Mar 29, 2021

jayasai470 commented Oct 5, 2021 • edited Loading

ShyamsundarR commented Oct 6, 2021

jayasai470 commented Oct 6, 2021 • edited Loading

travisn commented Oct 6, 2021

mykaul commented Oct 7, 2021

dillaman Dec 20, 2019 •

edited

Loading

dillaman Dec 19, 2019 •

edited

Loading

dillaman Dec 20, 2019 •

edited

Loading

mmgaggle Jan 17, 2020 •

edited

Loading

humblec commented Jan 5, 2020 •

edited

Loading

jayasai470 commented Oct 5, 2021 •

edited

Loading

jayasai470 commented Oct 6, 2021 •

edited

Loading