[Proposal] Topology aware provisioning support for Ceph-CSI #760

Closed
232 changes: 232 additions & 0 deletions docs/design/proposals/TopologySupport.md
@@ -0,0 +1,232 @@
# Topology aware provisioning support for Ceph-CSI

This document details the design around adding topology aware provisioning support to ceph CSI drivers.

## Definitions

**NOTE:** Taken from the kubernetes "[Volume Topology-aware Scheduling](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md#definitions)" design proposal.

- Topology: Rules to describe accessibility of an object with respect to location in a cluster.
- Domain: A grouping of locations within a cluster. For example, 'node1', 'rack10', 'zone5'.
- Topology Key: A description of a general class of domains. For example, 'node', 'rack', 'zone'.
- Hierarchical domain: Domain that can be fully encompassed in a larger domain. For example, the 'zone1' domain can be fully encompassed in the 'region1' domain.
- Failover domain: A domain that a workload intends to run in at a later time.

## Use case

The need to add topology support to the Ceph CSI drivers comes from the following use case:

Data access costs (performance and network) are uneven across OSDs, depending on where the primary OSD of each placement group (PG) is located in the CRUSH map for the pool(s) backing the volumes.

- The read problem for cross-domain OSDs
  - Assuming different PGs have different OSDs as their primary OSD, data reads are served from these primary OSDs
  - As a result, the client accesses different OSDs, which may not be co-located in the same domain as the client
  - This in turn can cause,
    - Increased network latency, impacting performance, when the primary OSD is further away or in a different domain than the client even though an OSD exists within the client's domain
    - Network bandwidth costs in cloud or other environments, where such cross-domain access is charged based on usage
- The write problem for cross-domain OSDs
  - Writes to OSDs need to cross domain boundaries to meet replication/disperse requirements; this cost is unavoidable and is required to retain cross-domain availability of data and accessibility of the volume
  - To optimize write costs, the OSDs backing a pool can be selected from a single domain
    - But the pool may then become unavailable if the domain is not accessible, which is the trade-off if writes also need to be optimized

To resolve issues such as the one above, we need to introduce topology aware provisioning support, to dynamically provision volumes using pools whose primary OSDs across PGs, or all OSDs in the PG, are set up to be co-located within a single domain. This enables a transparent choice for applications that are consuming persistent storage from Ceph, instead of requesting storage from a particular domain (say via named StorageClasses in kubernetes).

**NOTE:** Creation of pools with all primary OSDs in all PGs co-located within a single domain, or all OSDs co-located within a single domain, is not in scope of this document

## Topology support in CSI specification

CSI nodeplugins report topology labels that "Specifies where (regions, zones, racks, etc.) the node is accessible from." in response to the nodeplugin NodeGetInfo RPC [1]. COs would typically use this information, along with topology information returned via the CreateVolumeResponse, to schedule workloads on nodes that support the specific topology that the volume was created to be accessed from.

CSI controller plugins CreateVolume RPC receives create requests with a `TopologyRequirement` [2] that specifies where the volume must be accessible from. The response to the CreateVolume RPC in turn contains the `Topology` from where the provisioned volume is accessible. The returned `Topology` should adhere to the requested `TopologyRequirement` such that at least some parts of the requirement are satisfied.

**NOTE:** (for completeness) The CSI identity service (part of both the controller and node plugins) declares topology support by the CSI driver via the GetPluginCapabilities RPC [3].

The above pretty much sums up topology related information provided in the CSI spec as such. The questions around implementation that arise are,

#### Where does the CO get the information required to fill in the TopologyRequirement in a CreateVolume request?

The COs determine supported topologies by the CSI driver, by forming a set of domain labels that all node plugins of the said driver advertise. This is used as needed in a CreateVolume request, all the way from restricting the volume to be accessible from a specific topology, or sent as a whole for the CSI controller plugin to decide which topologies the created volume can be accessible from.

#### How do nodeplugins decide which topology they support or even belong to, such that the same can be advertised?

This is slightly more convoluted, as nodeplugins need knowledge of which node they are running on, and what is the domain definition for that node. Both pieces of information, nodeid and domain of the node, are passed back to the COs via the NodeGetInfo RPC.

#### How does the CSI plugin choose where to allocate the volume from?

Given a `CreateVolume` request with topology constraints, the CSI controller plugin needs to decide where to provision the volume from. Typically, cloud providers add the constraint to the request that is further made to their management API servers [11]. Also, as the nodes are running within the said cloud, the node domain is set to reflect what is known by the cloud providers. In the case of Ceph-CSI, the plugins need to be aware of the domain affinity of various pools, to make an educated selection on which pool to create the volume on.

## Topology aware volume implementation in kubernetes

**NOTE:** Kubernetes CSI developer guide provides this [4] section for topology support and how it is built into kubernetes CSI. Elements from the same are not detailed in this section.

**NOTE:** There is also a design document that details kubernetes topology aware volume implementation that can be found here [12]

The following discussion is around how kubernetes passes around topology requirements and stores the same, and as a result, requirements for CSI plugins to be aware of.

#### Kubernetes StorageClass parameter `volumeBindingMode`[5]:

Kubernetes StorageClass defines the parameter `volumeBindingMode` that supports values of `Immediate` and `WaitForFirstConsumer`.

- When `volumeBindingMode` is `WaitForFirstConsumer`, the provision request is made to the CSI controller plugin after the kubernetes scheduler decides which node to schedule the Pod on. Such `CreateVolume` requests will come in with a `TopologyRequirement` that contains the domain as advertised by the corresponding nodeplugin, listed ahead of all other domains supported by all nodeplugins in the `preferred` section **[reference needed]**.
- When `volumeBindingMode` is `Immediate`, the `CreateVolume` request is sent a `TopologyRequirement` that is a union of all domains as advertised by all nodeplugins.

Due to the above, the values in `TopologyRequirement` may range from a singleton to all supported domains, and thus, the distribution of volumes evenly across pools that support the said domains becomes a requirement of the CSI plugin, when the `volumeBindingMode` is `Immediate`.

#### Kubernetes StorageClass parameter `allowedTopologies`[6]:

As per the kubernetes documentation "When a cluster operator specifies the WaitForFirstConsumer volume binding mode, it is no longer necessary to restrict provisioning to specific topologies in most situations. However, if still required, allowedTopologies can be specified."

If `allowedTopologies` is specified, then further restrictions on the primary domain for the provisioned volume need to be applied by the CSI provisioner.

**NOTE:** For the current implementation, `allowedTopologies` is not planned to be supported.

#### Topology details stored by kubernetes on the PV [9]

Kubernetes stores the volume topology that is passed back in the response to a successful `CreateVolume` request in the PV. As PVs are immutable post creation, this ties the PV to the topologies that were sent in the response.

Further, the topology constraint that is stored with the PV only contains a `Required` section, thus any node satisfying the topology returned by `CreateVolume` request can be chosen to schedule the pod on. This requires the CSI driver to respond with a singleton `Required` domain value, even when the volume can be accessed across domains, albeit unevenly, to restrict the scheduling of the pod to the required domain.

Thus, when a domain fails, PVs that are tied to that domain cannot be scheduled on other nodes, unless there is a takeover of the failed domain by other nodes running in other domains.

**NOTE:** `Preferred` topology field is a future concern as of now [9], and may possibly come with weights that can help skewing the domain preference, while returning all topologies that the volume can be accessed from in the response to a `CreateVolume` request.

## Ceph-CSI topology support design

Supporting topology aware provisioning in Ceph-CSI reduces to solving the following problems as a result of the above discussion,

### Determining node domain by the CSI nodeplugin

Currently the Ceph-CSI nodeplugins have no information regarding the domains they belong to. The `node_id`, in `NodeGetInfo` response, itself is picked up from the pod manifest [7]. Kubernetes maintains failure-domains[8], but these cannot be passed in via the pod manifest as yet [13], as these are not supported by the downward APIs[14] (unlike the `node-id` for example).

The CSI nodeplugins also do not have any specific domain knowledge to present themselves, and would hence rely on the domain information of the cluster that it runs on.

The proposal is to feed the CSI nodeplugins a list of labels that they can read from the CO to determine their domain, and to return the same in their response to a `NodeGetInfo` request.

Review comment:

Rook is already scheduling OSDs by zone (and it will soon need to create new RBD/CephFS pools tied to inter-zone vs intra-zone CRUSH rules), so I would think it could also create stateful sets for at least the node plugins w/ the appropriate zone tag passed in as a parameter to the container. For the controller, it would just need to provide all available zones (which again Rook should know about).

Contributor Author

Controller need not know about the domains, it will be a list of domains as advertised by the nodeplugins.

I am not sure how Rook would pass domain labels specific to the node as a parameter, based on where the stateful set instance is started. As these would be different for each node in the cluster. @travisn any ideas?

The current proposal, which states that we can pass a list of labels that need to be read from the node, can also be revised to make it an InitContainer (as proposed by @JohnStrunk). This can then read labels of the node, and/or other labels/domain-data of interest from other sources (say Rook based data in the cluster) and pass them via a shared config map to the nodeplugin container.

@dillaman dillaman Dec 20, 2019

It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

Contributor Author

> It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

Ok, why would we do this? Instead of using the node labels and an Init container? I see Rook also leveraging domain information from kubernetes, so finding it difficult to justify why we need Rook or the admin (in the absence of rook) feed this to CSI.

In the extraneous case of additional domain hierarchy, than what is present as labels on the node is to be used, the questions would be where do these come from and how does Rook know which node belongs to the said additional domain. (and even in this case the Init container can leverage the same methods to determine and pass the data to CSI)

Member

Rook sets up the CRUSH map based on the node labels described here. But Rook isn't in the picture when the pod is provisioning the storage, it's just the CSI driver. The CSI driver should be able to see what node it's on and then look up the topology labels for the node.


For example, when the CO is kubernetes, the following may be passed as an ordered list of domain labels to advertise. The order of labels ensures hierarchical domain relationships. The nodeplugin would read the values of these labels, via the kubernetes client API, on the node where it is running and return them in the `NodeGetInfo` response.

```
# ceph csi ... --orchestrator="kubernetes" --domain-labels="failure-domain.beta.kubernetes.io/region;failure-domain.beta.kubernetes.io/zone;failure-domain.mycluster.io/rack" ...
```
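
As a rough illustration of the label handling described above, the sketch below splits a `--domain-labels` value and maps each label to the value found on the node. It is only a sketch under stated assumptions: `topology_segments` is a hypothetical helper, the node labels are passed in as a plain dict instead of being fetched through the kubernetes client API, and the `topology.<driver-name>/<domain>` key form is assumed from the review discussion, not taken from an implementation.

```python
from typing import Dict


def topology_segments(domain_labels: str, node_labels: Dict[str, str],
                      driver_name: str) -> Dict[str, str]:
    """Build the topology segments a nodeplugin could return from NodeGetInfo.

    domain_labels is the ';'-separated, ordered list from --domain-labels.
    The result keeps that order (dicts preserve insertion order).
    """
    segments: Dict[str, str] = {}
    for label in (part.strip() for part in domain_labels.split(";")):
        if not label:
            continue  # tolerate a trailing ';'
        if label not in node_labels:
            raise KeyError(f"node is missing domain label {label!r}")
        # Assumption: the domain name is the last path component of the label.
        domain = label.rsplit("/", 1)[-1]
        segments[f"topology.{driver_name}/{domain}"] = node_labels[label]
    return segments


if __name__ == "__main__":
    labels = {
        "failure-domain.beta.kubernetes.io/region": "east",
        "failure-domain.beta.kubernetes.io/zone": "zone1",
    }
    wanted = ("failure-domain.beta.kubernetes.io/region;"
              "failure-domain.beta.kubernetes.io/zone")
    # "rbd.csi.ceph.com" is used here purely as an example driver name.
    print(topology_segments(wanted, labels, "rbd.csi.ceph.com"))
```

Failing loudly on a missing label is a design choice of this sketch; a nodeplugin could equally decide to advertise only the labels it finds.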

**NOTE:** Current proposal is to support only kubernetes as the orchestrator

### StorageClass changes to denote pools with domain affinity

As noted earlier, creation of pools that have a domain affinity, is out of scope for the CSI plugins. However, when presented with a set of pools that have the said property, a `CreateVolume` request with `accessibility_requirements` specified, needs to choose the right pool to create the image in (in the case of RBD), or to redirect file's data to (in the case of CephFS).

The proposal is to add the specified pools and their domain affinity to the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass are detailed in this issue [15].

Contributor Author

The changes to the StorageClass as specified in #559 will not work, as the StorageClass parameters are of type map[string]string. Thus passing a more complex structure in here will not work (thanks to @JohnStrunk for pointing it out).

Instead, a single key:value pair is proposed as below, with a JSON structure in the value that can be parsed and used by the plugin.

New StorageClass parameter to detail pools and their topology is as follows,

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
parameters:
  topologyConstrainedPools: |
    [{"poolName":"pool0",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone0"}]},
     {"poolName":"pool1",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone1"}]}
    ]
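
A minimal sketch of how a plugin might parse such a value, assuming the parameter arrives as a plain JSON string. `TopologyPool` and `parse_topology_pools` are hypothetical names used only for illustration, and the duplicate-name check is an added assumption, not part of the proposal.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TopologyPool:
    pool_name: str
    segments: Dict[str, str]  # domainLabel -> value, e.g. {"zone": "zone0"}


def parse_topology_pools(raw: str) -> List[TopologyPool]:
    """Parse the JSON carried in the topologyConstrainedPools parameter."""
    pools: List[TopologyPool] = []
    for entry in json.loads(raw):
        segments = {seg["domainLabel"]: seg["value"]
                    for seg in entry["domainSegments"]}
        pools.append(TopologyPool(entry["poolName"], segments))
    # Assumption: a pool may appear only once in the list.
    names = [pool.pool_name for pool in pools]
    if len(names) != len(set(names)):
        raise ValueError("duplicate poolName in topologyConstrainedPools")
    return pools


if __name__ == "__main__":
    raw = """
    [{"poolName":"pool0",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone0"}]},
     {"poolName":"pool1",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone1"}]}
    ]
    """
    for pool in parse_topology_pools(raw):
        print(pool.pool_name, pool.segments)
```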

Collaborator

how to handle erasure-coded pools in this case? are we going to have one more field for pool type?

Review comment:

Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

Contributor Author

> Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

For pools that are fully domain constrained (for read and write traffic to be restricted to within the domain), an erasure coded pool option/opportunity exists. (this is also coded up in the implementation).

For only read affinity requirements, an erasure coded pool will not work as stated, as by definition reads need to reach out to multiple OSDs, unlike replicated pools.


Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools, as we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.

Contributor Author

This requires some changes to where the CSI and image OMaps are maintained,

  • The CSI volumes OMap will belong to the global pool
  • The per-image OMap will belong in the pool where the image is stored
  • The OMap further needs to carry the parent/child pool names, for operations that only pass in the VolumeID and not the volume_context (like DeleteVolume and while processing parent images for VolumeSource)

Thus, the proposal is to add one further key to the csi.volumes. journal that stores the poolname the created image is stored in. Also, the csi.volume.UUID OMap will add a poolname key to point back to the pool where the CSI OMap is present.
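
To make the two back-pointers concrete, here is a small in-memory model. It is a sketch only: nested dicts stand in for RADOS OMaps, and the key names (`csi.imagepool`, `csi.journalpool`, `csi.volname`) and helper names are hypothetical placeholders, not the keys the driver uses.

```python
from typing import Dict

# pool name -> OMap object name -> key/value pairs; a stand-in for RADOS OMaps.
Cluster = Dict[str, Dict[str, Dict[str, str]]]


def record_volume(cluster: Cluster, journal_pool: str, image_pool: str,
                  request_name: str, uuid: str) -> None:
    """Record a new volume in both OMaps, with a pool pointer in each direction."""
    # Volumes OMap in the single journal pool: request name -> UUID, plus the
    # (hypothetical) extra key naming the pool that holds the image.
    volumes = cluster.setdefault(journal_pool, {}).setdefault("csi.volumes", {})
    volumes[request_name] = uuid
    volumes[f"{uuid}/csi.imagepool"] = image_pool

    # The per-image OMap lives next to the image and points back to the journal pool.
    image_omap = cluster.setdefault(image_pool, {}).setdefault(
        f"csi.volume.{uuid}", {})
    image_omap["csi.volname"] = request_name
    image_omap["csi.journalpool"] = journal_pool


def image_pool_for(cluster: Cluster, journal_pool: str, uuid: str) -> str:
    """Resolve the image pool from a UUID alone, as DeleteVolume would need to."""
    return cluster[journal_pool]["csi.volumes"][f"{uuid}/csi.imagepool"]


if __name__ == "__main__":
    cluster: Cluster = {}
    record_volume(cluster, "journal-pool", "pool1", "pvc-1234", "uuid-1")
    print(image_pool_for(cluster, "journal-pool", "uuid-1"))
```

Writing the same record twice leaves the model unchanged, which mirrors the idempotency the journal is meant to provide.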

Collaborator

as this is the hard requirement we need to document it and we have to mention that pool parameter is only used for omap CSI metadata storage


The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.

Review comment:

Are you only allowed to run a single (active) controller or will this new topology aware provisioning allow you to run an active controller per topology domain? If the former, it seems like you could effectively isolate the domains from each other and therefore you wouldn't need a centralized "global" inter-domain pool.

Contributor Author

We can run multiple controllers, but routing of requests to these controllers would be at random by the CO and not follow any domain rules. Every controller instance that advertises the same driver name can get the provision request.

The alternate is to split these into their own StorageClass, and hence use the domain pool to maintain the journal, but that would defeat the transparent single StorageClass for users to use.

NOTE: The default pool to store the CSI journal need not necessarily be inter-zone, at least in this proposal. The user can choose one of the domain pools as the journal store, instead of having to create an inter-zone pool just for the journal needs (if the said journal pool is inaccessible, though, provisioning would halt, but not stage/publish of other domain volumes, as the image OMap would reside on the same pool as the image).

Review comment:

Under AWS, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

Contributor Author

> Under AWS, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

No, in AWS and Azure a PV is namespaced to a zone, in GCP it can pick 2 zones if replication is enabled.


**NOTE:** CephFS does not need a special `pool` parameter as we store the CSI journal on the metadata pool backing the `fsname` that is passed, which is a singleton. Hence, the same will continue to be leveraged, even when CephFS data needs to be redirected to different pools.

Collaborator

ceohfs/cephfs


#### Limitations

- If a new domain is added to the cluster, then the StorageClass must be changed to provide an additional pool that has an affinity to that domain. By default the CSI plugin would otherwise choose to treat the provisioning as if `Immediate` was specified as the `volumeBindingMode`

#### Alternatives

- Use pool labels (if pools can carry user provided labels in Ceph), to filter pools based on labels and their values. Makes adding a pool for a newer domain easier than meddling with the StorageClass, and also keeps the StorageClass leaner.

Review comment:

While you can add a limited amount of application metadata to pools, I would be hesitant to suggest it over just tweaking the StorageClass

Contributor Author

It does complicate the operation on the CSI end as well, as we need to search all pools and then determine ones that match the domain of interest (this also needs increased privileges?), on the other end of the spectrum it reduces what an admin needs to know to setup a StorageClass, we can pretty much drop pool requirements from the StorageClass if we could read csi specific labels for all the pools.

I am not initially taking this approach, but wanted to leave this as an alternative for discussion.

Review comment:

Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

Contributor Author

> Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

The proposal for the StorageClass is as per the above preference. IOW, a pool is selected from the StorageClass based on which AZ the pod needs to run in.


### Balancing volume allocation when `TopologyRequirement` includes more than a single domain

As elaborated in "Kubernetes StorageClass parameter `volumeBindingMode`", both `Immediate` and `WaitForFirstConsumer` `volumeBindingMode`s can request multiple domains in the `TopologyRequirement`. Thus, the CSI plugin has to make a choice on which domain to provision the volume from when the binding requested is `Immediate`.

**NOTE:** For `volumeBindingMode` `WaitForFirstConsumer` the first domain specified in the `preferred` list of topologies would be chosen, as this is the domain where the pod would be scheduled.

The proposal is to choose a domain at random and provision the volume on the pool the domain belongs to. Further, return an empty `accessible_topology` in the response, as the volume can be accessed from any domain.

@dillaman dillaman Dec 19, 2019

Perhaps just allocate it via the "default" inter-zone pool if you aren't tying it to a topology domain. Otherwise, we would most likely guess incorrectly.

Contributor Author

Yes, that is an easier initial implementation, reduces confusion on where a request went to.

Contributor Author

This is not feasible. IOW, defaulting to the default inter-zone pool. The reason is due to lack of information passed to the CSI plugin about the volumeBindingMode.

When a PV is requested using the SC that has topology based pools parameter defined, there is no way to distinguish what the volumeBindingMode is, and the requested topology requirement contains all values as returned by the various nodeplugins.

As a result, if a PV request comes in based on an SC where topology based pools are specified, we will default to the first topology segment that is passed in the preferred section of the request (or the first in the requisite section if preferred is empty).
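
The selection rule described above can be sketched as follows. This is an illustration under assumptions, not the plugin's code: `pick_pool` is a hypothetical helper, pools are plain dicts, and the request's topology segments are assumed to be keyed by the same domain labels as the pools (any driver-specific key prefix already stripped).

```python
from typing import Dict, List, Tuple

Segments = Dict[str, str]  # domain label -> value, e.g. {"zone": "zone0"}


def pick_pool(pools: List[dict], preferred: List[Segments],
              requisite: List[Segments]) -> Tuple[str, Segments]:
    """Pick the pool for the first preferred topology, else the first requisite one.

    Returns the pool name and the topology to report back for the volume.
    """
    candidates = preferred if preferred else requisite
    if not candidates:
        raise ValueError("request carries no topology requirement")
    target = candidates[0]
    for pool in pools:
        # A pool matches when every one of its domain segments agrees with the target.
        if all(target.get(label) == value
               for label, value in pool["segments"].items()):
            return pool["name"], target
    raise LookupError(f"no pool has affinity to {target}")


if __name__ == "__main__":
    pools = [
        {"name": "pool0", "segments": {"region": "vagrant", "zone": "zone0"}},
        {"name": "pool1", "segments": {"region": "vagrant", "zone": "zone1"}},
    ]
    preferred = [{"region": "vagrant", "zone": "zone1"},
                 {"region": "vagrant", "zone": "zone0"}]
    print(pick_pool(pools, preferred, requisite=preferred))
```

Returning the chosen topology alongside the pool name keeps the response tied to a single domain, in line with the earlier discussion of what is stored on the PV.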


#### Alternatives

- We could be smart here in the future, based on available space, density and other such parameters that are periodically read from the Ceph backend, and do a fairer allocation across pools.

### Domain takeover and giveback

As kubernetes PVs are immutable, and as discussed in "Topology details stored by kubernetes on the PV [9]" the returned topology is stored in the PV, it would be required to maintain availability of the volumes when a domain becomes unavailable.

The thought around addressing this is to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross-domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to give back the domain.

Review comment:

They wouldn't be --- if I specified to launch a "cheaper" PVC in us-west-2a to avoid inter-zone traffic costs, my PV wouldn't be reachable from us-west-2b.

Contributor Author

PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?

  • For the backend ceph storage here is what I am thinking,

If we have a read optimized pool, IOW primary OSDs fall into the same domain, then if said primary domain is inaccessible, we can still access the pool from one of the other domains. In such cases we (as in ceph-csi) can support accessing the cheaper PVC from the surviving domains, iff the user wants the same and will provide a takeover rule for action.

If we have a write optimized pool, the assumption is all OSDs belong to the same domain. As a result, domain unavailability cannot be tackled, as the pool also may become inaccessible, so there is no point in taking over such a domain by surviving domains. If even in this case the pool is accessible (for whatever reason that breaks the domain availability constraint between the CO and the ceph cluster), the user can add a takeover rule for the surviving CO domains.

  • From the CO perspective, as long as the nodeplugins in the surviving domains advertise the currently unavailable domain as supported, we should be fine.

The topology key that ceph-csi (or any CSI driver) would advertise would be like topology.<driver-name>/<domain> and so will not conflict with existing domain labels as understood by the CO.

Review comment:

> PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?
>
> * For the backend ceph storage here is what I am thinking,

If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

Contributor Author

> > PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?
> >
> > * For the backend ceph storage here is what I am thinking,
>
> If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

Agreed, this applies if all the OSDs in the PGs were also local to the same domain and fail with the domain.

When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

@dillaman dillaman Dec 20, 2019

> When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

Just for clarity, there is no such thing as purely read-optimized. You can maybe create a read/write optimized pool, though, with some careful tweaking [1] -- but I'm not sure anyone has any immediate plans for that(?).

Plus, how would you then differentiate in the topology request for a purely zone X PG vs a read/write optimized pool for zone X? Would that just be a choice that you can either configure this as single zone or zone optimized topologies?

[1] https://www.osris.org/article/2019/03/01/ceph-osd-site-affinity

Review comment:

as of [1], yes we are. the plan is to do the following:
a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625 ) that creates pools with local read affinity.
b. then have Rook utilize this, when creating pools (by looking at the SC)
c. and then this, ie - add topology awareness to Ceph-CSI.

as of topology aware SC, I think that:
a. local only (single AZ) - will have a single failure domain semantic. so it is easy to distinguish.
b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity". which will also enhance readability of such object as it is more clear to viewer what this SC is about.

I also think a 'normal' (ie not read affined) pool that spans 3 AZs can be just the same affined to any of those AZs without any performance degradations (or any other implications). So probably, the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (ie default mode should be "ReadAffinity")

Just one small comment: while a different storage class for AZ-local pools and inter-AZ pools is OK, I would prefer that for inter-AZ read affined pools (we will have 3 of them) we have only a single SC, with the CSI driver selecting the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs, but I understood there are other reasons to break this 1:1 relation and make it 1:n. If this can be done now it is preferred, as we don't want to be stuck with this limitation, which was introduced for the wrong reasons.

Contributor Author

> As for [1], yes, we are. The plan is to do the following:
> a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625) that creates pools with local read affinity.
> b. then have Rook utilize this when creating pools (by looking at the SC)
> c. and then this, i.e. add topology awareness to Ceph-CSI.

@dillaman The above should clarify why we need the domain takeover and giveback, for local read affinity.

> As for the topology aware SC, I think that:
> a. local only (single AZ) will have single failure domain semantics, so it is easy to distinguish.
> b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity", which will also improve the readability of such an object, as it makes it clearer to the viewer what this SC is about.
>
> I also think a 'normal' (i.e. not read affined) pool that spans 3 AZs can just as well be affined to any of those AZs without any performance degradation (or any other implications). So probably the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (i.e. the default mode should be "ReadAffinity").

@yuvalk I did not follow what the conclusion of default mode above means.

From the perspective of the design, the StorageClass (or otherwise) will call out the pools and their domain affinities. So for read affine pools this would be the primary AZ, for write affine pools these will be again a single AZ.

From the perspective of domain takeover on AZ failures, this is applicable for read affine pools only, and (in the absence of Rook) will have to be specified in the takeover map by the admin.

Contributor Author

> Just one small comment: while a different storage class for AZ-local pools and inter-AZ pools is OK, I would prefer that for inter-AZ read affined pools (we will have 3 of them) we have only a single SC, with the CSI driver selecting the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs, but I understood there are other reasons to break this 1:1 relation and make it 1:n. If this can be done now it is preferred, as we don't want to be stuck with this limitation, which was introduced for the wrong reasons.

The design is as per the above, with the StorageClass calling out the various AZ affined pools and CSI making the choice based on the request (which will contain the pod scheduling AZ information).

Member

@mmgaggle mmgaggle Jan 17, 2020

I think having distinct storage classes for the inter-zone with read-affinity and intra-zone PVs makes sense. This is basically how Google has approached the problem -

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: faster
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
```

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd

```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: regionalpd-storageclass
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: regional-pd
  zones: europe-west1-b, europe-west1-c
```

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/regional-pd


The proposal is to add a domain takeover map that is passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain labels that should take over for it. The nodeplugins, in addition to advertising their domains as read from the domain labels of the running node, will also advertise the domains that they need to take over.
Contributor Author

As we can have a mix of pools and StorageClasses, some read-affined (can be served in other domains when the primary domain fails) and others write-affined (can only be served in their primary domain), we would need 2 domain labels registered by the node plugins when this feature is implemented:

- One that can float, for use with read-affined pools
- Another that is pinned and cannot be taken over, for use with write-affined pools

The volumes created would leverage the domain labels in the StorageClass topologyConstrainedPools to distinguish between the two.


When a failed domain recovers, the corresponding entry in the domain takeover map is deleted, resulting in a dynamic update to all nodeplugins regarding the giveback of the said domain.
Contributor Author

One caveat here is that Kubernetes does not remove the CSI nodeplugin labels placed on the node. Thus, after a giveback, the node labels need to be deleted manually.

Overall, takeover and giveback will possibly be implemented in the next phase of this feature, based on an analysis of the operations that an admin needs to perform to make this feature useful.


As an example, the CSI Kubernetes config map is updated as follows for a domain takeover:
```
{
  "version": 2,
  "topology-takeover": {
    "<source-domain-1>": [
      "<target-domain-1>",
      "<target-domain-2>"
    ],
    "<source-domain-2>": [
      "all"
    ],
    "<source-domain-3>": [
      "<target-domain-2>"
    ],
    ...
  },
  "clusters": [
    {"clusterID1": "<cluster-id>", ...},
    ...
  ]
}
```

**NOTE:** topology of a node is common across all ceph clusters that the nodeplugin instance may interact with

#### Limitations

- Need to understand frequency of `NodeGetInfo` calls, such that the CO is updated regarding supported domains by the various nodeplugins on a dynamic takeover or giveback action
- As of now, kubernetes calls `NodeGetInfo` once and has no mechanism of updating this information without a plugin restart

### Snapshots and clone considerations

- **TODO**

## Rook implications

The following are high-level assumptions about the features that Rook would automate:
- Feeding the domain labels that are to be read from Kubernetes to the CSI nodeplugins
- Creating the required pools, backed by OSDs with the required domain affinities
- Automatic domain takeover and giveback handling when a domain becomes unavailable and when it recovers

## References

[1] [CSI spec NodeGetInfo RPC](https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetinfo)

[2] [CSI spec CreateVolume RPC](https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume)

[3] [CSI spec GetPluginCapabilities RPC](https://github.com/container-storage-interface/spec/blob/master/spec.md#getplugincapabilities)

[4] [Kubernetes CSI topology details for CSI implementors](https://kubernetes-csi.github.io/docs/topology.html)

[5] [Kubernetes StorageClass `volumeBindingMode` parameter](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode)

[6] [Kubernetes StorageClass `allowedTopologies` parameter](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies)

[7] [Pod manifest feeding node ID to CSI](https://github.com/ceph/ceph-csi/blob/2c9d7114638d3bac043781e300cbb7fdcecff3db/deploy/rbd/kubernetes/v1.14%2B/csi-rbdplugin.yaml#L71-L74)

[8] [Kubernetes failure-domains](https://kubernetes.io/docs/reference/kubernetes-api/labels-annotations-taints/#failure-domainbetakubernetesiozone)

[9] Kubernetes topology labels on the PV:

- [Kubernetes topology design: volume topology specification](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md#volume-topology-specification)
- [Kubernetes PV topology information](https://github.com/kubernetes/kubernetes/blob/fcc35b046860ab03851b53ff34a10f6ee0cdecf9/pkg/apis/core/types.go#L308-L311)

[10] [Ceph-CSI config map reference](https://github.com/ceph/ceph-csi/blob/master/examples/csi-config-map-sample.yaml)

[11] Cloud providers forwarding topology requirements:

- [AWS CSI topology handling](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/7255ebd9d57120729011ffc0c059af6b867e5b13/pkg/driver/controller.go#L149-L180)
- **TODO Azure/GCE link?**

[12] [Kubernetes Volume topology-aware design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md)

[13] [Lack of node labels access via the downward APIs](https://github.com/kubernetes/kubernetes/issues/40610)

[14] [Kubernetes downward API](https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/)

[15] [Multiple pool storage classes informed by topologyKeys](https://github.com/ceph/ceph-csi/issues/559)