[Proposal] Topology aware provisioning support for Ceph-CSI #760

Closed
wants to merge 1 commit into from

ShyamsundarR
Contributor

Proposal document addressing the following issues,
Updates: #440, #559

Signed-off-by: ShyamsundarR <srangana@redhat.com>

@ShyamsundarR
Contributor Author

Requesting reviews and discussion from: @travisn @JohnStrunk @mmgaggle @dillaman @batrick @humblec @Madhu-1 @nixpanic


The CSI nodeplugins also have no domain knowledge of their own to present, and would hence rely on the domain information of the cluster that they run on.

The proposal is to feed the CSI nodeplugins a list of labels that they can read from the CO to determine their domain, and to return the same in their response to a `NodeGetInfo` request.
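
As a rough illustration of the label-to-topology mapping, here is a minimal Python sketch. The function name, the example label names, the driver name, and the rule of using the part of the label after the `/` as the domain name are all assumptions made for illustration; the key shape `topology.<driver-name>/<domain>` is the one discussed elsewhere in this thread.

```python
def topology_segments(node_labels, domain_labels, driver_name):
    """Build the topology segments a nodeplugin could return from NodeGetInfo.

    node_labels   -- all labels found on the node the plugin runs on
    domain_labels -- the label names the plugin was told to read
    driver_name   -- CSI driver name used to namespace the topology keys
    """
    segments = {}
    for label in domain_labels:
        if label not in node_labels:
            raise KeyError(f"node is missing domain label {label!r}")
        # Assumption: the domain name is the part of the label after the last "/".
        domain = label.rsplit("/", 1)[-1]
        segments[f"topology.{driver_name}/{domain}"] = node_labels[label]
    return segments


# Hypothetical node labels; only the two domain labels are consumed.
node_labels = {
    "failure-domain/region": "east",
    "failure-domain/zone": "zone1",
    "kubernetes.io/hostname": "worker-0",
}
segments = topology_segments(
    node_labels,
    ["failure-domain/region", "failure-domain/zone"],
    "rbd.csi.ceph.com",
)
print(segments)
```

Failing loudly on a missing label is a design choice in this sketch; a real plugin may prefer to start without topology support instead.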

Rook is already scheduling OSDs by zone (and it will soon need to create new RBD/CephFS pools tied to inter-zone vs intra-zone CRUSH rules), so I would think it could also create stateful sets for at least the node plugins w/ the appropriate zone tag passed in as a parameter to the container. For the controller, it would just need to provide all available zones (which again Rook should know about).

Contributor Author

The controller need not know about the domains; it receives the list of domains as advertised by the nodeplugins.

I am not sure how Rook would pass node-specific domain labels as a parameter based on where the stateful set instance is started, as these would be different for each node in the cluster. @travisn any ideas?

The current proposal, which states that we pass a list of labels to be read from the node, can also be revised to use an InitContainer (as proposed by @JohnStrunk). This can then read the labels of the node, and/or other labels/domain data of interest from other sources (say Rook based data in the cluster), and pass them via a shared config map to the nodeplugin container.

@dillaman Dec 20, 2019

It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

Contributor Author

> It could create multiple stateful sets -- one per zone. It would then use filters to lock the stateful set pods to the corresponding zone.

Ok, why would we do this instead of using the node labels and an Init container? I see Rook also leveraging domain information from kubernetes, so I find it difficult to justify why we need Rook, or the admin (in the absence of Rook), to feed this to CSI.

In the unusual case where a domain hierarchy beyond what is present as labels on the node is to be used, the questions would be where these come from and how Rook knows which node belongs to the said additional domain (and even in this case the Init container can leverage the same methods to determine and pass the data to CSI).

Member

Rook sets up the CRUSH map based on the node labels described here. But Rook isn't in the picture when the pod is provisioning the storage, it's just the CSI driver. The CSI driver should be able to see what node it's on and then look up the topology labels for the node.


Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools, as we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.

The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.

Are you only allowed to run a single (active) controller or will this new topology aware provisioning allow you to run an active controller per topology domain? If the former, it seems like you could effectively isolate the domains from each other and therefore you wouldn't need a centralized "global" inter-domain pool.

Contributor Author

We can run multiple controllers, but routing of requests to these controllers would be at random by the CO and not follow any domain rules. Every controller instance that advertises the same driver name can get the provision request.

The alternate is to split these into their own StorageClass, and hence use the domain pool to maintain the journal, but that would defeat the transparent single StorageClass for users to use.

NOTE: The default pool to store the CSI journal need not necessarily be inter-zone. At least in this proposal, the user can choose one of the domain pools as the journal store, instead of having to create an inter-zone pool just for the journal (if the said journal pool is inaccessible, though, provisioning would halt, but not stage/publish of other domain volumes, as the image OMap would reside on the same pool as the image).

Under AWS, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

Contributor Author

> Under AWS, Azure, and GCP, are all PVs globally namespaced even if they are tied to a specific zone?

No, in AWS and Azure a PV is namespaced to a zone, in GCP it can pick 2 zones if replication is enabled.


#### Alternatives

- Use pool labels (if pools can carry user provided labels in Ceph), to filter pools based on labels and their values. Makes adding a pool for a newer domain easier than meddling with the StorageClass, and also keeps the StorageClass leaner.

While you can add a limited amount of application metadata to pools, I would be hesitant to suggest it over just tweaking the StorageClass

Contributor Author

It does complicate the operation on the CSI end as well, as we need to search all pools and then determine the ones that match the domain of interest (this also needs increased privileges?). On the other end of the spectrum, it reduces what an admin needs to know to set up a StorageClass; we can pretty much drop pool requirements from the StorageClass if we could read CSI specific labels for all the pools.

I am not initially taking this approach, but wanted to leave this as an alternative for discussion.

Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

Contributor Author

> Please note that we prefer having one storage class, and multiple pools so we can assign pool to the pod after it is scheduled to AZ. The alternative (storage class per AZ affined pool) requires some level of static design/management by the admin which we prefer to avoid.

The proposal for the StorageClass is as per the above preference. IOW, a pool is selected from the StorageClass based on which AZ the pod needs to run in.


**NOTE:** For `volumeBindingMode` `WaitForFirstConsumer` the first domain specified in the `preferred` list of topologies would be chosen, as this is the domain where the pod would be scheduled.

The proposal is to choose a domain at random and provision the volume on the pool that the domain belongs to. Further, return an empty `accessible_topology` in the response, as the volume can be accessed from any domain.

@dillaman Dec 19, 2019

Perhaps just allocate it via the "default" inter-zone pool if you aren't tying it to a topology domain. Otherwise, we would most likely guess incorrectly.

Contributor Author

Yes, that is an easier initial implementation, reduces confusion on where a request went to.

Contributor Author

Defaulting to the default inter-zone pool is not feasible. The reason is the lack of information passed to the CSI plugin about the volumeBindingMode.

When a PV is requested using the SC that has topology based pools parameter defined, there is no way to distinguish what the volumeBindingMode is, and the requested topology requirement contains all values as returned by the various nodeplugins.

As a result, if a PV request comes in based on an SC where topology based pools are specified, we will default to the first topology segment that is passed in the preferred section of the request (or the first in the requisite section if preferred is empty).
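
The fallback rule described here can be sketched in Python as below. The function name is hypothetical, and the request is modelled as a plain dict that mirrors the `preferred` and `requisite` fields of the CSI topology requirement, each entry carrying a `segments` map.

```python
def select_topology(requirement):
    """Pick the topology segments to provision against.

    `requirement` mirrors the CSI topology requirement as a dict with
    optional "preferred" and "requisite" lists of {"segments": {...}} entries.
    """
    # First entry of "preferred" wins; fall back to the first "requisite" entry.
    for field in ("preferred", "requisite"):
        topologies = requirement.get(field) or []
        if topologies:
            return topologies[0]["segments"]
    return None  # no topology constraint was passed in


requirement = {
    "requisite": [
        {"segments": {"zone": "zone0"}},
        {"segments": {"zone": "zone1"}},
    ],
    "preferred": [
        {"segments": {"zone": "zone1"}},
        {"segments": {"zone": "zone0"}},
    ],
}
chosen = select_topology(requirement)
print(chosen)
```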


As kubernetes PVs are immutable and, as discussed in "Topology details stored by kubernetes on the PV [9]", the returned topology is stored in the PV, it would be required to maintain availability of the volumes when a domain becomes unavailable.

The thought around addressing this is to dynamically inform surviving nodeplugins in other domains to take over the failed domain, as Ceph pools are essentially cross-domain accessible. Further, in the event of the failed domain becoming available again, dynamically inform the nodeplugins to give back the domain.

They wouldn't be --- if I specified to launch a "cheaper" PVC in us-west-2a to avoid inter-zone traffic costs, my PV wouldn't be reachable from us-west-2b.

Contributor Author

PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?

  • For the backend ceph storage here is what I am thinking,

If we have a read optimized pool, IOW primary OSDs fall into the same domain, then if said primary domain is inaccessible, we can still access the pool from one of the other domains. In such cases we (as in ceph-csi) can support accessing the cheaper PVC from the surviving domains, iff the user wants the same and will provide a takeover rule for action.

If we have a write optimized pool, the assumption is that all OSDs belong to the same domain. As a result, domain unavailability cannot be tackled, as the pool may also become inaccessible, so there is no point in surviving domains taking over such a domain. If even in this case the pool is accessible (for whatever reason that breaks the domain availability constraint between the CO and the ceph cluster), the user can add a takeover rule for the surviving CO domains.

  • From the CO perspective, as long as the nodeplugins in the surviving domains advertise the currently unavailable domain as supported, we should be fine.

The topology key that ceph-csi (or any CSI driver) would advertise would be like topology.<driver-name>/<domain> and so will not conflict with existing domain labels as understood by the CO.

> PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?
>
> * For the backend ceph storage here is what I am thinking,

If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

Contributor Author

> > PV reachability from the perspective of accessing the backing ceph store (image or subvolume), or from the perspective of the CO?
> >
> > * For the backend ceph storage here is what I am thinking,
>
> If I create a PVC and bind it to a specific zone (because I want cheaper storage), it would not be available from another zone if the zone fails. Your CRUSH rules would keep the PGs all within the same zone.

Agreed, this applies if all the OSDs in the PGs were also local to the same domain and fail with the domain.

When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

@dillaman Dec 20, 2019

> When this is not the case (say a pool that is read optimized to a domain, by restricting the primary OSD of each PG to belong to the same domain, whereas secondary OSDs are spread to different domains), the pool would be available even on that domain failure, right?

Just for clarity, there is no such thing as purely read-optimized. You can maybe create a read/write optimized pool, though, with some careful tweaking [1] -- but I'm not sure anyone has any immediate plans for that(?).

Plus, how would you then differentiate in the topology request for a purely zone X PG vs a read/write optimized pool for zone X? Would that just be a choice that you can either configure this as single zone or zone optimized topologies?

[1] https://www.osris.org/article/2019/03/01/ceph-osd-site-affinity

As for [1], yes we are. The plan is to do the following:
a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625) that creates pools with local read affinity.
b. then have Rook utilize this when creating pools (by looking at the SC)
c. and then this, i.e. add topology awareness to Ceph-CSI.

As for the topology aware SC, I think that:
a. local only (single AZ) will have single failure domain semantics, so it is easy to distinguish.
b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity", which will also enhance readability of such an object, as it is clearer to the viewer what this SC is about.

I also think a 'normal' (i.e. not read affined) pool that spans 3 AZs can just as well be affined to any of those AZs without any performance degradation (or any other implications). So probably the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (i.e. the default mode should be "ReadAffinity").

Just one small comment: while different storage classes for AZ-local pools and inter-AZ pools are OK, I prefer that for inter-AZ read affined pools (we will have 3 of them) we have only a single SC, and the CSI driver selects the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs, but I understood there are other reasons to break this 1:1 relation and make it 1:n. If this can be done now it is preferred, as we don't want to stick with this limitation, which was introduced for the wrong reasons.

Contributor Author

> As for [1], yes we are. The plan is to do the following:
> a. add a new command to the localpool ceph plugin (see PR: ceph/ceph#31625) that creates pools with local read affinity.
> b. then have Rook utilize this when creating pools (by looking at the SC)
> c. and then this, i.e. add topology awareness to Ceph-CSI.

@dillaman The above should clarify why we need the domain takeover and giveback, for local read affinity.

> As for the topology aware SC, I think that:
> a. local only (single AZ) will have single failure domain semantics, so it is easy to distinguish.
> b. for other differences we can add a label, like pool.ceph.io/mode="ReadAffinity", which will also enhance readability of such an object, as it is clearer to the viewer what this SC is about.
>
> I also think a 'normal' (i.e. not read affined) pool that spans 3 AZs can just as well be affined to any of those AZs without any performance degradation (or any other implications). So probably the easiest solution is to treat all TopologyAware pools as if they require read affinity by default (i.e. the default mode should be "ReadAffinity").

@yuvalk I did not follow what the conclusion of default mode above means.

From the perspective of the design, the StorageClass (or otherwise) will call out the pools and their domain affinities. So for read affine pools this would be the primary AZ, for write affine pools these will be again a single AZ.

From the perspective of domain takeover on AZ failures, this is applicable for read affine pools only, and (in the absence of Rook) will have to be specified in the takeover map by the admin.

Contributor Author

> Just one small comment: while different storage classes for AZ-local pools and inter-AZ pools are OK, I prefer that for inter-AZ read affined pools (we will have 3 of them) we have only a single SC, and the CSI driver selects the correct pool according to the topology and the AZ the pod was initially scheduled to. I know that this breaks the 1:1 relation we have today between ceph pools and SCs, but I understood there are other reasons to break this 1:1 relation and make it 1:n. If this can be done now it is preferred, as we don't want to stick with this limitation, which was introduced for the wrong reasons.

The design is as per the above, with the StorageClass calling out the various AZ affined pools and CSI making the choice based on the request (which will contain the pod scheduling AZ information).

Member

@mmgaggle Jan 17, 2020

I think having distinct storage classes for the inter-zone with read-affinity and intra-zone PVs makes sense. This is basically how Google has approached the problem -

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: faster
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/ssd-pd

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: regionalpd-storageclass
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: regional-pd
  zones: europe-west1-b, europe-west1-c

https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/regional-pd

@ShyamsundarR
Contributor Author

Adding @yuvalk and @JoshSalomon for attention.

@humblec
Collaborator

humblec commented Jan 5, 2020

@ShyamsundarR a couple of observations about this proposal. While it's really good to explain more about "topology aware provisioning" in general or in CO specific terminology (the section before Ceph-CSI topology support design), isn't it better to put that in a separate reference/doc and then add a reference or point to the existing docs?

This feature is bound to the layers above/below CSI in many respects, that is, an orchestrator like Rook, the Ceph cluster configuration, etc. There is a small section in this doc about Rook, https://github.com/ceph/ceph-csi/pull/760/files#diff-133c3c07046f779e3502b6b5f3f1236eR189, however IMO it would be better to detail these bullets or populate the issues/design in Rook about the changes required. It would also be handy to have a general section, regardless of what the orchestrator on top is, listing what affects topology aware provisioning in CSI.

@humblec
Collaborator

humblec commented Jan 5, 2020

@travisn whenever you get some time, can you please review this proposal? thanks in advance


As noted earlier, creation of pools that have a domain affinity, is out of scope for the CSI plugins. However, when presented with a set of pools that have the said property, a `CreateVolume` request with `accessibility_requirements` specified, needs to choose the right pool to create the image in (in the case of RBD), or to redirect file's data to (in the case of CephFS).

The proposal is to add the specified pools and their domain affinity in the StorageClass (or related constructs in non-kubernetes COs), for the controller to choose from. The changes to the StorageClass are detailed in this issue [15].

Contributor Author

The changes to the StorageClass as specified in #559 will not work, as the StorageClass parameters are of type map[string]string. Thus passing a more complex structure in here will not work (thanks to @JohnStrunk for pointing it out).

Instead, a single key:value pair is proposed as below, which has a JSON structure in the value that can be parsed and used by the plugin.

New StorageClass parameter to detail pools and their topology is as follows,

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
parameters:
  topologyConstrainedPools: |
    [{"poolName":"pool0",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone0"}]},
     {"poolName":"pool1",
      "domainSegments":[
        {"domainLabel":"region","value":"vagrant"},
        {"domainLabel":"zone","value":"zone1"}]}
    ]
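
To make the intended use of this parameter concrete, here is a minimal Python sketch of parsing the value and picking a pool for a requested topology. It is an illustration, not the plugin's code: `find_pool` is a hypothetical name, the value is assumed to be well-formed JSON (balanced brackets, no surrounding quotes), and the requested segments are assumed to be keyed by the bare domain label (`region`, `zone`).

```python
import json

# The StorageClass parameter value, assumed to be well-formed JSON.
POOLS_PARAM = """
[{"poolName":"pool0",
  "domainSegments":[
    {"domainLabel":"region","value":"vagrant"},
    {"domainLabel":"zone","value":"zone0"}]},
 {"poolName":"pool1",
  "domainSegments":[
    {"domainLabel":"region","value":"vagrant"},
    {"domainLabel":"zone","value":"zone1"}]}
]
"""


def find_pool(pools_param, requested):
    """Return the first pool whose every domain segment matches `requested`."""
    pools = json.loads(pools_param)
    for pool in pools:
        segments = pool["domainSegments"]
        if all(requested.get(s["domainLabel"]) == s["value"] for s in segments):
            return pool["poolName"]
    return None  # no pool is affined to the requested domain


pool = find_pool(POOLS_PARAM, {"region": "vagrant", "zone": "zone1"})
print(pool)
```

Requiring every segment of a pool to match keeps a pool from being chosen when only the region agrees but the zone does not.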

Collaborator

how to handle erasure-coded pools in this case? are we going to have one more field for pool type?

Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

Contributor Author

> Erasure coded pools do not have affinity almost by definition (you need to read all the data from a single failure domain, meaning you need full replica in each failure domain)

For pools that are fully domain constrained (for read and write traffic to be restricted to within the domain), an erasure coded pool option/opportunity exists. (this is also coded up in the implementation).

For read affinity only requirements, an erasure coded pool will not work as stated, as by definition reads need to reach out to multiple OSDs, unlike replicated pools.


The proposal is to add a domain takeover map that is passed in via the CSI config map [10]. Each entry in the map would consist of a domain label and the target domain label that should take over the same. The nodeplugins, in addition to advertising their domains as read from the domain labels of the running node, will also advertise the domains that they need to take over.

Contributor Author

As we can have a mix of pools and StorageClasses, some which are read-affined (can be served in other domains when primary domain fails), or write-affined (can only be served in their primary domain), when this feature is implemented we would need 2 domain labels registered by the node plugins.

  • One which can float, for use with read-affined pools
  • Other that is pinned and cannot be taken over, for use with write-affined pools

The volumes created would leverage the domain labels in the StorageClass topologyConstrainedPools to distinguish the same.


Further, the CSI journal that is maintained as an OMap within a **single** pool needs to continue even when the volumes are allocated from different pools, as we need a single pool that can hold information about the various CSI volumes to maintain idempotent responses to various CSI RPCs.

Contributor Author

This requires some changes to where the CSI and image OMaps are maintained,

  • The CSI volumes OMap will belong to the global pool
  • The per-image OMap will belong in the pool where the image is stored
  • The OMap further needs to carry the parent/child pool names, for operations that only pass in the VolumeID and not the volume_context (like DeleteVolume and while processing parent images for VolumeSource)

Thus, the proposal is to add one further key to the csi.volumes. journal that stores the poolname the created image is stored in. Also, the csi.volume.UUID OMap will add a poolname key to point back to the pool where the CSI OMap is present.
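
A toy Python model of the cross-pool bookkeeping described above may help. Everything here is a stand-in: `FakeCluster` replaces Ceph with nested dicts, and the object and key names (`csi.volumes`, `request_name`, `poolname`) are hypothetical placeholders rather than the real journal layout. It only shows the journal entry recording the image pool, the per-image OMap pointing back at the journal pool, and a repeated request staying idempotent.

```python
import uuid

JOURNAL_OBJECT = "csi.volumes"  # hypothetical stand-in for the journal OMap name


class FakeCluster:
    """In-memory stand-in for Ceph: pool name -> {OMap object name -> key/value dict}."""

    def __init__(self, pool_names):
        self.pools = {name: {} for name in pool_names}

    def omap(self, pool, obj):
        # Create the OMap object on first access, like a lazily created object.
        return self.pools[pool].setdefault(obj, {})


def create_volume(cluster, journal_pool, image_pool, request_name):
    """Reserve a volume; a repeated request name returns the existing reservation."""
    journal = cluster.omap(journal_pool, JOURNAL_OBJECT)
    if request_name in journal:
        return journal[request_name]["uuid"]
    vol_uuid = str(uuid.uuid4())
    # The journal entry records which pool holds the image.
    journal[request_name] = {"uuid": vol_uuid, "poolname": image_pool}
    # The per-image OMap lives next to the image and points back at the journal pool.
    image_omap = cluster.omap(image_pool, f"csi.volume.{vol_uuid}")
    image_omap["request_name"] = request_name
    image_omap["poolname"] = journal_pool
    return vol_uuid


def image_pool_for(cluster, journal_pool, request_name):
    """Look up the image pool from the journal alone, as a delete path would need to."""
    return cluster.omap(journal_pool, JOURNAL_OBJECT)[request_name]["poolname"]


cluster = FakeCluster(["journal-pool", "pool0", "pool1"])
first = create_volume(cluster, "journal-pool", "pool0", "pvc-1234")
again = create_volume(cluster, "journal-pool", "pool1", "pvc-1234")
print(first == again, image_pool_for(cluster, "journal-pool", "pvc-1234"))
```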

Collaborator

As this is a hard requirement, we need to document it, and we have to mention that the `pool` parameter is only used for OMap CSI metadata storage.


When a failed domain recovers, the corresponding entry in the domain takeover map is deleted, resulting in a dynamic update to all nodeplugins regarding the giveback of the said domain.
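
The takeover and giveback behaviour can be sketched with a few lines of Python. This is only an illustration under assumptions: the map is modelled as a dict from failed domain to the domain taking it over, and `advertised_domains` is a hypothetical helper, not part of the proposal.

```python
def advertised_domains(own_domain, takeover_map):
    """Domains a nodeplugin advertises: its own, plus any it must take over.

    takeover_map maps a failed domain to the domain that takes it over.
    """
    taken_over = sorted(
        failed for failed, target in takeover_map.items() if target == own_domain
    )
    return [own_domain] + taken_over


# zone0 has failed and zone1 is asked to take it over.
takeover_map = {"zone0": "zone1"}
during_failure = advertised_domains("zone1", takeover_map)

# zone0 recovers: deleting its entry is the giveback.
del takeover_map["zone0"]
after_giveback = advertised_domains("zone1", takeover_map)
print(during_failure, after_giveback)
```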

Contributor Author

One caveat here is that, kubernetes does not remove the CSI nodeplugin labels placed on the node. Thus after a giveback, the node labels need to be manually deleted.

Overall, takeover and giveback will possibly be implemented in the next phase of this feature, based on analysis of the operations that need to be performed by an admin to make this feature useful.


The proposal hence is to also include a `pool` parameter in the StorageClass for RBD based volumes, where the CSI OMap journal would be maintained. This can be one of the subset of pools in the list of domain affinity based pools, or a more highly available cross-domain pool.

**NOTE:** CephFS does not need a special `pool` parameter as we store the CSI journal on the metadata pool backing the `fsname` that is passed, which is a singleton. Hence, the same will continue to be leveraged, even when ceohfs data needs to be redirected to different pools.

Collaborator

ceohfs/cephfs

@nixpanic added the component/docs and enhancement labels Apr 21, 2020
@nixpanic
Member

@ShyamsundarR what is the current state of this document? Topology support has been merged with #816...

@humblec
Collaborator

humblec commented May 29, 2020

> @ShyamsundarR what is the current state of this document? Topology support has been merged with #816...

@ShyamsundarR it would be appreciated if you could revisit this, so that we can get this merged.

@ShyamsundarR
Contributor Author

> @ShyamsundarR what is the current state of this document? Topology support has been merged with #816...
>
> @ShyamsundarR it would be appreciated if you could revisit this, so that we can get this merged.

It's in my backlog, but currently other priorities and tasks are more overwhelming, so I will revisit this when I get a breather.

@ceph-csi-bot ceph-csi-bot force-pushed the topology-csi-proposal branch from 8b34371 to 31b0766 Compare August 17, 2020 09:55
@stale

stale bot commented Oct 17, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in a month if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Oct 17, 2020
@stale

stale bot commented Nov 22, 2020

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@stale stale bot closed this Nov 22, 2020
@dimm0

dimm0 commented Mar 24, 2021

Is this still not implemented?

@ShyamsundarR
Contributor Author

> Is this still not implemented?

This PR implements the initial support for it. I thought I had added the required documentation, but it is only present in the YAML files, as comments on how to configure the feature. Here are the RBD references:

  • # If topology based provisioning is desired, configure required
    # node labels representing the nodes topology domain
    # and pass the label names below, for CSI to consume and advertise
    # its equivalent topology domain
    # - "--domainlabels=failure-domain/region,failure-domain/zone"
  • # If topology based provisioning is desired, delayed provisioning of
    # PV is required and is enabled using the following attribute
    # For further information read TODO<doc>
    # volumeBindingMode: WaitForFirstConsumer
  • # Add topology constrained pools configuration, if topology based pools
    # are setup, and topology constrained provisioning is required.
    # For further information read TODO<doc>
    # topologyConstrainedPools: |
    # [{"poolName":"pool0",
    # "dataPool":"ec-pool0" # optional, erasure-coded pool for data
    # "domainSegments":[
    # {"domainLabel":"region","value":"east"},
    # {"domainLabel":"zone","value":"zone1"}]},
    # {"poolName":"pool1",
    # "dataPool":"ec-pool1" # optional, erasure-coded pool for data
    # "domainSegments":[
    # {"domainLabel":"region","value":"east"},
    # {"domainLabel":"zone","value":"zone2"}]},
    # {"poolName":"pool2",
    # "dataPool":"ec-pool2" # optional, erasure-coded pool for data
    # "domainSegments":[
    # {"domainLabel":"region","value":"west"},
    # {"domainLabel":"zone","value":"zone1"}]}
    # ]
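
For illustration only, here is a minimal Python sketch of how a controller could pick a pool from a `topologyConstrainedPools` value, given the topology segments of the node that will consume the volume. It is not the ceph-csi implementation. The function name `find_pool` and the shape of the `topology` argument are hypothetical, and the JSON below is a cleaned-up copy of the commented example above (inline comments dropped, commas added) so that it parses.

```python
import json

# Cleaned-up copy of the commented example: the inline "# optional" comments
# are removed and commas added, so that the value is valid JSON.
TOPOLOGY_CONSTRAINED_POOLS = """
[{"poolName": "pool0", "dataPool": "ec-pool0",
  "domainSegments": [{"domainLabel": "region", "value": "east"},
                     {"domainLabel": "zone", "value": "zone1"}]},
 {"poolName": "pool1", "dataPool": "ec-pool1",
  "domainSegments": [{"domainLabel": "region", "value": "east"},
                     {"domainLabel": "zone", "value": "zone2"}]},
 {"poolName": "pool2", "dataPool": "ec-pool2",
  "domainSegments": [{"domainLabel": "region", "value": "west"},
                     {"domainLabel": "zone", "value": "zone1"}]}]
"""


def find_pool(pools_json, topology):
    """Hypothetical helper: return (poolName, dataPool) for the first pool
    whose domain segments all match `topology`, or None if no pool matches.

    `topology` is assumed to be a dict such as {"region": "east", "zone": "zone1"}.
    """
    for pool in json.loads(pools_json):
        segments = pool["domainSegments"]
        # A pool without segments is skipped rather than treated as a wildcard.
        if segments and all(
            topology.get(seg["domainLabel"]) == seg["value"] for seg in segments
        ):
            # dataPool is optional in the configuration, hence .get().
            return pool["poolName"], pool.get("dataPool")
    return None
```

With the example value, a node in region `east`, zone `zone2` would resolve to `pool1` with data pool `ec-pool1`; a node whose segments match no entry yields `None`, which a caller would have to treat as a provisioning failure.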

@obnoxxx

obnoxxx commented Mar 29, 2021

@ShyamsundarR - shouldn't this stale PR for the design doc be reopened and progressed towards being merged?

@ShyamsundarR
Contributor Author

> @ShyamsundarR - shouldn't this stale PR for the design doc be reopened and progressed towards being merged?

Ideally yes, and merged for reference at least in the future.

Currently this PR cannot be reopened because the master branch was renamed to devel, so a new PR would be required. I will add it to my backlog.

@jayasai470

jayasai470 commented Oct 5, 2021

any idea when this will be merged

@ShyamsundarR
Contributor Author

> any idea when this will be merged

This is the design proposal, which ideally would have been merged but was not. The implementation is present in the code. Are you looking to address specifics in the design proposal or the implementation?

@jayasai470

jayasai470 commented Oct 6, 2021

I was looking into the implementation, nvm found the doc https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#osd-topology

@travisn
Member

travisn commented Oct 6, 2021

@jayasai470 That doc describes how to set up the CRUSH hierarchy so that Ceph follows the topology layout for the different zones or other failure domains. In addition to setting that up, I understood that you also need read affinity for the volumes from a certain AZ, so you would still need this CSI feature.
@ShyamsundarR Where can we find the usage of this CSI feature for read affinity?

@mykaul
Contributor

mykaul commented Oct 7, 2021

@travisn - copy-paste of an answer I got from @idryomov that is relevant:
I believe it can be tested today with the help of mapOptions parameter on the storage class (https://github.com/ceph/ceph-csi/blob/devel/docs/deploy-rbd.md#configuration). Stand up a stretch cluster, make sure that the CRUSH map reflects that the cluster is split between e.g. datacenters "dc1" and "dc2" and set mapOptions to e.g. "read_from_replica=localize,crush_location=datacenter:dc1". Any PV created from this storage class would simulate a client that is local to "dc1" and reads from that client would go to OSDs located in "dc1" (whenever possible -- in some cases the client doesn't get to choose and has to direct the read to the primary OSD).
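
As a small illustration of the `mapOptions` value quoted above, the following Python sketch composes the string from a client CRUSH location. The helper name `build_map_options` is hypothetical. Joining several location pairs with `|` reflects my understanding of the `rbd map` option syntax and should be treated as an assumption; the single-pair case is the one taken from the comment.

```python
def build_map_options(crush_location, read_policy="localize"):
    """Hypothetical helper: build a mapOptions string for a StorageClass.

    `crush_location` is a dict of CRUSH bucket type to bucket name,
    for example {"datacenter": "dc1"}.
    """
    # Assumption: multiple key:value pairs are separated by "|".
    location = "|".join(f"{key}:{value}" for key, value in crush_location.items())
    return f"read_from_replica={read_policy},crush_location={location}"


print(build_map_options({"datacenter": "dc1"}))
```

For `{"datacenter": "dc1"}` this prints `read_from_replica=localize,crush_location=datacenter:dc1`, the value suggested in the comment.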
