
Limits for RBD create PVC from snapshot #1098

Closed
dillaman opened this issue May 22, 2020 · 16 comments · Fixed by #1195
Labels
component/rbd Issues related to RBD
Comments

@dillaman

dillaman commented May 22, 2020

Describe the bug

Prior to the GA release of snapshot support, we need to ensure that the ceph-csi driver enforces some sane limits on snapshot creation and creating PVCs from snapshots.

RBD snapshot limits

  • no more than 510 RBD snapshots per RBD image

RBD clone limits

  • no more than 15 images in clone chain
  • no more than TBD total clones from a parent image

The ceph-csi driver can attempt to hide these internal limits by flattening child images as necessary to provide more "space" for future snapshots / cloned images.
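
For illustration, a minimal Go sketch of the pre-flight checks these limits imply; the constant and function names are hypothetical, not taken from ceph-csi, and the exact clone-count limit (TBD above) is left out.

// Hypothetical limit constants and guards; illustrative only.
package rbdlimits

import "fmt"

const (
	// krbd can only track a limited number of snapshots per image,
	// including snapshots that currently sit in the trash.
	maxSnapshotsPerImage = 510

	// Maximum depth of a parent -> child clone chain before a
	// flatten is required.
	maxCloneDepth = 15
)

// checkSnapshotLimit rejects a new snapshot when the per-image limit
// would be exceeded. currentCount must already include trashed
// snapshots (i.e. what `rbd snap ls --all` reports).
func checkSnapshotLimit(currentCount uint) error {
	if currentCount+1 > maxSnapshotsPerImage {
		return fmt.Errorf("image already has %d snapshots (limit %d)",
			currentCount, maxSnapshotsPerImage)
	}
	return nil
}

// checkCloneDepth rejects a new clone when the chain is already at the
// depth limit and the image should be flattened first.
func checkCloneDepth(depth uint) error {
	if depth >= maxCloneDepth {
		return fmt.Errorf("clone chain depth %d has reached limit %d; flatten first",
			depth, maxCloneDepth)
	}
	return nil
}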

@nixpanic nixpanic added the component/rbd Issues related to RBD label May 25, 2020
@Madhu-1
Collaborator

Madhu-1 commented Jun 1, 2020

@dillaman isn't this covered by flattening the image based on the hard and soft limits? Do we need anything else?

@dillaman
Author

dillaman commented Jun 1, 2020

The flatten logic only protects against the "no more than 15 images in the chain" test. For the 511 snapshots per image, that's a new test that could be solved by flattening older k8s snapshots off the RBD image so that the snapshots can be removed. Same w/ the no more than X total clones.

@ShyamsundarR
Contributor

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

  • The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or
  • Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

If the former, CSI should handle this as some form of allowed active snapshot requests in flight per PV/image, and report back a temporary resource exhaustion for a request that breaches the stated limit.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.

The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

@dillaman
Author

dillaman commented Jun 1, 2020

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

* The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or

* Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

Neither -- it's the total number of RBD snapshots on an RBD image (i.e. rbd snap ls --all). This total includes any and all snapshots that are in the trash. Remember also that k8s snapshots can be deleted after you create a new PVC from it, so even though the RBD image associated with the k8s snapshot is moved to the trash, it's still linked to the parent.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.

The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

If you prefer the older ones, you could have a soft vs hard limit to potentially kick off a background flatten task on the older ones, to avoid the need to pause new k8s snapshot creation.
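
A rough sketch of that soft/hard-limit idea, assuming hypothetical threshold values and a hypothetical flattenOldest callback (this is not ceph-csi code):

package snaplimits

import "errors"

const (
	softSnapshotLimit = 450 // assumed value: start background flattening here
	hardSnapshotLimit = 510 // never let the per-image count reach this
)

var errSnapshotLimit = errors.New("snapshot limit reached, retry later")

// reconcileSnapshotCount kicks off a background flatten of the oldest
// snapshot-backed clone once the soft limit is crossed, and refuses new
// snapshots at the hard limit. flattenOldest is assumed to schedule an
// async flatten so the trashed parent snapshot can eventually be reaped.
func reconcileSnapshotCount(count uint, flattenOldest func()) error {
	if count >= hardSnapshotLimit {
		return errSnapshotLimit
	}
	if count >= softSnapshotLimit {
		go flattenOldest() // do not pause new snapshot creation
	}
	return nil
}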

@ShyamsundarR
Contributor

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

* The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or

* Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

Neither -- it's the total number of RBD snapshots on an RBD image (i.e. rbd snap ls --all). This total includes any and all snapshots that are in the trash. Remember also that k8s snapshots can be deleted after you create a new PVC from it, so even though the RBD image associated with the k8s snapshot is moved to the trash, it's still linked to the parent.

Yikes! understood. We (CSI) were just using the trash as a convenient parking lot I guess, so those count as well.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.
The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

If you prefer the older ones, you could have a soft vs hard limit to potentially kick off a background flatten task on the older ones, to avoid the need to pause new k8s snapshot creation.

We may need to start flattening sooner rather than later then, with soft/hard limits as you state. @Madhu-1 we may need to rethink this based on our earlier conversation.

@dillaman
Author

dillaman commented Jun 1, 2020

Yikes! understood. We (CSI) were just using the trash as a convenient parking lot I guess, so those count as well.

Correct. CephFS is going to have its own set of limits, so we should probably also start documenting that somewhere as well (hopefully with a similar end result, where the CSI can hide it).

@humblec humblec added this to the release-3.0.0 milestone Jun 3, 2020
@humblec
Collaborator

humblec commented Jun 3, 2020

@dillaman Having limits such as the ones mentioned in this issue is good. At the same time, we have to see how this is going to affect our scalability requirements. I believe this is going to be a bit of trial and error, adopting what we can do best from our end.

As a side question: is there an option available in rbd to set or trigger deletion of entries in the backend trash/purge cache?

@dillaman
Author

dillaman commented Jun 3, 2020

At the same time, we have to see how this is going to affect our scalability requirements.

Scalability means nothing if ceph-csi breaks Ceph.

@ShyamsundarR
Contributor

@dillaman do snapshots in trash (or otherwise) of parents in a clone chain count towards the 511 total snapshots limit?

IOW, assuming I did the following,

  • create image -> snap image -> clone snap -> rm snap -> repeat for one more clone,

I would end up with 1 snap in trash for the first image, and one snap in trash for the next cloned image. Now, can the image at the tail of this chain (test2) have 511 snapshots, or only 509 (511 - 2)? IOW, is the limit per RBD image, or does it cover snapshots of images in the clone chain as well? (please say it is the former :) )

FWIW, I created 513 snapshots of the test2 image in the example below and it worked, but I did not mount the image. The limit, I assume, is due to the kernel mounter?
NOTE: The 513-snapshot experiment was not to see whether this is a hard limit, but just to inspect system behavior at these limits

# rbd create replicapool/test --size 1G

# rbd snap create replicapool/test@snap

# rbd snap ls --all replicapool/test
SNAPID NAME SIZE  PROTECTED TIMESTAMP                NAMESPACE 
    10 snap 1 GiB           Thu Jun  4 00:59:30 2020 user      

# rbd clone --rbd-default-clone-format 2 --image-feature layering,deep-flatten replicapool/test@snap replicapool/test1

# ./tbox rbd snap rm replicapool/test@snap

# rbd snap ls --all replicapool/test
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE    
    10 ee81f209-c3d8-414c-a710-7584cc47068d 1 GiB           Thu Jun  4 00:59:30 2020 trash (snap) 

# rbd snap create replicapool/test1@snap1

# rbd clone --rbd-default-clone-format 2 --image-feature layering,deep-flatten replicapool/test1@snap1 replicapool/test2

# rbd snap rm replicapool/test1@snap1

# rbd snap ls --all replicapool/test
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE    
    10 ee81f209-c3d8-414c-a710-7584cc47068d 1 GiB           Thu Jun  4 00:59:30 2020 trash (snap) 

# rbd snap ls --all replicapool/test1
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE     
    11 ce5050d7-d029-4767-bc9a-9fb1e0786c7e 1 GiB           Thu Jun  4 01:06:14 2020 trash (snap1) 

@dillaman
Author

dillaman commented Jun 4, 2020

The limit is in krbd (and kernel CephFS) since they only allocate a single 4KiB page to hold all the snapshot IDs for an image / file.

The snapshot limit only counts for the image where the snapshot actually exists -- it does not apply to the total number of snapshots in the entire grandparent-parent-child hierarchy.

@ShyamsundarR
Contributor

Laying out the implementation steps as discussed with @Madhu-1 (and based on various comments and discussions in the snapshot PRs and in this issue from @dillaman).

Ensuring clone depth is in check

N: Configured hard limit for image depth
NOTE: A soft limit is also configured; when it is reached, a flatten is started as an async task. (A minimal sketch of the resulting depth checks follows this list.)

  • CSI-Snapshot of a volume:

    • Ensure any created CSI-snapshot has maximum depth <= N-2
    • This ensures that any CSI-clone from this CSI-snapshot will be of depth N-1 and can be created without requiring a flatten during the clone operation
    • Future CSI-snapshots of the CSI-clone, which is at depth N-1, would create and flatten the CSI-snapshot before reporting ready-to-use as true
      • Make this an async task as well, such that the volume is not locked for the duration of the flatten
  • CSI-Clone of a volume:

    • In this case it is not possible to retain a scheme similar to the CSI-Snapshot case
    • Assuming we desire a CSI-clone whose depth is always < N, this means the volume we clone from should be at depth < N-2 (N-2 would be the depth of the intermediate RBD-clone for the operation)
      • Source: N-3 (depth) -> Intermediate: N-2 -> Clone: N-1
    • Now, CSI-cloning from the above series would start the CSI-clone from an image at depth N-1, hence the intermediate RBD-clone would need flattening, and a non-final error would be returned for the CreateVolume call
      • Source: N-1 -> Intermediate: N, hence it requires flattening before creating the actual CSI-clone
    • CSI-Clone of a volume should flatten the intermediate snapshot if it is at depth >= N
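
A minimal sketch of the depth checks described above, where hardLimit corresponds to N; the function names are hypothetical, not ceph-csi's actual implementation.

package clonedepth

// snapshotNeedsFlatten reports whether a newly created CSI-snapshot
// (an RBD clone at sourceDepth+1) must be flattened before it is
// reported ready-to-use, so that a later CSI-clone from it stays
// below the hard limit N.
func snapshotNeedsFlatten(sourceDepth, hardLimit uint) bool {
	return sourceDepth+1 > hardLimit-2
}

// cloneIntermediateNeedsFlatten reports whether the intermediate
// RBD-clone used for a CSI volume clone (at sourceDepth+1) has reached
// the hard limit and must be flattened before the final clone is made,
// returning a non-final error to the caller in the meantime.
func cloneIntermediateNeedsFlatten(sourceDepth, hardLimit uint) bool {
	return sourceDepth+1 >= hardLimit
}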

Ensuring total snapshot count is in check

K: Configured maximum number of snapshots for an image (including ones in trash)
NOTE: Flattening the CSI-Snapshot ensures that the intermediate snapshot in the parent image's trash is garbage collected, thus reducing the total snapshot count. The intermediate snapshot image is the one flattened, as it is never mapped and in use by clients. (A sketch of the resulting count checks follows this list.)

  • CSI-Snapshot of a volume:

    • Start flattening oldest snapshot when soft limit is reached
    • Based on a snapshot schedule (or even otherwise), the oldest snapshots have a better probability of surviving longer, and hence are better candidates for flattening than recent ones
      • e.g. an hourly+daily+weekly schedule of snapshots will soon end up with the weekly snapshots as the oldest surviving ones, followed by the dailies and so on. The weeklies are also the ones that would be retained longest, hence flattening the oldest would be more useful
      • OTOH, an hourly schedule would garbage collect the oldest first, hence flattening the tail may not be as useful, as the tail entries would be the first to be pruned
    • If current snapshot would breach the hard limit, return an error
      • As the very act of creating the snapshot would breach the hard limit and cause issues, this is a resource exhaustion error, and the same (RESOURCE_EXHAUSTED) can be returned for the CreateSnapshot call
      • This is detected first, and hence there will be no clone/snapshot to garbage collect
  • CSI-Clone of a volume:

    • If the image to clone will breach hard limit for total RBD-snapshots, return a RESOURCE_EXHAUSTED error
      • As the act of creating the initial RBD-snapshot itself would be an error
      • This is detected first, and hence there will be no clone/snapshot to garbage collect
    • Trigger a flatten of the intermediate snapshot image at the soft limit, as required
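
A sketch of the count checks above, using the standard gRPC status/codes packages; the function and its parameters are hypothetical, and this is not the actual ceph-csi implementation.

package snapcount

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// guardSnapshotCount runs before the initial RBD snapshot for a
// CreateSnapshot or CSI-clone request is taken. hard corresponds to K
// (including trashed snapshots); soft is where background flattening
// of the oldest entries starts. flattenOldestAsync is an assumed
// helper that schedules the flatten without blocking this call.
func guardSnapshotCount(current, soft, hard uint, flattenOldestAsync func()) error {
	if current+1 > hard {
		// Nothing has been created yet, so there is nothing to
		// garbage collect; report resource exhaustion to the caller.
		return status.Error(codes.ResourceExhausted,
			"RBD image has reached its snapshot limit")
	}
	if current >= soft {
		flattenOldestAsync()
	}
	return nil
}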

@dillaman
Author

dillaman commented Jun 5, 2020

If current snapshot would breach the hard limit, return an error

Should it return an error or would it be better to just return a "PENDING" error code so that it's retried periodically while a background flatten is taking place?

@ShyamsundarR
Contributor

If current snapshot would breach the hard limit, return an error

Should it return an error or would it be better to just return a "PENDING" error code so that it's retried periodically while a background flatten is taking place?

Thinking along the lines that a snapshot should be as instantaneous as possible (with possible future application and fs quiesce in play in the overall workflow), an error seems better, as we would not have started any work to create the snapshot.

The case where we return PENDING, for clones or while flattening a snapshot image, is safer, as the snapshot is already taken and we are post-processing it.

In this corner case, we are yet to take one, hence erroring out is acceptable. In an "ideal" scenario, the resource-exhausted error should be handled gracefully by the callers.
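
For illustration, the two outcomes could map onto gRPC errors roughly as below; the exact codes are an assumption, not necessarily what ceph-csi returns.

package snaperrors

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Before any snapshot has been taken: fail fast, nothing to clean up.
var errLimitReached = status.Error(codes.ResourceExhausted,
	"snapshot limit reached; delete or flatten existing snapshots")

// After the snapshot is taken but a flatten is still running: return a
// retryable, non-final error so the request is retried until the
// background flatten completes.
var errFlattenInProgress = status.Error(codes.Aborted,
	"image flatten in progress; retry the request")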

@dillaman
Author

dillaman commented Jun 5, 2020

In this corner case, we are yet to take one, hence erroring out is acceptable. In an "ideal" scenario, the resource-exhausted error should be handled gracefully by the callers.

Ack -- worst case the logic can be tweaked if it causes UX concerns down the road.

@ShyamsundarR
Contributor

@dillaman The MAX snapshot per image limit should be 510 from the kernel sources. Maybe this is more bleeding edge than versions we are considering, where it can be 511?

@dillaman
Author

dillaman commented Jun 6, 2020

@dillaman The MAX snapshot per image limit should be 510 from the kernel sources. Maybe this is more bleeding edge than versions we are considering, where it can be 511?

Nope -- you are correct @ 510. Of course, I'd imagine we would want the CSI hard limit well below that (i.e. 5-10% reserve minimum). There is a large performance hit for small IOs when you have hundreds of snapshots, since each write carries along that full list of snapshots again (i.e. a 512 byte write might have 4KiB of additional overhead just listing the snapshots).
