
Limits for RBD create PVC from snapshot #1098

Closed
dillaman opened this issue May 22, 2020 · 16 comments · Fixed by #1195
Labels
component/rbd Issues related to RBD
Comments

@dillaman

dillaman commented May 22, 2020

Describe the bug

Prior to the GA release of snapshot support, we need to ensure that the ceph-csi driver enforces some sane limits on snapshot creation and creating PVCs from snapshots.

RBD snapshot limits

  • no more than 510 RBD snapshots per RBD image

RBD clone limits

  • no more than 15 images in clone chain
  • no more than TBD total clones from a parent image

The ceph-csi driver can attempt to hide these internal limits by flattening child images as necessary to provide more "space" for future snapshots / cloned images.
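
For illustration, a minimal Go sketch of the pre-flight checks these limits imply; the constant and function names are hypothetical, not taken from ceph-csi, and the exact clone-count limit (TBD above) is left out.

// Hypothetical limit constants and guards; illustrative only.
package rbdlimits

import "fmt"

const (
	// krbd can only track a limited number of snapshots per image,
	// including snapshots that currently sit in the trash.
	maxSnapshotsPerImage = 510

	// Maximum depth of a parent -> child clone chain before a
	// flatten is required.
	maxCloneDepth = 15
)

// checkSnapshotLimit rejects a new snapshot when the per-image limit
// would be exceeded. currentCount must already include trashed
// snapshots (i.e. what `rbd snap ls --all` reports).
func checkSnapshotLimit(currentCount uint) error {
	if currentCount+1 > maxSnapshotsPerImage {
		return fmt.Errorf("image already has %d snapshots (limit %d)",
			currentCount, maxSnapshotsPerImage)
	}
	return nil
}

// checkCloneDepth rejects a new clone when the chain is already at the
// depth limit and the image should be flattened first.
func checkCloneDepth(depth uint) error {
	if depth >= maxCloneDepth {
		return fmt.Errorf("clone chain depth %d has reached limit %d; flatten first",
			depth, maxCloneDepth)
	}
	return nil
}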

@nixpanic nixpanic added the component/rbd Issues related to RBD label May 25, 2020
@Madhu-1
Collaborator

Madhu-1 commented Jun 1, 2020

@dillaman isn't this covered by flattening the image based on the hard and soft limits? Do we need anything else?

@dillaman
Author

dillaman commented Jun 1, 2020

The flatten logic only protects against the "no more than 15 images in the chain" test. For the 511 snapshots per image, that's a new test that could be solved by flattening older k8s snapshots off the RBD image so that the snapshots can be removed. Same w/ the no more than X total clones.

@ShyamsundarR
Contributor

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

  • The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or
  • Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

If the former, CSI should handle this as some form of allowed active snapshot requests in flight per PV/image, and report back a temporary resource exhaustion for a request that breaches the stated limit.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.

The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

@dillaman
Author

dillaman commented Jun 1, 2020

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

* The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or

* Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

Neither -- it's the total number of RBD snapshots on an RBD image (i.e. rbd snap ls --all). This total includes any and all snapshots that are in the trash. Remember also that k8s snapshots can be deleted after you create a new PVC from it, so even though the RBD image associated with the k8s snapshot is moved to the trash, it's still linked to the parent.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.

The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

If you prefer the older ones, you could have a soft vs hard limit to potentially kick off a background flatten task on the older ones, to avoid the need to pause new k8s snapshot creation.
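
A rough sketch of that soft/hard-limit idea, assuming hypothetical threshold values and a hypothetical flattenOldest callback (this is not ceph-csi code):

package snaplimits

import "errors"

const (
	softSnapshotLimit = 450 // assumed value: start background flattening here
	hardSnapshotLimit = 510 // never let the per-image count reach this
)

var errSnapshotLimit = errors.New("snapshot limit reached, retry later")

// reconcileSnapshotCount kicks off a background flatten of the oldest
// snapshot-backed clone once the soft limit is crossed, and refuses new
// snapshots at the hard limit. flattenOldest is assumed to schedule an
// async flatten so the trashed parent snapshot can eventually be reaped.
func reconcileSnapshotCount(count uint, flattenOldest func()) error {
	if count >= hardSnapshotLimit {
		return errSnapshotLimit
	}
	if count >= softSnapshotLimit {
		go flattenOldest() // do not pause new snapshot creation
	}
	return nil
}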

@ShyamsundarR
Contributor

@dillaman as CSI RBD snapshots will be RBD clones of the parent image, and the intermediate RBD snapshot from which these are cloned would be deleted once the clone is complete, does the 511 limitation pertain to,

* The number of intermediate RBD snapshots that are in-flight before the RBD clone is done, or

* Total number of RBD clones (that represent CSI-Snapshots) per RBD image?

Neither -- it's the total number of RBD snapshots on an RBD image (i.e. rbd snap ls --all). This total includes any and all snapshots that are in the trash. Remember also that k8s snapshots can be deleted after you create a new PVC from it, so even though the RBD image associated with the k8s snapshot is moved to the trash, it's still linked to the parent.

Yikes! understood. We (CSI) were just using the trash as a convenient parking lot I guess, so those count as well.

If the latter, then CSI should flatten future snapshots (or maybe deny them?), as past CSI-Snapshots may be in use for other clone-from-snapshot CSI-Create requests.
The X total clones limit can follow a similar logic to the latter case above. Is there a reason to prefer flattening older snaps or clones rather than newer ones?

If you prefer the older ones, you could have a soft vs hard limit to potentially kick off a background flatten task on the older ones, to avoid the need to pause new k8s snapshot creation.

We may need to start flattening sooner rather than later then, with soft/hard limits as you state. @Madhu-1 we may need to rethink this based on our earlier conversation.

@dillaman
Author

dillaman commented Jun 1, 2020

Yikes! understood. We (CSI) were just using the trash as a convenient parking lot I guess, so those count as well.

Correct. CephFS is going to have its own set of limits, so we should probably also start documenting that somewhere as well (hopefully with a similar end result, where the CSI can hide it).

@humblec humblec added this to the release-3.0.0 milestone Jun 3, 2020
@humblec
Collaborator

humblec commented Jun 3, 2020

@dillaman Having limits such as the ones mentioned in this issue is good. At the same time, we have to see how this is going to affect our scalability requirements. I believe this is going to be a bit of trial and error, adopting what we can do best from our end.

As a side question: is there an option available in rbd to set or trigger deletion of entries in the backend trash/purge cache?

@dillaman
Author

dillaman commented Jun 3, 2020

At the same time, we have to see how this is going to affect our scalability requirements.

Scalability means nothing if ceph-csi breaks Ceph.

@ShyamsundarR
Contributor

@dillaman do snapshots in trash (or otherwise) of parents in a clone chain count towards the 511 total snapshots limit?

IOW, assuming I did the following,

  • create image -> snap image -> clone snap -> rm snap -> repeat for one more clone,

I would end up with 1 snap in trash for the first image, and one snap in trash for the next cloned image. Now, can the image at the tail of this chain (test2) have 511 snapshots, or only 509 (511 - 2)? IOW, is the limit per RBD image, or does it cover snapshots of images in the clone chain as well? (please say it is the former :) )

FWIW, I created 513 snapshots of the test2 image in the example below and it worked, but I did not mount the image. The limit, I assume, is due to the kernel mounter?
NOTE: The 513-snapshot experiment was not to see whether this is a hard limit, but just to inspect system behavior at these limits

# rbd create replicapool/test --size 1G

# rbd snap create replicapool/test@snap

# rbd snap ls --all replicapool/test
SNAPID NAME SIZE  PROTECTED TIMESTAMP                NAMESPACE 
    10 snap 1 GiB           Thu Jun  4 00:59:30 2020 user      

# rbd clone --rbd-default-clone-format 2 --image-feature layering,deep-flatten replicapool/test@snap replicapool/test1

# ./tbox rbd snap rm replicapool/test@snap

# rbd snap ls --all replicapool/test
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE    
    10 ee81f209-c3d8-414c-a710-7584cc47068d 1 GiB           Thu Jun  4 00:59:30 2020 trash (snap) 

# rbd snap create replicapool/test1@snap1

# rbd clone --rbd-default-clone-format 2 --image-feature layering,deep-flatten replicapool/test1@snap1 replicapool/test2

# rbd snap rm replicapool/test1@snap1

# rbd snap ls --all replicapool/test
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE    
    10 ee81f209-c3d8-414c-a710-7584cc47068d 1 GiB           Thu Jun  4 00:59:30 2020 trash (snap) 

# rbd snap ls --all replicapool/test1
SNAPID NAME                                 SIZE  PROTECTED TIMESTAMP                NAMESPACE     
    11 ce5050d7-d029-4767-bc9a-9fb1e0786c7e 1 GiB           Thu Jun  4 01:06:14 2020 trash (snap1) 

@dillaman
Author

dillaman commented Jun 4, 2020

The limit is in krbd (and kernel CephFS) since they only allocate a single 4KiB page to hold all the snapshot IDs for an image / file.

The snapshot limit only counts for the image where the snapshot actually exists -- it does not apply to the total number of snapshots in the entire grandparent-parent-child hierarchy.

@ShyamsundarR
Contributor

Laying out the implementation steps as discussed with @Madhu-1 (and based on various comments and discussions in the snapshot PRs and in this issue from @dillaman).

Ensuring clone depth is in check

N: Configured hard limit for image depth
NOTE: A soft limit is also configured; when it is reached, a flatten is started as an async task. (A minimal sketch of the resulting depth checks follows this list.)

  • CSI-Snapshot of a volume:

    • Ensure any created CSI-snapshot has maximum depth <= N-2
    • This ensures that any CSI-clone from this CSI-snapshot will be of depth N-1 and can be created without requiring a flatten during the clone operation
    • Future CSI-snapshots of the CSI-clone, which is at depth N-1, would create and flatten the CSI-snapshot before reporting ready-to-use as true
      • Make this an async task as well, such that the volume is not locked for the duration of the flatten
  • CSI-Clone of a volume:

    • In this case it is not possible to retain a scheme similar to the CSI-Snapshot case
    • Assuming we desire a CSI-clone whose depth is always < N, this means the volume we clone from should be at depth < N-2 (N-2 would be the depth of the intermediate RBD-clone for the operation)
      • Source: N-3 (depth) -> Intermediate: N-2 -> Clone: N-1
    • Now, CSI-cloning from the above series would start the CSI-clone from an image at depth N-1, hence the intermediate RBD-clone would need flattening, and a non-final error would be returned for the CreateVolume call
      • Source: N-1 -> Intermediate: N, hence it requires flattening before creating the actual CSI-clone
    • CSI-Clone of a volume should flatten the intermediate snapshot if it is at depth >= N
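
A minimal sketch of the depth checks described above, where hardLimit corresponds to N; the function names are hypothetical, not ceph-csi's actual implementation.

package clonedepth

// snapshotNeedsFlatten reports whether a newly created CSI-snapshot
// (an RBD clone at sourceDepth+1) must be flattened before it is
// reported ready-to-use, so that a later CSI-clone from it stays
// below the hard limit N.
func snapshotNeedsFlatten(sourceDepth, hardLimit uint) bool {
	return sourceDepth+1 > hardLimit-2
}

// cloneIntermediateNeedsFlatten reports whether the intermediate
// RBD-clone used for a CSI volume clone (at sourceDepth+1) has reached
// the hard limit and must be flattened before the final clone is made,
// returning a non-final error to the caller in the meantime.
func cloneIntermediateNeedsFlatten(sourceDepth, hardLimit uint) bool {
	return sourceDepth+1 >= hardLimit
}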

Ensuring total snapshot count is in check

K: Configured maximum number of snapshots for an image (including ones in trash)
NOTE: Flattening the CSI-Snapshot ensures that the intermediate snapshot in the parent image's trash is garbage collected, thus reducing the total snapshot count. The intermediate snapshot image is the one flattened, as it is never mapped and in use by clients. (A sketch of the resulting count checks follows this list.)

  • CSI-Snapshot of a volume:

    • Start flattening oldest snapshot when soft limit is reached
    • Based on a snapshot schedule (or even otherwise), the oldest snapshots have a better probability of surviving longer, and hence are better candidates for flattening than recent ones
      • e.g. an hourly+daily+weekly schedule of snapshots will soon end up with the weekly snapshots as the oldest surviving ones, followed by the dailies and so on. The weeklies are also the ones that would be retained longest, hence flattening the oldest would be more useful
      • OTOH, an hourly schedule would garbage collect the oldest first, hence flattening the tail may not be as useful, as the tail entries would be the first to be pruned
    • If current snapshot would breach the hard limit, return an error
      • As the very act of creating the snapshot would breach the hard limit and cause issues, this is a resource exhaustion error, and the same (RESOURCE_EXHAUSTED) can be returned for the CreateSnapshot call
      • This is detected first, and hence there will be no clone/snapshot to garbage collect
  • CSI-Clone of a volume:

    • If the image to clone will breach hard limit for total RBD-snapshots, return a RESOURCE_EXHAUSTED error
      • As the act of creating the initial RBD-snapshot itself would be an error
      • This is detected first, and hence there will be no clone/snapshot to garbage collect
    • Trigger a flatten of the intermediate snapshot image at the soft limit, as required
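
A sketch of the count checks above, using the standard gRPC status/codes packages; the function and its parameters are hypothetical, and this is not the actual ceph-csi implementation.

package snapcount

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// guardSnapshotCount runs before the initial RBD snapshot for a
// CreateSnapshot or CSI-clone request is taken. hard corresponds to K
// (including trashed snapshots); soft is where background flattening
// of the oldest entries starts. flattenOldestAsync is an assumed
// helper that schedules the flatten without blocking this call.
func guardSnapshotCount(current, soft, hard uint, flattenOldestAsync func()) error {
	if current+1 > hard {
		// Nothing has been created yet, so there is nothing to
		// garbage collect; report resource exhaustion to the caller.
		return status.Error(codes.ResourceExhausted,
			"RBD image has reached its snapshot limit")
	}
	if current >= soft {
		flattenOldestAsync()
	}
	return nil
}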

@dillaman
Author

dillaman commented Jun 5, 2020

If current snapshot would breach the hard limit, return an error

Should it return an error or would it be better to just return a "PENDING" error code so that it's retried periodically while a background flatten is taking place?

@ShyamsundarR
Contributor

If current snapshot would breach the hard limit, return an error

Should it return an error or would it be better to just return a "PENDING" error code so that it's retried periodically while a background flatten is taking place?

Thinking along the lines that a snapshot should be as instantaneous as possible (with possible future application and fs quiesce in play in the overall workflow), an error seems better, as we would not have started any work to create the snapshot.

The case where we return PENDING, for clones or while flattening a snapshot image, is safer, as the snapshot is already taken and we are post-processing it.

In this corner case, we are yet to take one, hence erroring out is acceptable. In an "ideal" scenario, the resource-exhausted error should be handled gracefully by the callers.
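
For illustration, the two outcomes could map onto gRPC errors roughly as below; the exact codes are an assumption, not necessarily what ceph-csi returns.

package snaperrors

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Before any snapshot has been taken: fail fast, nothing to clean up.
var errLimitReached = status.Error(codes.ResourceExhausted,
	"snapshot limit reached; delete or flatten existing snapshots")

// After the snapshot is taken but a flatten is still running: return a
// retryable, non-final error so the request is retried until the
// background flatten completes.
var errFlattenInProgress = status.Error(codes.Aborted,
	"image flatten in progress; retry the request")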

@dillaman
Author

dillaman commented Jun 5, 2020

In this corner case, we are yet to take one, hence erroring out is acceptable. In an "ideal" scenario, the resource-exhausted error should be handled gracefully by the callers.

Ack -- worst case the logic can be tweaked if it causes UX concerns down the road.

@ShyamsundarR
Contributor

@dillaman The MAX snapshot per image limit should be 510 from the kernel sources. Maybe this is more bleeding edge than versions we are considering, where it can be 511?

@dillaman
Author

dillaman commented Jun 6, 2020

@dillaman The MAX snapshot per image limit should be 510 from the kernel sources. Maybe this is more bleeding edge than versions we are considering, where it can be 511?

Nope -- you are correct @ 510. Of course, I'd imagine we would want the CSI hard limit well below that (i.e. 5-10% reserve minimum). There is a large performance hit for small IOs when you have hundreds of snapshots, since each write carries along that full list of snapshots again (i.e. a 512 byte write might have 4KiB of additional overhead just listing the snapshots).
