Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw/sfs: linear performance degradation with the number of objects in buckets #203

Closed
giubacc opened this issue Nov 8, 2022 · 7 comments
Assignees
Labels
area/rgw-sfs RGW & SFS related kind/enhancement Change that positively impacts existing code
Milestone

Comments

@giubacc
Copy link

giubacc commented Nov 8, 2022

Bucket::_refresh_objects() is a very frequent called function in sfs and it is iterating all the objects held by a bucket.
This exposes sfs to an inherent linear complexity directly proportional with the number of objects held by buckets.
I made a test filling a bucket with 10k objects and the latency increases significantly:

Every time a list_buckets is called there is a magnitude latency of seconds:

2022-11-08T15:17:58.941+0100 7fffed2aa6c0 10 req 0 0.004000061s s3:create_bucket list_buckets: marker (, ), max=1000
2022-11-08T15:18:02.745+0100 7fffed2aa6c0 10 req 0 3.808057070s s3:create_bucket list_buckets: buckets={foo3=:foo3[foo3.1667916356826389689]),foo4=:foo4[foo4.1667916475603620198]),foo5=:foo5[foo5.1667916497598257137])}

We should discuss how to avoid the _refresh_objects call in the sfs::Bucket class.
More in general, _refresh_objects is an inherently slow, linear operation and we should see if we can improve this design.
This also applies to _refresh_buckets function (it is less exposed, but the problem is the same).

@giubacc giubacc added the area/rgw-sfs RGW & SFS related label Nov 8, 2022
@giubacc giubacc added this to S3GW Nov 8, 2022
@giubacc giubacc moved this to Backlog in S3GW Nov 8, 2022
@jecluis
Copy link
Contributor

jecluis commented Nov 9, 2022

This is likely one of the culprits for the performance numbers being so crappy right now.

One of the paths forward may simply be caching objects in memory instead of refreshing all the time, invalidating solely when objects are changed (or deleted, depending on versioning being enabled or not). This can also limit the number of objects in memory, and do a basic LRU in case there's a missing object in cache that needs to be promoted should the cache be full.

We may not even have to refresh if objects are added or changed; we may just need to add a new object or update an existing one. There's a lot of room to be smarter here.

@giubacc
Copy link
Author

giubacc commented Nov 9, 2022

I think the approach to define a maximum amount of memory loaded items per bucket is preferable (maybe something configurable).
Having all in memory definitely does not scale in many ways.
Absolutely agree to update memory objects only when needed.

@jecluis jecluis added the kind/enhancement Change that positively impacts existing code label Nov 9, 2022
@jhmarina jhmarina added this to the v0.9.0 milestone Nov 14, 2022
@jhmarina jhmarina moved this from Backlog to To Do v0.9.0 in S3GW Nov 14, 2022
@jhmarina jhmarina moved this from To Do v0.9.0 to Backlog in S3GW Nov 24, 2022
@giubacc giubacc modified the milestones: v0.9.0, v1.0.0 Nov 29, 2022
@jhmarina jhmarina moved this from Backlog to To Do v0.10.0 in S3GW Dec 2, 2022
@jhmarina jhmarina removed this from the v0.10.0 milestone Dec 5, 2022
@jhmarina jhmarina removed the status in S3GW Dec 5, 2022
@jhmarina
Copy link
Contributor

jhmarina commented Dec 5, 2022

Before we start working on implementing performance improvements, we want to set up a call and plan this properly.

@jhmarina jhmarina moved this to Blocked / On hold 🚧 in S3GW Dec 5, 2022
m-ildefons pushed a commit to m-ildefons/s3gw that referenced this issue Feb 6, 2023
build: add missing 'libcap-devel' to radosgw build
@irq0 irq0 self-assigned this Apr 6, 2023
@irq0
Copy link
Contributor

irq0 commented Apr 6, 2023

https://aquarist-labs.github.io/s3gw-perf-reports/reports/release_comprehensive_v0.14.0.html
supports the importance. The LIST benchmarks are either slow (> 10 seconds latency) or time out.

@irq0
Copy link
Contributor

irq0 commented Apr 6, 2023

Discussion moved on since December (last comment). Summary: aquarist-labs/ceph#124 refactored and removed the object cache. Plan is to rewrite list bucket as SQL queries. Right now it is basically a get all from database and filter the results. I'm working on it.

@irq0 irq0 added this to the v0.15.0 milestone Apr 6, 2023
@irq0 irq0 moved this from Blocked / On hold 🚧 to In Progress 🏗️ in S3GW Apr 6, 2023
@jhmarina jhmarina modified the milestones: v0.15.0, v0.16.0 Apr 25, 2023
@giubacc
Copy link
Author

giubacc commented May 11, 2023

@irq0 is this issue still valid?
Originally I was concerned that sfs code was linearly iterating all the objects inside a bucket, is that still the case?
Can we think to close this issue?

@irq0
Copy link
Contributor

irq0 commented May 12, 2023

It's not solved, but enough changed to warrant a new issue. Closing in favor of #509

@irq0 irq0 closed this as completed May 12, 2023
@github-project-automation github-project-automation bot moved this from In Progress 🏗️ to Done in S3GW May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/rgw-sfs RGW & SFS related kind/enhancement Change that positively impacts existing code
Projects
None yet
Development

No branches or pull requests

4 participants