Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix catalog plan for performance #138

Merged
merged 3 commits into from
Jul 22, 2022
Merged

Fix catalog plan for performance #138

merged 3 commits into from
Jul 22, 2022

Conversation

mamico
Copy link
Member

@mamico mamico commented Jul 18, 2022

In a large Plone website with a ZCatalog containing more than 400K objects, we found poor performance in catalog queries. This is probably not surprising.

But upon further analysis, I saw that many long-running queries were about searching on effectiveRange (type DateRangeIndex)

if resultset is None:
# Aggregate sets for each bucket separately, to avoid
# large-small union penalties.
until_only = multiunion(self._until_only.values(term))

The problem is that DateRangIndex (but also other indexes) behaves very differently if it is called as first (with resultset equal to None) or not. The main goal of the catalog plan is to solve the above problem by ordering the queries to the indexes from fastest to slowest and executing the queries in that order. So, I was surprised that sometimes the effectiveRange was executed first, resulting in a large performance penalty.

This is what I hypothesized: suppose we have a search with path (type PathIndex) and effectiveRange (type DateRangeIndex). In the first time, without a plan, the indexes are sorted alphabetically, so effectiveRange is the first one executed. In subsequent searches, the plan, correctly, sorts the indexes and the path query is executed first, followed by the slower effectiveRange.

But if, at some point, the path query returns no results, effectiveRange will not be executed

if not rs:
break
and in the query plan, for an index that does not have a benchmark in the current search, the new benchmark saved will be (0, 0, False) (i.e., 0s duration, 0 hits and the last boolean means that the index does not filter results)
if key not in self.benchmark.keys():
self.benchmark[key] = Benchmark(0, 0, False)
With this benchmark effectiveRange becomes the first index used in the next search, resulting in poor performance.

In the proposed implementation, I changed this last behavior: if the search was not performed for an index, the benchmark used remains the previous one, if it exists.

I think there is still more room for improvement in the code, studying more sophisticated query planners. But, of course, adding these elements will also add more complexity.

In the meantime, in all the real-world situations where I have tested the implementation proposed here, I have experienced a terrific improvement in performance.

p.s. For easily try the code here, I have just released a package https://pypi.org/project/experimental.catalogplan/ with a monkey patch that implements the same changes proposed here.

@mamico mamico requested review from jensens and ale-rt July 18, 2022 22:24
Copy link
Member

@davisagli davisagli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice analysis, fix, and test. I don't have a good place to try this out at the moment, but the code makes sense to me, and it seems pretty low risk.

Don't forget to add an entry to the changelog!

Copy link
Member

@ale-rt ale-rt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive!
As mentioned by David this is missing a changelog entry :)

Copy link
Member

@jensens jensens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@davisagli davisagli merged commit af7c5cc into master Jul 22, 2022
@davisagli davisagli deleted the mamico/catalogplan branch July 22, 2022 01:23
@davisagli
Copy link
Member

@icemac @dataflake Could we have a new release, please?

@mamico
Copy link
Member Author

mamico commented Jul 22, 2022

@icemac @dataflake Could we have a new release, please?

Please, just a sec, I've another PR that, after review, we can add, or not, to the next release. I will submit in the next hours.

@icemac
Copy link
Member

icemac commented Aug 3, 2022

I just created a release, see https://pypi.org/project/Products.ZCatalog/6.3/.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants