-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix catalog plan for performance #138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice analysis, fix, and test. I don't have a good place to try this out at the moment, but the code makes sense to me, and it seems pretty low risk.
Don't forget to add an entry to the changelog!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive!
As mentioned by David this is missing a changelog entry :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
@icemac @dataflake Could we have a new release, please? |
Please, just a sec, I've another PR that, after review, we can add, or not, to the next release. I will submit in the next hours. |
I just created a release, see https://pypi.org/project/Products.ZCatalog/6.3/. |
In a large Plone website with a ZCatalog containing more than 400K objects, we found poor performance in catalog queries. This is probably not surprising.
But upon further analysis, I saw that many long-running queries were about searching on
effectiveRange
(typeDateRangeIndex
)Products.ZCatalog/src/Products/PluginIndexes/DateRangeIndex/DateRangeIndex.py
Lines 282 to 285 in 5a3272f
The problem is that
DateRangIndex
(but also other indexes) behaves very differently if it is called as first (withresultset
equal to None) or not. The main goal of the catalog plan is to solve the above problem by ordering the queries to the indexes from fastest to slowest and executing the queries in that order. So, I was surprised that sometimes theeffectiveRange
was executed first, resulting in a large performance penalty.This is what I hypothesized: suppose we have a search with
path
(typePathIndex
) andeffectiveRange
(typeDateRangeIndex
). In the first time, without a plan, the indexes are sorted alphabetically, soeffectiveRange
is the first one executed. In subsequent searches, the plan, correctly, sorts the indexes and thepath
query is executed first, followed by the slowereffectiveRange
.But if, at some point, the
path
query returns no results,effectiveRange
will not be executedProducts.ZCatalog/src/Products/ZCatalog/Catalog.py
Lines 638 to 639 in 5a3272f
Products.ZCatalog/src/Products/ZCatalog/plan.py
Lines 306 to 307 in 5a3272f
effectiveRange
becomes the first index used in the next search, resulting in poor performance.In the proposed implementation, I changed this last behavior: if the search was not performed for an index, the benchmark used remains the previous one, if it exists.
I think there is still more room for improvement in the code, studying more sophisticated query planners. But, of course, adding these elements will also add more complexity.
In the meantime, in all the real-world situations where I have tested the implementation proposed here, I have experienced a terrific improvement in performance.
p.s. For easily try the code here, I have just released a package https://pypi.org/project/experimental.catalogplan/ with a monkey patch that implements the same changes proposed here.