Fix catalog plan for performance #138

mamico · 2022-07-18T22:24:16Z

In a large Plone website with a ZCatalog containing more than 400K objects, we found poor performance in catalog queries. This is probably not surprising.

But upon further analysis, I saw that many long-running queries involved searching on effectiveRange (type DateRangeIndex):

Products.ZCatalog/src/Products/PluginIndexes/DateRangeIndex/DateRangeIndex.py

Lines 282 to 285 in 5a3272f

    if resultset is None:
        # Aggregate sets for each bucket separately, to avoid
        # large-small union penalties.
        until_only = multiunion(self._until_only.values(term))

The problem is that DateRangeIndex (like other indexes) behaves very differently depending on whether it is called first (with resultset equal to None) or later. The main goal of the catalog plan is to solve this problem by ordering the index queries from fastest to slowest and executing them in that order. So I was surprised that effectiveRange was sometimes executed first, resulting in a large performance penalty.
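To illustrate why the call order matters, here is a toy model. It is not the DateRangeIndex code: the `ToyRangeIndex` class, its `work` counter, and the sample data are all hypothetical, and the real index works on BTrees sets rather than plain Python loops. The sketch only shows that a query with no resultset has to consider everything in the index, while a query that receives a small resultset only has to look at those documents.

```python
class ToyRangeIndex:
    """Hypothetical stand-in for a range index; not the ZCatalog implementation."""

    def __init__(self, ranges):
        # ranges maps document id -> (start, end), both inclusive
        self.ranges = ranges
        self.work = 0  # number of documents examined by the last query

    def query(self, term, resultset=None):
        # Without a resultset every indexed document is a candidate;
        # with one, only the already-filtered documents are examined.
        candidates = self.ranges if resultset is None else resultset
        self.work = 0
        hits = set()
        for docid in candidates:
            self.work += 1
            start, end = self.ranges[docid]
            if start <= term <= end:
                hits.add(docid)
        return hits


index = ToyRangeIndex({docid: (0, 100) for docid in range(400000)})

index.query(50)  # called first: every document is examined
work_first = index.work

index.query(50, resultset={1, 2, 3})  # called after a selective index
work_later = index.work
```

In this toy model the first call examines all 400000 documents and the second only 3, which is the kind of gap the catalog plan is meant to exploit.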

This is what I hypothesized: suppose we have a search with path (type PathIndex) and effectiveRange (type DateRangeIndex). The first time, without a plan, the indexes are sorted alphabetically, so effectiveRange is executed first. In subsequent searches, the plan correctly sorts the indexes, and the path query is executed first, followed by the slower effectiveRange.

But if, at some point, the path query returns no results, effectiveRange will not be executed:

Products.ZCatalog/src/Products/ZCatalog/Catalog.py

Lines 638 to 639 in 5a3272f

    if not rs:
        break

and in the query plan, an index that has no benchmark in the current search gets a new saved benchmark of (0, 0, False) (i.e., 0 s duration, 0 hits, and a final boolean meaning that the index does not filter results):

Products.ZCatalog/src/Products/ZCatalog/plan.py

Lines 306 to 307 in 5a3272f

    if key not in self.benchmark.keys():
        self.benchmark[key] = Benchmark(0, 0, False)

With this benchmark, effectiveRange becomes the first index used in the next search, resulting in poor performance.
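The effect can be reproduced in a small simulation. This is a sketch under assumptions, not the code from plan.py: the `make_plan` and `finish_search` helpers are hypothetical, the `Benchmark` field names are assumed from the description above (duration, hits, filtering flag), the ranking rule (non-filtering indexes first, then by duration) is an assumption, and the timings are made up.

```python
from collections import namedtuple

# Assumed shape of a benchmark entry: duration in seconds, number of hits,
# and whether the index limits (filters) the result set.
Benchmark = namedtuple("Benchmark", "duration hits limit")


def make_plan(benchmark):
    # Assumed ranking rule: non-limiting indexes first, then by duration.
    return sorted(
        benchmark,
        key=lambda name: (benchmark[name].limit, benchmark[name].duration),
    )


def finish_search(measured, query_keys):
    # Behavior described above: any queried index without a measurement
    # in this search is stored with the placeholder benchmark.
    result = dict(measured)
    for key in query_keys:
        if key not in result:
            result[key] = Benchmark(0, 0, False)
    return result


query_keys = ["path", "effectiveRange"]

# A normal search: both indexes are measured (made-up numbers).
stored = finish_search(
    {
        "path": Benchmark(0.002, 120, True),
        "effectiveRange": Benchmark(0.350, 390000, True),
    },
    query_keys,
)
plan_before = make_plan(stored)

# A search where path matches nothing: effectiveRange is never queried.
stored = finish_search({"path": Benchmark(0.001, 0, True)}, query_keys)
plan_after = make_plan(stored)
```

Under these assumptions `plan_before` puts path first, while `plan_after` puts effectiveRange first, because the placeholder looks like an instantaneous, non-filtering index.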

In the proposed implementation, I changed this last behavior: if the search was not performed for an index, the benchmark used remains the previous one, if it exists.
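A minimal sketch of that change, assuming the previous benchmarks are available as a plain dictionary. The `finish_search` helper and its arguments are hypothetical names for illustration, and the `Benchmark` field names and numbers are assumptions; the actual patch lives in plan.py and may differ in detail.

```python
from collections import namedtuple

# Assumed shape of a benchmark entry: duration, hits, filtering flag.
Benchmark = namedtuple("Benchmark", "duration hits limit")


def finish_search(previous, measured, query_keys):
    """Merge this search's measurements with earlier benchmarks.

    Hypothetical helper: an index that was not queried in this search
    keeps its previous benchmark when one exists, and only falls back
    to the (0, 0, False) placeholder when nothing was ever recorded.
    """
    result = dict(measured)
    for key in query_keys:
        if key in result:
            continue
        if previous and key in previous:
            result[key] = previous[key]
        else:
            result[key] = Benchmark(0, 0, False)
    return result


previous = {
    "path": Benchmark(0.002, 120, True),
    "effectiveRange": Benchmark(0.350, 390000, True),
}
# path returned nothing, so effectiveRange was skipped in this search.
measured = {"path": Benchmark(0.001, 0, True)}

updated = finish_search(previous, measured, ["path", "effectiveRange"])
```

Here `updated["effectiveRange"]` is still the earlier measurement, so a plan built from it keeps the slow index after path.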

I think there is still room for improvement in the code, for example by studying more sophisticated query planners. But, of course, adding these elements will also add more complexity.

In the meantime, in all the real-world situations where I have tested the implementation proposed here, I have experienced a terrific improvement in performance.

P.S. To make the code easy to try, I have just released a package, https://pypi.org/project/experimental.catalogplan/, with a monkey patch that implements the same changes proposed here.

davisagli

Nice analysis, fix, and test. I don't have a good place to try this out at the moment, but the code makes sense to me, and it seems pretty low risk.

Don't forget to add an entry to the changelog!

ale-rt

Impressive!
As mentioned by David this is missing a changelog entry :)

jensens

Good catch!

davisagli · 2022-07-22T01:27:56Z

@icemac @dataflake Could we have a new release, please?

mamico · 2022-07-22T07:28:03Z

@icemac @dataflake Could we have a new release, please?

Please wait a moment: I have another PR that, after review, we may or may not add to the next release. I will submit it in the next few hours.

icemac · 2022-08-03T06:18:29Z

I just created a release, see https://pypi.org/project/Products.ZCatalog/6.3/.

mamico added 2 commits July 16, 2022 17:47

better catalog plan to improve search performance

be338e8

py27 tests

c4d70a4

mamico requested review from jensens and ale-rt July 18, 2022 22:24

davisagli approved these changes Jul 19, 2022

ale-rt approved these changes Jul 19, 2022

jensens approved these changes Jul 19, 2022

mamico mentioned this pull request Jul 21, 2022

Implements IDateRangeIndex to exclude DateRecurringIndex by indexes with value in the keys of the catalog plan collective/Products.DateRecurringIndex#8

Merged

changelog

a084885

davisagli merged commit af7c5cc into master Jul 22, 2022

davisagli deleted the mamico/catalogplan branch July 22, 2022 01:23
