Improve performance of generating distinct intersections #28
Generating distinct intersections on data with hundreds of
thousands of elements grinds to a halt. The running time appears
to be roughly O(n^2), meaning that doubling the data makes
execution take 2^2 = 4 times as long. With the help of profvis,
we find the main source to be a Filter call in pushCombination(),
which causes a doubly nested loop over the elements.
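
To illustrate why a filter inside a loop scales this way, here is a minimal
sketch in Python. It is not the R code from this PR: the function names
(distinct_quadratic, distinct_linear) and the (name, sets) element layout are
assumptions made up for the example. The first function re-scans all elements
for every distinct combination, which is O(k * n) and approaches O(n^2) when
the number of combinations k grows with n. The second groups the elements in
a single pass using a hash map.

```python
from collections import defaultdict


def distinct_quadratic(elems):
    """Group element names by exact set membership, re-scanning per combination.

    Hypothetical stand-in for a filter call inside a loop: every new
    combination triggers a full pass over elems, so the cost is O(k * n).
    """
    result = {}
    for _, sets in elems:
        key = frozenset(sets)
        if key in result:
            continue
        # Full scan over all elements, analogous to filtering inside the loop.
        result[key] = [name for name, s in elems if frozenset(s) == key]
    return result


def distinct_linear(elems):
    """Same grouping in one pass: each element is hashed and appended once."""
    result = defaultdict(list)
    for name, sets in elems:
        result[frozenset(sets)].append(name)
    return dict(result)


# Made-up sample data: each element lists the sets it belongs to.
sample = [
    ("a", ["A"]),
    ("b", ["A", "B"]),
    ("c", ["A"]),
    ("d", ["B", "A"]),
]

quadratic_groups = distinct_quadratic(sample)
linear_groups = distinct_linear(sample)
```

Both functions return the same grouping for the sample data, so the
single-pass version can replace the nested one without changing results.
Whether the actual fix in this PR uses the same approach is not stated here.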
Minimal benchmark on a fairly beefy computer (5950X, 128 GB RAM)
running Fedora Linux, R 4.1.3 and upsetjs 1.11.0, git hash 4b375a8:
Before this PR:
With this PR:
Also, scaling is now closer to O(n) or slightly better.
With 10x the data: