Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow support of non-equi joins #6922

Closed
pnowojski opened this issue Dec 19, 2016 · 1 comment
Closed

Slow support of non-equi joins #6922

pnowojski opened this issue Dec 19, 2016 · 1 comment

Comments

@pnowojski
Copy link
Member

pnowojski commented Dec 19, 2016

From time to time we stumble on queries for which the most selective "join predicate" is non-equi. For example query posted by CERN earlier this year:

https://databricks.com/blog/2016/10/03/voice-from-cern-apache-spark-2-0-performance-improvements-investigated-with-flame-graphs.html

select a.bucket, sum(a.val2) tot from t1 a, t1 b 
where a.bucket=b.bucket and a.val1+b.val1<1000 group by a.bucket order by a.bucket

which can be simplified to something like this:

select count(*) from lineitem l1, lineitem l2 
where l1.suppkey=l2.suppkey and l1.quantity + l2.quantity > 80;

One solution would be to implement sort-merge join, but it would require lots of changes in the code base. I think simpler solution would be to add better positionLinks resolving to our JoinHash.

Currently positionLinks is a one directional list of positions that are "equals" based on equality join condition. Those positionLinks are then traversed and filterFunction is applied to each one of them. This makes it O(n^2) for queries like above (or more precisely O(n * m / NDV + result_size) where n and m are sizes of joined tables and NDV number of distinct values on equality join column).

I have experimented with sorting position links which allows to perform binary search instead of traversing all of the positionLinks linearly, reducing complexity to O(n*log(m) + result_size). Basically for every probe row, we can do binary search on position links to jump at the start of positionLinks for which filterFunction returns true and return no match immediately after first non matching row.

I have working code (pnowojski@d84a842) that covers some simple cases (single column sort order for simple non-equi expressions), but it requires some code refactoring, so I would like to sync up with rest of you before investing more time into this.

@findepi
Copy link
Contributor

findepi commented Nov 23, 2017

Implemented.

@findepi findepi closed this as completed Nov 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants