docs: add a proposal for radix hash join #7761
## Background
Currently, TiDB uses `non-partitioning hash join` to implement `HashJoinExec`. This algorithm starts by building a **single global hash table** for the inner relation, which is known as the `build phase`. After the hash table is built, the `probe phase` assigns equal-sized sub-relations (chunks) of the outer relation to multiple threads to probe against the global hash table simultaneously.
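As a point of reference, here is a minimal sketch of the two phases. This is not TiDB's actual `HashJoinExec` code; `row`, the map-based hash table, and the chunking are simplified stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

type row struct {
	key int
	val string
}

// hashJoin builds one global hash table from the inner relation (build
// phase), then lets worker goroutines probe it with equal-sized chunks
// of the outer relation (probe phase).
func hashJoin(inner, outer []row, workers int) [][2]row {
	// build phase: a single global hash table for the whole inner relation
	ht := make(map[int][]row, len(inner))
	for _, r := range inner {
		ht[r.key] = append(ht[r.key], r)
	}

	// probe phase: every worker probes the shared table concurrently
	results := make([][][2]row, workers)
	chunk := (len(outer) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(outer) {
			lo = len(outer)
		}
		if hi > len(outer) {
			hi = len(outer)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, o := range outer[lo:hi] {
				for _, i := range ht[o.key] {
					results[w] = append(results[w], [2]row{i, o})
				}
			}
		}(w, lo, hi)
	}
	wg.Wait()

	var joined [][2]row
	for _, part := range results {
		joined = append(joined, part...)
	}
	return joined
}

func main() {
	inner := []row{{1, "a"}, {2, "b"}}
	outer := []row{{1, "x"}, {3, "y"}, {1, "z"}}
	fmt.Println(hashJoin(inner, outer, 2))
}
```

Note that every worker reads the same `ht`; this shared global structure is exactly what causes the cache behavior described next.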
[Shatdal et al.](http://www.inf.uni-konstanz.de/dbis/teaching/ws0203/main-memory-dbms/download/CCA.pdf) identify that when the hash table is larger than the cache size, almost every access to the hash table results in a cache miss. So the `build phase` and `probe phase` of `non-partitioning hash join` might cause considerable cache misses when writing or reading the single global hash table.
1. Partition algorithm
We can use the **parallel radix cluster** algorithm proposed by [Balkesen et al.](https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf) to do the partitioning work. This algorithm clusters on the leftmost or rightmost **B bits** of the integer hash value and divides a relation into **H sub-relations** in parallel, where *H = 1 << B*.
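A sketch of the clustering step, under the assumption that we take the rightmost bits (the concrete value of `B` is only illustrative):

```go
// Number of radix bits and resulting partition count; H = 1 << B.
const (
	B = 4      // radix bits (illustrative value)
	H = 1 << B // 16 partitions
)

// partitionIdx picks a partition from the integer hash value by masking
// out its rightmost B bits. For the leftmost bits of a 64-bit hash,
// use hash >> (64 - B) instead.
func partitionIdx(hash uint64) uint64 {
	return hash & (H - 1)
}
```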
The original algorithm does the clustering job in multiple passes, each pass looking at a different set of bits, to ensure that the number of open memory pages never exceeds the number of TLB entries at any point in time. As a result, the TLB misses caused by randomly writing tuples to a large number of partitions can be avoided. However, it normally takes two or three passes to create cache-sized partitions, and since data is copied during every pass of the `partition phase`, the possibility of OOM increases with each pass. So we choose the 1-pass radix cluster algorithm to avoid the risk of OOM.
Furthermore, the `partition phase` can be executed in parallel by subdividing both relations into sub-relations that are assigned to individual threads, but there will be contention when different threads write tuples to the same partition simultaneously, especially when the data is heavily skewed. In order to resolve this problem, we scan both input relations twice. The first scan computes a set of histograms over the input data, so the exact output size is known for each thread and each partition. Next, a contiguous memory space is allocated for the output. Finally, all threads perform their partitioning without any synchronization. In addition, since we use **[]\*chunk.List** to store the partitions, concurrent writes to **column.nullBitmap** may also raise a contention problem during the second scan, so we also need to pre-write the null-value info during the first scan.
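A minimal sketch of this two-scan scheme, reusing `partitionIdx`, `H`, and the `sync` import from the sketches above, with bare `uint64` hash values standing in for tuples and the `chunk.List`/null-bitmap details omitted:

```go
// radixPartition clusters rel into H partitions laid out contiguously in
// one output buffer, with no locking in the second scan.
func radixPartition(rel []uint64, threads int) []uint64 {
	chunk := (len(rel) + threads - 1) / threads
	bounds := func(t int) (lo, hi int) {
		lo, hi = t*chunk, (t+1)*chunk
		if lo > len(rel) {
			lo = len(rel)
		}
		if hi > len(rel) {
			hi = len(rel)
		}
		return lo, hi
	}

	// first scan: per-thread histograms, so the exact output size is
	// known for each (thread, partition) pair
	hist := make([][]int, threads)
	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		hist[t] = make([]int, H)
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			lo, hi := bounds(t)
			for _, v := range rel[lo:hi] {
				hist[t][partitionIdx(v)]++
			}
		}(t)
	}
	wg.Wait()

	// prefix sum: every (thread, partition) pair gets an exclusive
	// write cursor into one contiguous output buffer
	cursor := make([][]int, threads)
	for t := range cursor {
		cursor[t] = make([]int, H)
	}
	off := 0
	for p := 0; p < H; p++ {
		for t := 0; t < threads; t++ {
			cursor[t][p] = off
			off += hist[t][p]
		}
	}

	// second scan: threads write without any synchronization, because
	// their cursor ranges are disjoint by construction
	out := make([]uint64, len(rel))
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			lo, hi := bounds(t)
			for _, v := range rel[lo:hi] {
				p := partitionIdx(v)
				out[cursor[t][p]] = v
				cursor[t][p]++
			}
		}(t)
	}
	wg.Wait()
	return out
}
```

The prefix sum is what removes the synchronization: every (thread, partition) pair owns a disjoint range of `out`, so the second scan never contends.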
3.1 We can define the partition as type `chunk.List`; thus, `HashJoinExec` should maintain an attribute **partitions []\*chunk.List**.
3.2 For the outer relation, the `partition phase` and `probe phase` will be done batch by batch. The size of every batch should be configurable for exploring better performance. Both attributes are sketched below.
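A rough sketch of what 3.1 and 3.2 imply for the executor; only `partitions []*chunk.List` comes from the text above, and the other field name is hypothetical:

```go
import "github.com/pingcap/tidb/util/chunk"

// HashJoinExec (excerpt): only the two attributes from 3.1 and 3.2 are
// shown; the rest of the struct is omitted.
type HashJoinExec struct {
	// partitions holds one chunk.List per radix partition, so
	// len(partitions) == H == 1 << B (3.1).
	partitions []*chunk.List

	// outerBatchSize is the configurable number of outer rows handled
	// per partition/probe batch (3.2); the name is hypothetical.
	outerBatchSize int
}
```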
Since the L3 cache is often shared between multiple processors, we'd better make the size of every partition no more than the L2 cache. The size of the L2 cache, denoted as *L*, can be obtained from a third-party library. We can set the size of every partition to *L \* 3/4* to ensure that one partition of the inner relation, one hash table, and one partition of the outer relation fit into the L2 cache when the input data obeys a uniform distribution.
The number of partitions is controlled by the number of radix bits, which is affected by the total size of the inner relation. Suppose the total size of the inner relation is *S* and the number of radix bits is *B*; we can infer *B = log2(S / L)*.
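As a hedged sketch of that inference (sizes in bytes, and using the *L \* 3/4* budget from the previous paragraph rather than the bare *L*):

```go
// numRadixBits derives B from the inner relation size S and the L2 cache
// size L, targeting partitions no larger than L * 3/4.
func numRadixBits(S, L int64) uint {
	target := L * 3 / 4
	var bits uint
	for S>>bits > target { // smallest B with S / (1 << B) <= target
		bits++
	}
	return bits // H = 1 << bits partitions
}
```

For example, with *S* = 256 MiB and *L* = 1 MiB this yields *B* = 9, i.e. *H* = 512 partitions of 512 KiB each.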
## Rationale
This proposal introduces a `partition phase` into `HashJoinExec` to partition the input relations into small partitions, where one partition typically fits into the L2 cache. Through this, the ratio of cache misses during the `build phase` and `probe phase` can be reduced, thus improving the total performance of `HashJoinExec`.
The side effect is that the memory usage of `HashJoinExec` might increase, since the `partition phase` copies the input data.
## Implementation
1. Investigate the third-party libraries for fetching the L2 cache size, and calculate the number of radix bits before the `build phase`. @XuHuaiyu
What problem does this PR solve?
Add a proposal for implementing radix hash join.
What is changed and how it works?
Check List
Tests
Code changes
Side effects
none
Related changes