
docs: add a proposal for radix hash join #7761

Merged: 4 commits merged into pingcap:master from radix-proposal on Sep 25, 2018

Conversation

XuHuaiyu (Contributor)

What problem does this PR solve?

Add a proposal for implementing radix hash join.

What is changed and how it works?

Check List

Tests

  • No code

Code changes

Side effects
none

Related changes

@XuHuaiyu (Contributor, Author)

PTAL @zz-jason @winoros @eurekaka

@XuHuaiyu mentioned this pull request on Sep 21, 2018.

## Background

Currently, TiDB uses `non-partitioning hash join` to implement `HashJoinExec`. This algorithm starts by building a **single global hash table** for the inner relation which is known as the `build phase`. After the hash table be built, `probe phase` assigns equal-sized sub-relations (chunks) of the outer relation to multiple threads to probe against the global hash table simultaneously.
Review comment (Contributor):

-> ... After the hash table is built, probe phase assigns equal-sized sub-relations (chunks) of the outer relation to multiple threads to probe against the global hash table simultaneously.


[Shatdal et al.](http://www.inf.uni-konstanz.de/dbis/teaching/ws0203/main-memory-dbms/download/CCA.pdf) identify that when the hash table is larger than the cache size, almost every access to the
hash table results in a cache miss. So the `build phase` and `probe phase` of `non-partitioning hash join` would cause considerable cache miss when writing or reading the single global hash table.
Review comment (Contributor):

... So the build phase and probe phase of non-partitioning hash join might cause considerable cache miss when writing or reading the single global hash table.


1. Partition algorithm

We can use the **parallel radix cluster** algorithm raised by [Balkesen et al.](https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf) to do the partitioning work. This algorithm clusters on the leftmost or rightmost **B bits** of the integer hash-value and divide a relation into **H sub-relations** parallel, where *H = 1 << B*.
Review comment (Contributor):

For "This algorithm clusters on the leftmost or rightmost B bits of the integer hash-value and divide a relation into H sub-relations parallel...", if "This algorithm" is the subject and "clusters" is a verb, then you can change this sentence to
This algorithm clusters on the leftmost or rightmost B bits of the integer hash-value and divides a relation into H sub-relations parallelly(?)...


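As a rough illustration of the clustering rule above, the following Go sketch assigns rows to *H = 1 << B* partitions using the rightmost **B bits** of a precomputed integer hash value; the slice-of-row-indexes layout and all names are illustrative, not TiDB's actual chunk-based storage:

```go
package join

// radixBits is B, the number of hash bits used for clustering; the
// relation is split into H = 1 << B partitions.
const radixBits = 4

// radixCluster assigns each row to a partition using the rightmost
// radixBits bits of its (precomputed) integer hash value, in one pass.
func radixCluster(hashVals []uint64) [][]int {
	numPartitions := 1 << radixBits   // H = 1 << B
	mask := uint64(numPartitions - 1) // keeps only the rightmost B bits
	partitions := make([][]int, numPartitions)
	for rowIdx, h := range hashVals {
		p := h & mask // partition index of this row
		partitions[p] = append(partitions[p], rowIdx)
	}
	return partitions
}
```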

The original algorithm does the clustering job using multiple passes and each pass looks at a different set of bits, to ensure that opened memory pages would be no more than the TLB entries at any point of time. As a result, the TLB miss, caused by randomly writing tuples to a large number of partitions, can be avoided. Two or three passes are needed to create cache-sized partitions, since data would be copied during every pass of `partition phase`, and the possibility of OOM may increase. So we choose 1-pass radix cluster algorithm to avoid the risk of OOM.
Review comment (Contributor):

... to ensure that opened memory pages will be no more than the TLB entries at any point of time...Two or three passes are needed to create cache-sized partitions, since data will be copied...

Review comment (Contributor):

Two or three passes are needed to create cache-sized partitions, since data would be copied during every pass of partition phase, and the possibility of OOM may increase. So we choose 1-pass radix cluster algorithm to avoid the risk of OOM.

The original algorithm normally involves two or three passes to create cache-sized partitions, and in each pass data would be copied, so it may have a high OOM risk. We decide to choose the simple 1-pass radix cluster algorithm to avoid the OOM risk.



Furthermore, `partition phase` can be executed parallelly by subdividing both relations into sub-relations that are assigned to individual threads, but there will be contention when different threads write tuples to the same partition simultaneously, especially when data is skewed heavily. In order to resolve this problem, we scan both input relations twice. The first scan computes a set of histograms over the input data, so the exact output size is known for each thread and each partition. Next, a contiguous memory space is allocated for the output. Finally, all threads perform their partitioning without any synchronize. In addition, we use **[]\*chunk.List** to store every partition, concurrent write to **column.nullBitmap** may also raise contention problem during the second scan. So we also need to pre-write the null-value info during the first scan.
Review comment (Contributor):

...Finally, all threads perform their partitioning without any synchronization...
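The two-scan scheme above can be sketched roughly as follows in Go, with goroutines standing in for the partitioning threads and bare hash values standing in for chunk rows (function and variable names are invented for illustration): the first scan builds per-thread histograms, a prefix sum turns them into disjoint write regions, and the second scan scatters rows with no synchronization on the output.

```go
package join

import "sync"

// parallelPartition partitions hashVals into 1 << radixBits clusters using
// numThreads workers and the two-scan, histogram-based scheme: scan 1 counts
// how many rows each worker sends to each partition, a prefix sum converts
// the counts into disjoint write regions in a preallocated output, and
// scan 2 writes rows into those regions without any synchronization.
func parallelPartition(hashVals []uint64, numThreads, radixBits int) []uint64 {
	numParts := 1 << radixBits
	mask := uint64(numParts - 1)
	chunkLen := (len(hashVals) + numThreads - 1) / numThreads
	bounds := func(t int) (int, int) { // rows [lo, hi) handled by worker t
		lo, hi := t*chunkLen, (t+1)*chunkLen
		if lo > len(hashVals) {
			lo = len(hashVals)
		}
		if hi > len(hashVals) {
			hi = len(hashVals)
		}
		return lo, hi
	}

	// Scan 1: histogram[t][p] = number of rows worker t writes to partition p.
	histogram := make([][]int, numThreads)
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		histogram[t] = make([]int, numParts)
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			lo, hi := bounds(t)
			for _, h := range hashVals[lo:hi] {
				histogram[t][h&mask]++
			}
		}(t)
	}
	wg.Wait()

	// Prefix sum: each (partition, worker) pair gets a private, contiguous
	// write region, so the exact output layout is known before scan 2.
	offsets := make([][]int, numThreads)
	for t := range offsets {
		offsets[t] = make([]int, numParts)
	}
	pos := 0
	for p := 0; p < numParts; p++ {
		for t := 0; t < numThreads; t++ {
			offsets[t][p] = pos
			pos += histogram[t][p]
		}
	}

	// Scan 2: every worker writes only inside its own regions, lock-free.
	out := make([]uint64, len(hashVals))
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			cursor := append([]int(nil), offsets[t]...) // private write cursors
			lo, hi := bounds(t)
			for _, h := range hashVals[lo:hi] {
				p := h & mask
				out[cursor[p]] = h
				cursor[p]++
			}
		}(t)
	}
	wg.Wait()
	return out
}
```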


3.1 We can define each partition as a chunk.List, thus `HashJoinExec` should maintain an attribute **partitions []*chunk.List**.

3.2 For outer relation, `partition phase` and `probe phase` would be done batch-by-batch. The size of every batch should be configurable for exploring better performance.
Review comment (Contributor):

would -> will
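A rough sketch of the executor state implied by 3.1 and 3.2 might look like the following; only the partitions []*chunk.List attribute comes from the proposal, while the remaining field names and the import path are assumptions made for illustration:

```go
package executor

import "github.com/pingcap/tidb/util/chunk"

// HashJoinExec sketch showing the state the proposal would add.
type HashJoinExec struct {
	// partitions[i] holds the rows of the inner relation whose hash value
	// falls into radix partition i (point 3.1).
	partitions []*chunk.List

	// radixBits is B; len(partitions) == 1 << radixBits.
	radixBits uint

	// outerBatchSize is the configurable number of outer rows that are
	// partitioned and probed in one round (point 3.2).
	outerBatchSize int
}
```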


Since the L3 cache is often shared between multiple processors, we'd better make the size of every partition no more than the L2 cache size. The L2 cache size, denoted *L*, can be obtained from a third-party library; we can set the size of every partition to *L \* 3/4* to ensure that one partition of the inner relation, one hash table, and one partition of the outer relation fit into the L2 cache when the input data obeys a uniform distribution.

The number of partition is controlled by the number of radix bit which is affected by the total size of inner relation. Suppose the total size of inner relation is *S*, the number of radix bit is *B*, we can infer *B = log(S / L)*.
Review comment (Contributor):

The number of partitions is controlled by the number of radix bits which is affected by the total size of inner relation. Suppose the total size of inner relation is S, the number of radix bits...
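One possible reading of the sizing rule above, sketched in Go: the inner relation size *S* and the L2 cache size *L* are taken as inputs, and rounding up against the *L \* 3/4* target is an assumption rather than part of the proposal.

```go
package join

import "math"

// numRadixBits returns B so that splitting an inner relation of innerSize
// bytes into 1 << B partitions keeps each partition at roughly 3/4 of the
// L2 cache size l2Size, following B = log(S / L) from the proposal.
func numRadixBits(innerSize, l2Size int64) uint {
	target := l2Size * 3 / 4 // leave room for the hash table and the outer partition
	if target <= 0 || innerSize <= target {
		return 0 // the inner relation already fits; no partitioning needed
	}
	return uint(math.Ceil(math.Log2(float64(innerSize) / float64(target))))
}
```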


## Rationale

This proposal introduce a `partition phase` into `HashJoinExec` to partition the input relations into small partitions where one of the partition typically fits into the L2 cache. Through this, the ratio of cache during `build phase` and `probe phase` can be reduced and thus improve the total performance of `HashJoinExec`.
Review comment (Contributor):

This proposal introduces ...

Review comment (Contributor):

the ratio of cache during

the ratio of cache miss during ...



The side effect is that the memory usage of `HashJoinExec` would increase since the `partition phase` copies the input data.
Review comment (Contributor):

would -> might


## Implementation

1. Investigate the third-party libraries for fetching the L2 cache size and calculate the number of radix bit before `build phase`. @XuHuaiyu
Review comment (Contributor):

bit -> bits



Review comment (Contributor) on the `partition phase` paragraph:

by subdividing both relations into sub-relations that are assigned to individual threads

by subdividing partitions into sub-partitions and assigning them to different threads

Review comment (Contributor):

In addition, we use []*chunk.List to store every partition, concurrent write to column.nullBitmap may also raise contention problem during the second scan. So we also need to pre-write the null-value info during the first scan.

Could you please elaborate more on this? thanks



@eurekaka (Contributor) left a comment:

LGTM

@zz-jason (Member) left a comment:

LGTM

@zz-jason added the status/LGT2 and sig/execution labels on Sep 25, 2018.
@zz-jason merged commit 257d1b4 into pingcap:master on Sep 25, 2018.
@XuHuaiyu deleted the radix-proposal branch on September 25, 2018 at 11:08.
Labels: proposal, sig/execution, status/LGT2