
docs: add a proposal for radix hash join #7761

Merged: 4 commits merged into pingcap:master from radix-proposal on Sep 25, 2018

Conversation

XuHuaiyu (Contributor)

What problem does this PR solve?

Add a proposal for implementing radix hash join.

What is changed and how it works?

Check List

Tests

  • No code

Code changes

Side effects
none

Related changes

@XuHuaiyu (Contributor, Author)

PTAL @zz-jason @winoros @eurekaka

@XuHuaiyu mentioned this pull request on Sep 21, 2018.

## Background

Currently, TiDB uses `non-partitioning hash join` to implement `HashJoinExec`. This algorithm starts by building a **single global hash table** for the inner relation which is known as the `build phase`. After the hash table be built, `probe phase` assigns equal-sized sub-relations (chunks) of the outer relation to multiple threads to probe against the global hash table simultaneously.
Review comment (Contributor):

-> ... After the hash table is built, probe phase assigns equal-sized sub-relations (chunks) of the outer relation to multiple threads to probe against the global hash table simultaneously.


[Shatdal et al.](http://www.inf.uni-konstanz.de/dbis/teaching/ws0203/main-memory-dbms/download/CCA.pdf) identify that when the hash table is larger than the cache size, almost every access to the
hash table results in a cache miss. So the `build phase` and `probe phase` of `non-partitioning hash join` would cause considerable cache miss when writing or reading the single global hash table.
Review comment (Contributor):

... So the build phase and probe phase of non-partitioning hash join might cause considerable cache miss when writing or reading the single global hash table.


1. Partition algorithm

We can use the **parallel radix cluster** algorithm raised by [Balkesen et al.](https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf) to do the partitioning work. This algorithm clusters on the leftmost or rightmost **B bits** of the integer hash-value and divide a relation into **H sub-relations** parallel, where *H = 1 << B*.
Review comment (Contributor):

For "This algorithm clusters on the leftmost or rightmost B bits of the integer hash-value and divide a relation into H sub-relations parallel...", if "This algorithm" is the subject and "clusters" is a verb, then you can change this sentence to
This algorithm clusters on the leftmost or rightmost B bits of the integer hash-value and divides a relation into H sub-relations parallelly(?)...


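As a rough illustration of the clustering rule above, the following Go sketch assigns rows to *H = 1 << B* partitions using the rightmost **B bits** of a precomputed integer hash value; the slice-of-row-indexes layout and all names are illustrative, not TiDB's actual chunk-based storage:

```go
package join

// radixBits is B, the number of hash bits used for clustering; the
// relation is split into H = 1 << B partitions.
const radixBits = 4

// radixCluster assigns each row to a partition using the rightmost
// radixBits bits of its (precomputed) integer hash value, in one pass.
func radixCluster(hashVals []uint64) [][]int {
	numPartitions := 1 << radixBits   // H = 1 << B
	mask := uint64(numPartitions - 1) // keeps only the rightmost B bits
	partitions := make([][]int, numPartitions)
	for rowIdx, h := range hashVals {
		p := h & mask // partition index of this row
		partitions[p] = append(partitions[p], rowIdx)
	}
	return partitions
}
```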

The original algorithm does the clustering job using multiple passes and each pass looks at a different set of bits, to ensure that opened memory pages would be no more than the TLB entries at any point of time. As a result, the TLB miss, caused by randomly writing tuples to a large number of partitions, can be avoided. Two or three passes are needed to create cache-sized partitions, since data would be copied during every pass of `partition phase`, and the possibility of OOM may increase. So we choose 1-pass radix cluster algorithm to avoid the risk of OOM.
Review comment (Contributor):

... to ensure that opened memory pages will be no more than the TLB entries at any point of time...Two or three passes are needed to create cache-sized partitions, since data will be copied...

Review comment (Contributor):

Two or three passes are needed to create cache-sized partitions, since data would be copied during every pass of partition phase, and the possibility of OOM may increase. So we choose 1-pass radix cluster algorithm to avoid the risk of OOM.

The original algorithm normally involves two or three passes to create cache-sized partitions, and in each pass data would be copied, so it may have a high OOM risk. We decide to choose the simple 1-pass radix cluster algorithm to avoid the OOM risk.



Furthermore, `partition phase` can be executed parallelly by subdividing both relations into sub-relations that are assigned to individual threads, but there will be contention when different threads write tuples to the same partition simultaneously, especially when data is skewed heavily. In order to resolve this problem, we scan both input relations twice. The first scan computes a set of histograms over the input data, so the exact output size is known for each thread and each partition. Next, a contiguous memory space is allocated for the output. Finally, all threads perform their partitioning without any synchronize. In addition, we use **[]\*chunk.List** to store every partition, concurrent write to **column.nullBitmap** may also raise contention problem during the second scan. So we also need to pre-write the null-value info during the first scan.
Review comment (Contributor):

...Finally, all threads perform their partitioning without any synchronization...
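The two-scan scheme above can be sketched roughly as follows in Go, with goroutines standing in for the partitioning threads and bare hash values standing in for chunk rows (function and variable names are invented for illustration): the first scan builds per-thread histograms, a prefix sum turns them into disjoint write regions, and the second scan scatters rows with no synchronization on the output.

```go
package join

import "sync"

// parallelPartition partitions hashVals into 1 << radixBits clusters using
// numThreads workers and the two-scan, histogram-based scheme: scan 1 counts
// how many rows each worker sends to each partition, a prefix sum converts
// the counts into disjoint write regions in a preallocated output, and
// scan 2 writes rows into those regions without any synchronization.
func parallelPartition(hashVals []uint64, numThreads, radixBits int) []uint64 {
	numParts := 1 << radixBits
	mask := uint64(numParts - 1)
	chunkLen := (len(hashVals) + numThreads - 1) / numThreads
	bounds := func(t int) (int, int) { // rows [lo, hi) handled by worker t
		lo, hi := t*chunkLen, (t+1)*chunkLen
		if lo > len(hashVals) {
			lo = len(hashVals)
		}
		if hi > len(hashVals) {
			hi = len(hashVals)
		}
		return lo, hi
	}

	// Scan 1: histogram[t][p] = number of rows worker t writes to partition p.
	histogram := make([][]int, numThreads)
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		histogram[t] = make([]int, numParts)
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			lo, hi := bounds(t)
			for _, h := range hashVals[lo:hi] {
				histogram[t][h&mask]++
			}
		}(t)
	}
	wg.Wait()

	// Prefix sum: each (partition, worker) pair gets a private, contiguous
	// write region, so the exact output layout is known before scan 2.
	offsets := make([][]int, numThreads)
	for t := range offsets {
		offsets[t] = make([]int, numParts)
	}
	pos := 0
	for p := 0; p < numParts; p++ {
		for t := 0; t < numThreads; t++ {
			offsets[t][p] = pos
			pos += histogram[t][p]
		}
	}

	// Scan 2: every worker writes only inside its own regions, lock-free.
	out := make([]uint64, len(hashVals))
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			cursor := append([]int(nil), offsets[t]...) // private write cursors
			lo, hi := bounds(t)
			for _, h := range hashVals[lo:hi] {
				p := h & mask
				out[cursor[p]] = h
				cursor[p]++
			}
		}(t)
	}
	wg.Wait()
	return out
}
```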


3.1 We can define each partition as a chunk.List, thus `HashJoinExec` should maintain an attribute **partitions []*chunk.List**.

3.2 For outer relation, `partition phase` and `probe phase` would be done batch-by-batch. The size of every batch should be configurable for exploring better performance.
Review comment (Contributor):

would -> will
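A rough sketch of the executor state implied by 3.1 and 3.2 might look like the following; only the partitions []*chunk.List attribute comes from the proposal, while the remaining field names and the import path are assumptions made for illustration:

```go
package executor

import "github.com/pingcap/tidb/util/chunk"

// HashJoinExec sketch showing the state the proposal would add.
type HashJoinExec struct {
	// partitions[i] holds the rows of the inner relation whose hash value
	// falls into radix partition i (point 3.1).
	partitions []*chunk.List

	// radixBits is B; len(partitions) == 1 << radixBits.
	radixBits uint

	// outerBatchSize is the configurable number of outer rows that are
	// partitioned and probed in one round (point 3.2).
	outerBatchSize int
}
```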


Since the L3 cache is often shared between multiple processors, we'd better make the size of every partition no more than the L2 cache size. The L2 cache size, denoted *L*, can be obtained from a third-party library; we can set the size of every partition to *L \* 3/4* to ensure that one partition of the inner relation, one hash table, and one partition of the outer relation fit into the L2 cache when the input data obeys a uniform distribution.

The number of partition is controlled by the number of radix bit which is affected by the total size of inner relation. Suppose the total size of inner relation is *S*, the number of radix bit is *B*, we can infer *B = log(S / L)*.
Review comment (Contributor):

The number of partitions is controlled by the number of radix bits which is affected by the total size of inner relation. Suppose the total size of inner relation is S, the number of radix bits...
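One possible reading of the sizing rule above, sketched in Go: the inner relation size *S* and the L2 cache size *L* are taken as inputs, and rounding up against the *L \* 3/4* target is an assumption rather than part of the proposal.

```go
package join

import "math"

// numRadixBits returns B so that splitting an inner relation of innerSize
// bytes into 1 << B partitions keeps each partition at roughly 3/4 of the
// L2 cache size l2Size, following B = log(S / L) from the proposal.
func numRadixBits(innerSize, l2Size int64) uint {
	target := l2Size * 3 / 4 // leave room for the hash table and the outer partition
	if target <= 0 || innerSize <= target {
		return 0 // the inner relation already fits; no partitioning needed
	}
	return uint(math.Ceil(math.Log2(float64(innerSize) / float64(target))))
}
```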


## Rationale

This proposal introduce a `partition phase` into `HashJoinExec` to partition the input relations into small partitions where one of the partition typically fits into the L2 cache. Through this, the ratio of cache during `build phase` and `probe phase` can be reduced and thus improve the total performance of `HashJoinExec`.
Review comment (Contributor):

This proposal introduces ...

Review comment (Contributor):

the ratio of cache during

the ratio of cache miss during ...



The side effect is that the memory usage of `HashJoinExec` would increase since the `partition phase` copies the input data.
Review comment (Contributor):

would -> might


## Implementation

1. Investigate the third-party libraries for fetching the L2 cache size and calculate the number of radix bit before `build phase`. @XuHuaiyu
Review comment (Contributor):

bit -> bits



Review comment (Contributor) on the `partition phase` paragraph:

by subdividing both relations into sub-relations that are assigned to individual threads

by subdividing partitions into sub-partitions and assigning them to different threads

Review comment (Contributor):

In addition, we use []*chunk.List to store every partition, concurrent write to column.nullBitmap may also raise contention problem during the second scan. So we also need to pre-write the null-value info during the first scan.

Could you please elaborate more on this? thanks



@eurekaka (Contributor) left a comment:

LGTM

@zz-jason (Member) left a comment:

LGTM

@zz-jason added the status/LGT2 and sig/execution labels on Sep 25, 2018.
@zz-jason merged commit 257d1b4 into pingcap:master on Sep 25, 2018.
@XuHuaiyu deleted the radix-proposal branch on September 25, 2018 at 11:08.
Labels: proposal, sig/execution, status/LGT2