Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(streaming): introduce cache policy for hash join #3983

Closed
wants to merge 6 commits into from

Conversation

yuhao-su
Copy link
Contributor

@yuhao-su yuhao-su commented Jul 18, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

PLEASE DO NOT LEAVE THIS EMPTY !!!

Previously our hash join executor only populate cache on reading a join key. Now we allow users to specify whether to populate cache on read or on both read an write.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

To populate cache on read (default setting) use the following statement:

SET RW_STREAMING_HASH_JOIN_POLICY TO OnRead;

To read cache on both read and write use the following statement:

SET RW_STREAMING_HASH_JOIN_POLICY TO OnReadWrite;

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

@codecov
Copy link

codecov bot commented Jul 18, 2022

Codecov Report

Merging #3983 (30b06c8) into main (66e37d8) will decrease coverage by 0.00%.
The diff coverage is 73.68%.

@@            Coverage Diff             @@
##             main    #3983      +/-   ##
==========================================
- Coverage   73.88%   73.88%   -0.01%     
==========================================
  Files         828      828              
  Lines      117044   117143      +99     
==========================================
+ Hits        86481    86551      +70     
- Misses      30563    30592      +29     
Flag Coverage Δ
rust 73.88% <73.68%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/stream/src/executor/mod.rs 53.70% <ø> (ø)
src/stream/src/from_proto/hash_join.rs 0.00% <0.00%> (ø)
src/common/src/session_config/mod.rs 31.41% <25.00%> (-0.54%) ⬇️
src/stream/src/executor/managed_state/join/mod.rs 87.62% <54.16%> (-4.56%) ⬇️
src/common/src/session_config/config_types.rs 86.41% <86.41%> (ø)
...ontend/src/optimizer/plan_node/stream_hash_join.rs 91.35% <100.00%> (+0.34%) ⬆️
src/stream/src/executor/hash_join.rs 97.02% <100.00%> (+<0.01%) ⬆️
src/frontend/src/expr/literal.rs 91.20% <0.00%> (-1.10%) ⬇️
src/common/src/types/ordered_float.rs 29.10% <0.00%> (+0.19%) ⬆️
... and 1 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@yuhao-su yuhao-su requested a review from BugenZhao July 19, 2022 03:38
@fuyufjh
Copy link
Member

fuyufjh commented Jul 19, 2022

  1. IIUC, your goal is to save a write, but actually since there could be more than 1 rows in a join entry, either OnRead or OnReadWrite needs to use a read to populate a join entry entirely, so I didn't see any difference. Even worse, if the join is between a fact table and a dimension table, populating cache on the fact table side is likely to be useless, because there are very few changes on the dimension table side.
  2. I don't like the idea of configurable RW_STREAMING_HASH_JOIN_POLICY because users must never understand these internal details and nobody will use this config in the future. Please just select a generally good policy and remove others.

@yuhao-su
Copy link
Contributor Author

Preliminary benching (tpch q12) shows both policy produced similar result. However we are currently facing a compactor OOM issue that causes performance deterioration within 5 min. I will bench it later after #3921 solved.

@yuhao-su yuhao-su marked this pull request as draft July 21, 2022 09:19
@github-actions
Copy link
Contributor

This PR has been open for 60 days with no activity. Could you please update the status? Feel free to ping a reviewer if you are waiting for review.

@lmatz
Copy link
Contributor

lmatz commented Sep 27, 2022

Will this PR be part of #5395 or #5395 will make this PR obsolete?

@yuhao-su
Copy link
Contributor Author

Will this PR be part of #5395 or #5395 will make this PR obsolete?

No. They are not related.

@TennyZhuang TennyZhuang closed this Jan 4, 2023
@TennyZhuang
Copy link
Contributor

Close due to outdated and controversies.

@lmatz lmatz deleted the cache_policy_for_hash_join branch January 16, 2023 02:21
@lmatz lmatz restored the cache_policy_for_hash_join branch January 16, 2023 02:48
@xxchan xxchan deleted the cache_policy_for_hash_join branch May 14, 2023 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants