Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36092: [C++] Simplify concurrency in as-of-join node #36094

Merged
merged 1 commit into from
Jun 22, 2023

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Jun 15, 2023

What changes are included in this PR?

The key hasher invalidation and memo-store time updating are moved to the processing thread.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

No.

@rtpsw
Copy link
Contributor Author

rtpsw commented Jun 15, 2023

cc @westonpace, @icexelloss

@rtpsw
Copy link
Contributor Author

rtpsw commented Jun 16, 2023

cc @westonpace, @icexelloss

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Let me verify I understand:

Before, on InputReceived, we would update the memo for an input based on the new batch.

Now, the only thing that really happens on InputReceived, is the batch is placed into the appropriate input queue. The update to the memo table happens when teh processing thread runs, as it is processing batches for an input.

Comment on lines +375 to +376
bool update = current_time_ < ts;
if (update) current_time_ = ts;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool update = current_time_ < ts;
if (update) current_time_ = ts;
current_time_ = std::min(current_time_, ts);

Minor nit: Maybe simpler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method still needs to return a Boolean, so we can't remove the bool update line, but we can replace the if-statement.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jun 21, 2023
@icexelloss
Copy link
Contributor

This makes sense to me. Let me verify I understand:

Before, on InputReceived, we would update the memo for an input based on the new batch.

Now, the only thing that really happens on InputReceived, is the batch is placed into the appropriate input queue. The update to the memo table happens when teh processing thread runs, as it is processing batches for an input.

Yes exactly

@icexelloss
Copy link
Contributor

@rtpsw Looks like you added another atomic variable here:
https://github.com/apache/arrow/pull/35874/files#r1218191706

Can you fix that one too?

@icexelloss
Copy link
Contributor

@rtpsw Looks like you added another atomic variable here: https://github.com/apache/arrow/pull/35874/files#r1218191706

Can you fix that one too?

Nvm looks like you fixed this

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rtpsw
Copy link
Contributor Author

rtpsw commented Jun 22, 2023

This makes sense to me. Let me verify I understand:

Before, on InputReceived, we would update the memo for an input based on the new batch.

Now, the only thing that really happens on InputReceived, is the batch is placed into the appropriate input queue. The update to the memo table happens when teh processing thread runs, as it is processing batches for an input.

As @icexelloss said, this is correct. However, there is another difference to note. The pre-PR code invalidates the key hasher and updates the memo-store time given any batch, whereas the post-PR code does so given all but the first batch (after batches_processed_ is incremented). This is fine because the key hasher is initialized as invalid and the memo-store time is initialized to the lowest value.

@icexelloss icexelloss merged commit 1b29d95 into apache:main Jun 22, 2023
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 1b29d958.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Simplify concurrency in as-of-join node
3 participants