Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43768: [C++] Fix the case when boolean_{any|all} meets constant input with length in Acero #43799

Merged
merged 5 commits into from
Sep 2, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 23, 2024

Rationale for this change

See #43768

What changes are included in this PR?

Fix the case when boolean_{any|all} meets constant input with length in Acero

Are these changes tested?

Yes

Are there any user-facing changes?

no

Copy link

⚠️ GitHub issue #43768 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 23, 2024

@zanmato1984 @pitrou @felipecrv would you mind take a look?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 23, 2024
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
this->all = !scalar.is_valid || checked_cast<const BooleanScalar&>(scalar).value;
this->has_nulls |= !scalar.is_valid;
this->count += scalar.is_valid * batch.length;
this->all &= !scalar.is_valid || checked_cast<const BooleanScalar&>(scalar).value;
return Status::OK();
}
const ArraySpan& data = batch[0].array;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't change has_null = to has_null |= as what you did for any?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And my question is why make it |= instead of simply =?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is in #43799 (comment)

Copy link
Collaborator

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

this->has_nulls = !scalar.is_valid;
this->any = scalar.is_valid && checked_cast<const BooleanScalar&>(scalar).value;
this->count += scalar.is_valid;
this->has_nulls |= !scalar.is_valid;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? If the argument is a scalar, isn't it as if the whole batch was made from repeated values of that scalar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If previous is has_null, it always has_null ( that's why using |= )
  2. !scalar.is_valid means has-null, so it would set by this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @mapleFU is right here. this->has_nulls is meant to represent whether any null was encountered, so it should never change from true to false.

this->any = scalar.is_valid && checked_cast<const BooleanScalar&>(scalar).value;
this->count += scalar.is_valid;
this->has_nulls |= !scalar.is_valid;
this->any |= scalar.is_valid && checked_cast<const BooleanScalar&>(scalar).value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the |? Isn't that necessary only in MergeFrom?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this->count += scalar.is_valid;
this->has_nulls |= !scalar.is_valid;
this->any |= scalar.is_valid && checked_cast<const BooleanScalar&>(scalar).value;
this->count += scalar.is_valid * batch.length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one I understand.

this->all = !scalar.is_valid || checked_cast<const BooleanScalar&>(scalar).value;
this->has_nulls |= !scalar.is_valid;
this->count += scalar.is_valid * batch.length;
this->all &= !scalar.is_valid || checked_cast<const BooleanScalar&>(scalar).value;
return Status::OK();
}
const ArraySpan& data = batch[0].array;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And my question is why make it |= instead of simply =?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 24, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Aug 24, 2024

RETURN_NOT_OK(kernels_[i]->consume(&batch_ctx, column_batch));

@felipecrv The problem is here, the ScalarAggregateExecutor would do this well, but acero would continue using consume rather than handle each batch separately

@mapleFU mapleFU requested a review from felipecrv August 26, 2024 02:59
@felipecrv
Copy link
Contributor

RETURN_NOT_OK(kernels_[i]->consume(&batch_ctx, column_batch));

@felipecrv The problem is here, the ScalarAggregateExecutor would do this well, but acero would continue using consume rather than handle each batch separately

Isn't that a problem in itself?

@mapleFU
Copy link
Member Author

mapleFU commented Aug 26, 2024

Isn't that a problem in itself?

IMO we don't declare that SCALAR_AGGREGATE is for "each call consume once, and merge into a final state". So we'd better not expect that?

/// \brief Kernel data structure for implementations of
/// ScalarAggregateFunction. The four necessary components of an aggregation
/// kernel are the init, consume, merge, and finalize functions.
///
/// * init: creates a new KernelState for a kernel.
/// * consume: processes an ExecSpan and updates the KernelState found in the
///   KernelContext.
/// * merge: combines one KernelState with another.
/// * finalize: produces the end result of the aggregation using the
///   KernelState in the KernelContext.
struct ARROW_EXPORT ScalarAggregateKernel : public Kernel {

The ScalarAggregateKernel says "merge" and "consume" does their own thing, but not saying that "every consume would be call once and merge to a function", which is the logic in ScalarAggregateExecutor

I would also glad to put the acero with other method but I think modify Boolean{Any|All} would be ok and no more cost?

@felipecrv

@felipecrv
Copy link
Contributor

Makes sense @mapleFU. I'm learning. :)

Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. But what you think @bkietz?

@mapleFU
Copy link
Member Author

mapleFU commented Aug 29, 2024

@pitrou @bkietz would you mind take a look?

Comment on lines 218 to 219
std::vector<ExecBatch> batches{
ExecBatchFromJSON({int32()}, "[[42], [42], [42], [42]]")};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to test more cases? Something like:

  struct AnyAllCase {
    std::string batches_json;
    std::string expected_json;
  };
  std::vector<AnyAllCase> cases {
    {"[[42], []]", "[[false]]"},
    {"[[42, 42], []]", "[[true]]"},
    {"[[42], [42]]", "[[true]]"},
    {"[[42], [42, 42]]", "[[true]]"},
    {"[[42], []]", "[[false]]"},
    {"[[42, 42], []]", "[[true]]"},
    {"[[42], [42]]", "[[true]]"},
    {"[[42], [42, 42]]", "[[true]]"},
  };

Also vary the skip_nulls and the boolean literal.

Copy link
Member Author

@mapleFU mapleFU Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hard to build test like this since the Literal column would only built by Projection here, but I've tried

Project cannot make "different" literal in this scenerio

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Would you mind take a quick look that whether the testing is ok?

@pitrou pitrou changed the title GH-43768: Fix the case when boolean_{any|all} meets constant input with length in Acero GH-43768: [C++] Fix the case when boolean_{any|all} meets constant input with length in Acero Sep 2, 2024
@mapleFU mapleFU merged commit 9cafbb2 into apache:main Sep 2, 2024
38 of 41 checks passed
@mapleFU mapleFU removed the awaiting change review Awaiting change review label Sep 2, 2024
@mapleFU mapleFU deleted the boolean-agg-fix branch September 2, 2024 15:42
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9cafbb2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU added a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
…ant input with length in Acero (apache#43799)

### Rationale for this change

See apache#43768

### What changes are included in this PR?

Fix the case when boolean_{any|all} meets constant input with length in Acero

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#43768

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 added a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
…ant input with length in Acero (apache#43799)

### Rationale for this change

See apache#43768

### What changes are included in this PR?

Fix the case when boolean_{any|all} meets constant input with length in Acero

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#43768

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…ant input with length in Acero (apache#43799)

### Rationale for this change

See apache#43768

### What changes are included in this PR?

Fix the case when boolean_{any|all} meets constant input with length in Acero

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#43768

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants