Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13222: [C++] Improve type support for case_when #10806

Closed
wants to merge 31 commits into from

Conversation

lidavidm
Copy link
Member

@lidavidm lidavidm commented Jul 26, 2021

This adds support for nearly everything except dictionaries and extension types. Binary types, maps, lists (fixed-size and variable), unions, and structs are supported. The benchmark is also extended to show that performance is not entirely terrible (the first attempt using purely AppendScalar ran at ~50MB/s, this runs at a couple GB/s similar to the other kernel cases). This does leverage AppendScalar for the scalar cases.

Dictionaries are still supported, but will be unpacked. For direct support there's work described in ARROW-13573, but I'd like to validate my approach here before making the changes necessary.

@github-actions
Copy link

@lidavidm lidavidm marked this pull request as draft August 3, 2021 15:33
@lidavidm lidavidm marked this pull request as ready for review August 3, 2021 19:17
@lidavidm lidavidm marked this pull request as draft August 4, 2021 14:57
@lidavidm
Copy link
Member Author

lidavidm commented Aug 4, 2021

Moved back to draft, need to implement full dictionary support as described in #10724 (comment)

@lidavidm lidavidm marked this pull request as ready for review August 5, 2021 14:24
@lidavidm
Copy link
Member Author

lidavidm commented Aug 5, 2021

Moved out of draft - dictionary support will need more work and I want to validate the approach here first. Filed ARROW-13573 to cover that work.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, especially the thorough testing. Here are a couple comments.

@@ -1413,6 +1417,588 @@ struct CaseWhenFunctor<NullType> {
}
};

static Status ExecVarWidthScalarCaseWhen(KernelContext* ctx, const ExecBatch& batch,
Datum* out) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return Result<datum>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's minor but 1) we use the shape of out to determine what to return below and 2) it fits slightly better with the rest of the APIs as they all use the Status(..., Datum* out) pattern.


static Status GetValueAppenders(const DataType& type, ArrayAppenderFunc* array_appender);

struct GetAppenders {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this functionality may be useful elsewhere (and actually, perhaps it can already be reused in other places).

Why not add a virtual method to ArrayBuilder (e.g. virtual AppendArraySlice(...)) instead of keeping this private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating this. I'll refactor this to be part of the builder instead.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments. Thanks a lot for the update!

" but got array of type ", *array.type);
}
return AppendArraySliceUnchecked(array, offset, length);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would not bother with the checked version personally. Just mention in the docstring that the types are supposed to be identical.

const int64_t length) {
return Status::NotImplemented("AppendArraySliceUnchecked for builder for ", *type());
}
// TODO: overloads for arrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either do it just here, or just remove the TODO, IMHO. It's not worth keeping this comment for a trivial two-liner.

}
const uint8_t* validity = array.MayHaveNulls() ? array.buffers[0]->data() : nullptr;
for (int64_t row = offset; row < offset + length; row++) {
RETURN_NOT_OK(Append(!validity || BitUtil::GetBit(validity, array.offset + row)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just do a bulk-append of the validity bitmap here?

@@ -155,6 +156,21 @@ class ARROW_EXPORT DenseUnionBuilder : public BasicUnionBuilder {
return offsets_builder_.Append(offset);
}

Status AppendArraySliceUnchecked(const ArrayData& array, const int64_t offset,
const int64_t length) override {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition can probably moved to builder_union.cc instead?
(same for SparseUnionBuilder)

out_->buffers[1] = buffer_;
// buffer_ is zeroed, but 0 may not be a valid type code
if (type.type_codes()[0] != 0) {
ARROW_ASSIGN_OR_RAISE(out_->buffers[1], AllocateBuffer(buffer_->size(), pool_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length_ rather than buffer_->size()? (can buffer_ be larger because of the children or offsets?)

if (type.type_codes()[0] != 0) {
ARROW_ASSIGN_OR_RAISE(out_->buffers[1], AllocateBuffer(buffer_->size(), pool_));
std::memset(out_->buffers[1]->mutable_data(), type.type_codes()[0],
buffer_->size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length_ here too?

auto values1 = ArrayFromJSON(type, R"(["cDE", null, "degfhi", "efg"])");
auto values2 = ArrayFromJSON(type, R"(["fghijk", "ghi", null, "hi"])");

// CheckScalar("case_when", {MakeStruct({}), values1}, values1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget to uncomment all these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I pushed something half WIP, I'll comment again once things are ready. Thanks for taking a look.

/// \brief Append a range of values from an array.
///
/// The given array must be the same type as the builder.
virtual Status AppendArraySliceUnchecked(const ArrayData& array, int64_t offset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop "Unchecked" from the name, I think, as long as the contract is documented.

@lidavidm
Copy link
Member Author

Hmm, there's a Windows R/MinGW-specific failure :/

@lidavidm
Copy link
Member Author

Alright, well it's crashing on Windows because the std::function introduced into ExecVarWidthArrayCaseWhen contains not the lambda, but some completely random function which stomps all over the stack. That's about as far as I'm going to get today (skipping the call to reserve_data lets the test pass just fine).

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Can you also update the documented input types in the Sphinx docs?

@lidavidm
Copy link
Member Author

Thanks. I updated the docs, I left a note about dictionary types that will be updated in ARROW-13573.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants