-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13222: [C++] Improve type support for case_when #10806
Conversation
Moved back to draft, need to implement full dictionary support as described in #10724 (comment) |
Moved out of draft - dictionary support will need more work and I want to validate the approach here first. Filed ARROW-13573 to cover that work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, especially the thorough testing. Here are a couple comments.
@@ -1413,6 +1417,588 @@ struct CaseWhenFunctor<NullType> { | |||
} | |||
}; | |||
|
|||
static Status ExecVarWidthScalarCaseWhen(KernelContext* ctx, const ExecBatch& batch, | |||
Datum* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return Result<datum>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's minor but 1) we use the shape of out to determine what to return below and 2) it fits slightly better with the rest of the APIs as they all use the Status(..., Datum* out)
pattern.
|
||
static Status GetValueAppenders(const DataType& type, ArrayAppenderFunc* array_appender); | ||
|
||
struct GetAppenders { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... this functionality may be useful elsewhere (and actually, perhaps it can already be reused in other places).
Why not add a virtual method to ArrayBuilder
(e.g. virtual AppendArraySlice(...)
) instead of keeping this private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debating this. I'll refactor this to be part of the builder instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments. Thanks a lot for the update!
cpp/src/arrow/array/builder_base.h
Outdated
" but got array of type ", *array.type); | ||
} | ||
return AppendArraySliceUnchecked(array, offset, length); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I would not bother with the checked version personally. Just mention in the docstring that the types are supposed to be identical.
cpp/src/arrow/array/builder_base.h
Outdated
const int64_t length) { | ||
return Status::NotImplemented("AppendArraySliceUnchecked for builder for ", *type()); | ||
} | ||
// TODO: overloads for arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either do it just here, or just remove the TODO, IMHO. It's not worth keeping this comment for a trivial two-liner.
cpp/src/arrow/array/builder_nested.h
Outdated
} | ||
const uint8_t* validity = array.MayHaveNulls() ? array.buffers[0]->data() : nullptr; | ||
for (int64_t row = offset; row < offset + length; row++) { | ||
RETURN_NOT_OK(Append(!validity || BitUtil::GetBit(validity, array.offset + row))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just do a bulk-append of the validity bitmap here?
cpp/src/arrow/array/builder_union.h
Outdated
@@ -155,6 +156,21 @@ class ARROW_EXPORT DenseUnionBuilder : public BasicUnionBuilder { | |||
return offsets_builder_.Append(offset); | |||
} | |||
|
|||
Status AppendArraySliceUnchecked(const ArrayData& array, const int64_t offset, | |||
const int64_t length) override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition can probably moved to builder_union.cc
instead?
(same for SparseUnionBuilder)
cpp/src/arrow/array/util.cc
Outdated
out_->buffers[1] = buffer_; | ||
// buffer_ is zeroed, but 0 may not be a valid type code | ||
if (type.type_codes()[0] != 0) { | ||
ARROW_ASSIGN_OR_RAISE(out_->buffers[1], AllocateBuffer(buffer_->size(), pool_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length_
rather than buffer_->size()
? (can buffer_
be larger because of the children or offsets?)
cpp/src/arrow/array/util.cc
Outdated
if (type.type_codes()[0] != 0) { | ||
ARROW_ASSIGN_OR_RAISE(out_->buffers[1], AllocateBuffer(buffer_->size(), pool_)); | ||
std::memset(out_->buffers[1]->mutable_data(), type.type_codes()[0], | ||
buffer_->size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length_
here too?
auto values1 = ArrayFromJSON(type, R"(["cDE", null, "degfhi", "efg"])"); | ||
auto values2 = ArrayFromJSON(type, R"(["fghijk", "ghi", null, "hi"])"); | ||
|
||
// CheckScalar("case_when", {MakeStruct({}), values1}, values1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you forget to uncomment all these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I pushed something half WIP, I'll comment again once things are ready. Thanks for taking a look.
…ceUnchecked" This reverts commit fe9ecbf0a35117db7ad955ef345181328633277b.
cpp/src/arrow/array/builder_base.h
Outdated
/// \brief Append a range of values from an array. | ||
/// | ||
/// The given array must be the same type as the builder. | ||
virtual Status AppendArraySliceUnchecked(const ArrayData& array, int64_t offset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can drop "Unchecked" from the name, I think, as long as the contract is documented.
Hmm, there's a Windows R/MinGW-specific failure :/ |
Alright, well it's crashing on Windows because the std::function introduced into ExecVarWidthArrayCaseWhen contains not the lambda, but some completely random function which stomps all over the stack. That's about as far as I'm going to get today (skipping the call to reserve_data lets the test pass just fine). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Can you also update the documented input types in the Sphinx docs?
Thanks. I updated the docs, I left a note about dictionary types that will be updated in ARROW-13573. |
This adds support for nearly everything except dictionaries and extension types. Binary types, maps, lists (fixed-size and variable), unions, and structs are supported. The benchmark is also extended to show that performance is not entirely terrible (the first attempt using purely AppendScalar ran at ~50MB/s, this runs at a couple GB/s similar to the other kernel cases). This does leverage AppendScalar for the scalar cases.
Dictionaries are still supported, but will be unpacked. For direct support there's work described in ARROW-13573, but I'd like to validate my approach here before making the changes necessary.