-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13573: [C++] Support dictionaries natively in case_when #11022
Conversation
9746270
to
7d3aeae
Compare
f6a7a85
to
a93ce99
Compare
One thought: we could have all dictionary types use the variable-width type implementation, meaning we'd always unify dictionaries. This would behave a little more consistently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. I haven't looked fully at the implementation yet.
cpp/src/arrow/array/builder_dict.h
Outdated
@@ -282,6 +294,163 @@ class DictionaryBuilderBase : public ArrayBuilder { | |||
return indices_builder_.AppendEmptyValues(length); | |||
} | |||
|
|||
Status AppendScalar(const Scalar& scalar, int64_t n_repeats) override { | |||
if (!scalar.type->Equals(type())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to do this check every append or should this be left to callers?
cpp/src/arrow/array/builder_dict.h
Outdated
switch (dict_ty.index_type()->id()) { | ||
case Type::UINT8: { | ||
const auto& value = dict.GetView( | ||
internal::checked_cast<const UInt8Scalar&>(*dict_scalar.value.index).value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if dict
has a null at this index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, there should be better testing for nulls in general, I'll amend that.
cpp/src/arrow/array/builder_dict.h
Outdated
array.buffers[0], array.offset + offset, std::min(array.length, length), | ||
[&](int64_t position) { return Append(dict.GetView(values[position])); }, | ||
[&]() { return AppendNull(); }); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to factor this out to avoid repetition? For example:
template <IndexType>
struct SliceAppender {
const IndexType* values;
Status operator()(const ArrayData& array, int64_t offset, int64_t length) {
return VisitBitBlocks(
array.buffers[0], array.offset + offset, length,
[&](int64_t position) {
if (dict.IsNull(values[position])) return AppendNull();
return Append(dict.GetView(values[position]));
},
[&]() { return AppendNull(); }); }
);
}
}
case Type::UINT8:
return SliceAppender{array.GetValues<uint8_t>(1) + offset}(array, offset, length);
// ...
cpp/src/arrow/array/builder_dict.h
Outdated
case Type::UINT8: { | ||
const uint8_t* values = array.GetValues<uint8_t>(1) + offset; | ||
return VisitBitBlocks( | ||
array.buffers[0], array.offset + offset, std::min(array.length, length), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other AppendArraySlice
implementations don't check that length
is in bounds, so std::min
doesn't seem necessary here.
auto values1 = make_list(ArrayFromJSON(int32(), "[0, 2, 2, 3, 4]"), values1_backing); | ||
auto values2 = make_list(ArrayFromJSON(int32(), "[0, 1, 2, 2, 4]"), values2_backing); | ||
|
||
CheckScalarNonRecursive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this calling CheckScalarNonRecursive
and not CheckScalar
? Leave a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scalar variant of the kernel will not produce the same dictionary indices so the values do not compare equal. I'll add a comment to that effect.
auto values1_null = DictArrayFromJSON(type, "[null, null, null, null]", dict1); | ||
auto values2_null = DictArrayFromJSON(type, "[null, null, null, null]", dict2); | ||
auto values1 = DictArrayFromJSON(type, "[0, null, 3, 1]", dict1); | ||
auto values2 = DictArrayFromJSON(type, "[2, 1, null, 0]", dict2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, it looks like the nulls in the indices are placed at the same indices as the nulls in the respective dictionaries.
DictArrayFromJSON(type, "[0, 0, 2, 2]", dict1)); | ||
|
||
// If we can't map values from a dictionary, then raise an error | ||
// Unmappable value is in the else clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: why don't we unify dictionaries instead? It would sound more useful to me. I don't see any reason for the first input to have a particular status, is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had mostly tried to emulate the R/dplyr behavior as closely as possible: #10724 (comment)
But unification is honestly probably easier to implement for us, so I can switch to that instead.
|
||
// ...or optionally, emit null | ||
|
||
// TODO: this is not implemented yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what this TODO is for. Emitting a null when some option is enabled?
@@ -1058,6 +1062,109 @@ void AddFSBinaryIfElseKernel(const std::shared_ptr<IfElseFunction>& scalar_funct | |||
DCHECK_OK(scalar_function->AddKernel(std::move(kernel))); | |||
} | |||
|
|||
// Given a reference dictionary, computes indices to map dictionary values from a | |||
// comparison dictionary to the reference. | |||
class DictionaryRemapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we already have DictionaryUnifier
for this? Or am I misunderstanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for if we don't want unification, however, I think we might want to just unify dictionaries always.
IIRC, what DictionaryUnifier was missing was a way to compute a transposition map without adding new values to the internal memo table.
a93ce99
to
0759290
Compare
Changes:
|
2b7f761
to
891bfb6
Compare
It looks like the RTools 40 test failure/crash is real; I'm going to need to figure out how to replicate this properly. (So far I've had little success with a VM, unfortunately.) |
5484a35
to
4e8d34a
Compare
It looks like there's still 2 Windows failures to look into (a segfault in MinGW/32, which is hopefully more debuggable, and a failure to run one of the examples in RTools35), though the RTools40 crash is no more. |
Hmm... did you make sense of the RTools 3.5 CI failure? I can't find the actual error in the logs :-/ |
I'm setting up my Windows VM again since it expires every few months now :/ but it seems like the dataset example crashed when it was run, looking here: https://github.com/apache/arrow/pull/11022/checks?check_run_id=3589141006#step:12:395 |
Is that example expected to be impacted by this PR? Otherwise, perhaps we should just restart the build... |
I do not expect it to be impacted but it was also failing in the last couple builds. (That said, I think it wasn't failing before I turned off the unity build?) |
And I did finally get the R package built on Windows - |
d171c73
to
2bdee00
Compare
2bdee00
to
15a64a2
Compare
The CI is passing here, for the first time in quite a while. |
else | ||
export ARROW_S3=ON | ||
export ARROW_WITH_RE2=ON | ||
# Without this, some compute functionality segfaults | ||
export CMAKE_UNITY_BUILD=OFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean it segfaults during compilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It segfaults in the tests. I wasn't really able to debug this on Windows; it disappears once you build with debuginfo.
if (index_scalar.is_valid && dict.IsValid(index)) { | ||
const auto& value = dict.GetView(index); | ||
for (int64_t i = 0; i < n_repeats; i++) { | ||
ARROW_RETURN_NOT_OK(Append(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but it sounds like offering a two-step API on DictionaryBuilder would allow for performance improvements:
/// Ensure `value` is in the dict, and return its index, but doesn't append it
Result<int64_t> Encode(c_type value);
/// Append the given dictionary index
Status AppendIndex(int64_t index);
Status AppendIndices(int64_t index, int64_t nrepeats);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed ARROW-14042.
} | ||
EXPECT_OK_AND_ASSIGN(Datum expected, CallFunction(func_name, decoded_args)); | ||
|
||
if (actual.type()->id() == Type::DICTIONARY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it would be nice if the caller actually said whether the output is supposed to be dictionary-encoded or not. Otherwise there could be silent regressions where the output type of a kernel changes from one version to another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean add an options struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckDictionary
could accept an argument saying if the expected output is dictionary-encoded or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see what you mean now - will do.
for (auto index : {"null", "2", "1", "0"}) { | ||
auto scalar = DictScalarFromJSON(type, index, dict); | ||
auto expected_index = ScalarFromJSON(int32(), index); | ||
AssertScalarsEqual(*DictionaryScalar::Make(expected_index, expected_dictionary), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also call scalar->ValidateFull()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you !
This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.) For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements. Closes apache#11022 from lidavidm/arrow-13573 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This supports dictionaries 'natively', that is, dictionaries are no longer always unpacked. (If mixed dictionary and non-dictionary arguments are given, then they will be unpacked.)
For scalar conditions, the output will have the dictionary of whichever input is selected (or no dictionary if the output is null). For array conditions, we unify the dictionaries as we select elements.