Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace RawPtrBox with ScalarBuffer, reduce unsafe usage (#1811) #1825

Closed
wants to merge 2 commits into from

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 9, 2022

Fix value_offsets for empty variable length arrays (#1824)

Which issue does this PR close?

Closes #1811
Closes #1824

Rationale for this change

The previous logic was more unsafe, and in fact implementing this PR found a soundness bug in the form of #1824, so already got some ROI 🎉

What changes are included in this PR?

Removes RawPtrBox and changes to using ScalarBuffer

Are there any user-facing changes?

Yes, value_offsets will now always return an empty slice for zero-length lists

Fix value_offsets for empty variable length arrays (apache#1824)
@tustvold tustvold added the api-change Changes to the arrow API label Jun 9, 2022
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 9, 2022
self.value_data.as_ptr().offset(start.to_isize().unwrap()),
(end - start).to_usize().unwrap(),
)
let start = self.value_offsets.get_unchecked(i).to_isize().unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this to_isize() dance, really I just want ArrowNativeType to provide AsPrimitive so that we can do an unchecked cast. Will file a follow up PR

@@ -940,7 +938,7 @@ mod tests {
}

#[test]
#[should_panic(expected = "index out of bounds: the len is 10 but the index is 11")]
#[should_panic(expected = "index out of bounds: the len is 9 but the index is 10")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index changes because we check the length of the array, not the end offset first

@codecov-commenter
Copy link

Codecov Report

Merging #1825 (7a40745) into master (8f1fd12) will decrease coverage by 0.00%.
The diff coverage is 94.11%.

❗ Current head 7a40745 differs from pull request most recent head c5ba200. Consider uploading reports for the commit c5ba200 to get more accurate results

@@            Coverage Diff             @@
##           master    #1825      +/-   ##
==========================================
- Coverage   83.45%   83.44%   -0.01%     
==========================================
  Files         200      199       -1     
  Lines       56697    56685      -12     
==========================================
- Hits        47315    47301      -14     
- Misses       9382     9384       +2     
Impacted Files Coverage Δ
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/buffer/scalar.rs 93.47% <ø> (ø)
arrow/src/array/array_map.rs 85.18% <90.90%> (+0.36%) ⬆️
arrow/src/array/array_list.rs 96.16% <92.30%> (+<0.01%) ⬆️
arrow/src/array/array_binary.rs 93.13% <93.75%> (-0.18%) ⬇️
arrow/src/array/array_string.rs 97.41% <94.73%> (-0.32%) ⬇️
arrow/src/array/array.rs 89.67% <100.00%> (ø)
arrow/src/array/array_boolean.rs 93.12% <100.00%> (-0.06%) ⬇️
arrow/src/array/array_primitive.rs 95.03% <100.00%> (-0.04%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f1fd12...c5ba200. Read the comment docs.

@alamb alamb changed the title Replace RawPtrBox with ScalarBuffer (#1811) Replace RawPtrBox with ScalarBuffer, reduce unsafe usage (#1811) Jun 9, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @tustvold -- thank you.

The type cast encapsulation of ScalarBuffer seems just 👨‍🍳 👌 and I find this code much easier to follow than the previous calculations involving RawPtrBox

I think it is worth running some benchmarks here too to make sure this doesn't accidentally make performance worse. I wouldn't expect any change, but I do think it would be worth double checking.

cc @sunchao and @nevi-me

@@ -64,67 +64,40 @@ impl<OffsetSize: OffsetSizeTrait> GenericBinaryArray<OffsetSize> {

/// Returns a clone of the value data buffer
pub fn value_data(&self) -> Buffer {
self.data.buffers()[1].clone()
self.value_data.clone()
}

/// Returns the offset values in the offsets buffer
#[inline]
pub fn value_offsets(&self) -> &[OffsetSize] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@@ -39,8 +39,8 @@ use crate::{buffer::MutableBuffer, datatypes::DataType};
/// binary data.
pub struct GenericBinaryArray<OffsetSize: OffsetSizeTrait> {
data: ArrayData,
value_offsets: RawPtrBox<OffsetSize>,
value_data: RawPtrBox<u8>,
value_offsets: ScalarBuffer<OffsetSize>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if calling ScalarBuffer something like TypedBuffer would make its purpose slightly clearer 🤔

let start = self.value_offsets.get_unchecked(i).to_isize().unwrap();
let end = self.value_offsets.get_unchecked(i + 1).to_isize().unwrap();

// buffer bounds/offset is ensured by the value_offset invariants
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and to be clear, this method is unsafe so getting unchecked values seems like it would be ok...

@@ -573,6 +566,18 @@ mod tests {
.expect("All null array has valid array data");
}

#[test]
fn test_string_array_empty_offsets() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -1,67 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 👍

@tustvold
Copy link
Contributor Author

tustvold commented Jun 9, 2022

Benchmarks show it making a fairly non-trivial difference to the following kernels:

  • like_utf8
  • length
  • bit_length
  • divide

I need to look a bit more into what is going on here

arrow-benches.txt

@tustvold tustvold marked this pull request as draft June 9, 2022 14:21
@jhorstmann
Copy link
Contributor

Looks good, nice improvement. Regarding the benchmarks, maybe the Deref for ScalarBuffer needs an explicit #[inline] annotation. Everything else in the hot methods, including the to_isize, should already get inlined, maybe the compiler decides the method is over some size threshold.

@tustvold
Copy link
Contributor Author

Sadly adding inline didn't seem to help. It is probably something trivial but I need some focus time to look into it.

@tustvold
Copy link
Contributor Author

tustvold commented Oct 5, 2022

I may revisit this later, but it isn't a priority right now

@@ -100,17 +92,15 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
/// caller must ensure that the passed in offset is less than the array len()
#[inline]
pub unsafe fn value_unchecked(&self, i: usize) -> T::Native {
let offset = i + self.offset();
*self.raw_values.as_ptr().add(offset)
*self.raw_values.get_unchecked(i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value_unchecked seems more costly than before? Previously it simply takes value from the ptr (with an offset).

But get_unchecked on ScalarBuffer will create a slice first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice creation is unchecked, so they should compile down to the same thing. Whether they do though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid Value Offsets Slice For Empty Variable Length Arrays Replace RawPtrBox with Safe Abstraction
5 participants