Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsafe improvements: arrow/{array_reader,buffer}. #6025

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

veluca93
Copy link
Contributor

@veluca93 veluca93 commented Jul 8, 2024

This is a second step towards fixing #6020.

@tustvold
Copy link
Contributor

tustvold commented Jul 9, 2024

FWIW #4261 may be related here, which attempted to move these to less unsafe APIs. IIRC it regressed performance significantly and there wasn't any user-facing unsoundness so I just abandoned it

@veluca93
Copy link
Contributor Author

veluca93 commented Jul 9, 2024

FWIW #4261 may be related here, which attempted to move these to less unsafe APIs. IIRC it regressed performance significantly and there wasn't any user-facing unsoundness so I just abandoned it

I am not particularly concerned with the usage of the APIs being unsound (except in the cases where I explicitly mentioned it :-)), most of the FIXMEs are just a "it is probably OK, but I can't follow this method well enough to figure out why and it's probably best for whoever wrote it to document why it is OK"

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @veluca93 -- this looks like a nice improvement. I'll try and find some time to study the code more deeply tomorrow to help describe the safety arguments a bit more

@@ -179,6 +179,11 @@ where
.add_buffer(record_data)
.null_bit_buffer(self.record_reader.consume_bitmap_buffer());

// SAFETY: Assuming that the RecordReader's null buffer is sufficiently sized for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the safety argument goes something like the RecordReader writes only valid values for type T into record_data and that the record_reader's null buffer was correctly created.

@@ -173,6 +173,7 @@ impl ArrayReader for StructArrayReader {
array_data_builder.null_bit_buffer(Some(bitmap_builder.into()));
}

// SAFETY: FIXME: document why this call is safe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument basically goes something like the code in this module correctly constructed the struct array values and validated the input during the construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's what I assumed, but the method was a bit too long and I am not familiar enough with the library to really figure out the details here...

@@ -162,6 +162,10 @@ impl<K: ArrowNativeType + Ord, V: OffsetSizeTrait> DictionaryBuffer<K, V> {

let data = match cfg!(debug_assertions) {
true => builder.build().unwrap(),
// SAFETY: FIXME: this is unsound. data_type is passed by the caller without
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will review this more carefully -- I think this code is still sound as the only callers of data_type pass in the correct type.

However, maybe we could double check the data_type with an assert

Copy link
Contributor Author

@veluca93 veluca93 Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking data_type SGTM.

Technically speaking, relying on callers doing "the right thing" is unsound (which is not the same as "exhibits undefined behaviour"): unsafe code is sound if no safe code could cause UB, regardless of current usage.

Of course, if this is not exposed to the user, it is less important to fix, but would still be nice IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this PR is still on my list to help clean, but I am struggling to find time.

Basically i plan to do some more code investigation to improve the comments here.

If there are any specific API changes / proposal to improve safety I think we should file a separate ticket to discuss them. I will do so if I find one during analysis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, take your time! I find with these things speediness should not trump thoroughness :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @veluca93 -- I wonder if you have had any more time to do some more code investigation? If you think this is actually unsound I will try and find more time to review this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a lot of time - I've been busy and I am at a programming competition this week - but I can confirm that the code as written is unsound (even if the unsoundness cannot trigger undefined behaviour): validating the data type, or marking the function as unsafe and writing down in the caller why the data type is correct would fix the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb - any updates on when you might be able to dig a bit more into the safety invariants of this code? :-) unfortunately I don't think I am familiar enough with the codebase to effectively look more into it myself...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am likewise not familar with this part of the code -- and I haven't been able to find time to spend studying it in detail.

One of the nice things about open source code is that it is all available for everyone to look -- I welcome additional thoughts / reviews / assistance on this item

FRom my perspective, this ticket is current a reasearch item as it is a seemingly theoretical problem. Thus it will likely will remain a relatively low priority for me personally unless someone can cook up a rust example showing actual unsafe/unsound behavior resulting

thank you again for the help. It would be great if you could find time to analyze the code or make an example showing a problem

parquet/src/arrow/buffer/offset_buffer.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants