-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable #13275
Conversation
@@ -126,6 +126,12 @@ class PARQUET_EXPORT ColumnChunkMetaData { | |||
const ApplicationVersion* writer_version = NULLPTR, int16_t row_group_ordinal = -1, | |||
int16_t column_ordinal = -1, | |||
std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the old *MetaData::Make
signatures for compatibility, but perhaps that's not important? @emkornfield @lidavidm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just keep it. Maybe put a deprecation on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to deprecation.
876f473
to
cd43a69
Compare
b3a5961
to
e440afe
Compare
8331178
to
24d6ad7
Compare
In fringe cases, users may have Parquet files where deserializing exceeds our default Thrift size limits.
24d6ad7
to
c47ddb6
Compare
@github-actions crossbow submit -g cpp |
@@ -223,14 +223,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) { | |||
class SerializedPageReader : public PageReader { | |||
public: | |||
SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows, | |||
Compression::type codec, ::arrow::MemoryPool* pool, | |||
Compression::type codec, const ReaderProperties& properties, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is properties a movable type? should we pass by value and move?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we probably can.
// set to the actual length of the header. | ||
template <class T> | ||
void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg, | ||
const std::shared_ptr<Decryptor>& decryptor = NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why shared_ptr here? If this is needed for async operations doesn't a copy of the shared_ptr need to be taken? If it isn't pass by normal pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is just moving around code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just the original signature in the previous code, but I can change it.
|
||
@thrift_string_size_limit.setter | ||
def thrift_string_size_limit(self, size): | ||
if size <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these checks be pushed into C++?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to validate Parquet properties values on the C++ side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick pass through. LGTM, any plans to add a sample file with overrides that would pass/fail to verify end-to-end functionality (sorry if I missed this).
The Python tests generate such a file on the fly. |
Revision: c47ddb6 Submitted crossbow builds: ursacomputing/crossbow @ actions-1775fb24b1 |
Sorry, I missed the CI failure introduced by this PR in the Numpydoc checks. |
In fringe cases, users may have Parquet files where deserializing exceeds our default Thrift size limits.