Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable #13275

Merged
merged 5 commits into from
Jun 7, 2022

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented May 31, 2022

In fringe cases, users may have Parquet files where deserializing exceeds our default Thrift size limits.

@pitrou pitrou requested a review from emkornfield May 31, 2022 16:20
@@ -126,6 +126,12 @@ class PARQUET_EXPORT ColumnChunkMetaData {
const ApplicationVersion* writer_version = NULLPTR, int16_t row_group_ordinal = -1,
int16_t column_ordinal = -1,
std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the old *MetaData::Make signatures for compatibility, but perhaps that's not important? @emkornfield @lidavidm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just keep it. Maybe put a deprecation on it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to deprecation.

@github-actions
Copy link

@pitrou pitrou force-pushed the ARROW-16546-thrift-limits branch from 8331178 to 24d6ad7 Compare June 2, 2022 15:03
@pitrou pitrou force-pushed the ARROW-16546-thrift-limits branch from 24d6ad7 to c47ddb6 Compare June 7, 2022 07:31
@pitrou pitrou requested a review from kou June 7, 2022 07:33
@pitrou
Copy link
Member Author

pitrou commented Jun 7, 2022

@github-actions crossbow submit -g cpp

@@ -223,14 +223,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
class SerializedPageReader : public PageReader {
public:
SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
Compression::type codec, ::arrow::MemoryPool* pool,
Compression::type codec, const ReaderProperties& properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is properties a movable type? should we pass by value and move?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we probably can.

// set to the actual length of the header.
template <class T>
void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg,
const std::shared_ptr<Decryptor>& decryptor = NULLPTR) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why shared_ptr here? If this is needed for async operations doesn't a copy of the shared_ptr need to be taken? If it isn't pass by normal pointer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is just moving around code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just the original signature in the previous code, but I can change it.


@thrift_string_size_limit.setter
def thrift_string_size_limit(self, size):
if size <= 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these checks be pushed into C++?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to validate Parquet properties values on the C++ side.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick pass through. LGTM, any plans to add a sample file with overrides that would pass/fail to verify end-to-end functionality (sorry if I missed this).

@pitrou
Copy link
Member Author

pitrou commented Jun 7, 2022

any plans to add a sample file with overrides that would pass/fail to verify end-to-end functionality (sorry if I missed this)

The Python tests generate such a file on the fly.

@github-actions
Copy link

github-actions bot commented Jun 7, 2022

Revision: c47ddb6

Submitted crossbow builds: ursacomputing/crossbow @ actions-1775fb24b1

Task Status
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-debian-10-cpp-amd64 Github Actions
test-debian-10-cpp-i386 Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-18.04-cpp Github Actions
test-ubuntu-18.04-cpp-release Github Actions
test-ubuntu-18.04-cpp-static Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-14 Github Actions
test-ubuntu-20.04-cpp-17 Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions

@pitrou pitrou merged commit dd33c75 into apache:master Jun 7, 2022
@pitrou pitrou deleted the ARROW-16546-thrift-limits branch June 7, 2022 11:29
@pitrou
Copy link
Member Author

pitrou commented Jun 7, 2022

Sorry, I missed the CI failure introduced by this PR in the Numpydoc checks.
Quick fix in #13331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants