Commit
Python: Compute parquet stats (#7831)
* Add function to compute parquet file metadata
* Add docstring and an extra parameter to avoid reading the file unnecessarily
* Refactor the statistics computation entirely to use pyarrow metadata

  This commit makes sure to test the metadata computation using both `pyarrow.parquet.ParquetWriter` and `pyarrow.parquet.write_to_dataset`.
* Appease pre-commit hooks
* Fix temporary path
* Make the metrics mode configurable as documented here: https://iceberg.apache.org/docs/latest/configuration/
* Initialize binary serializers only once
* Log arrow not implemented exception
* Fix None comparison expression
* Add map column to test data
* Move pyarrow-specific code to io.pyarrow
* Add type annotation
* Refactor the stats collection using the pyarrow visitor
* Clean redundant code and add warning message to the log
* Address some of the review comments
* Add tests to check that the number of columns found by the statistics collector is correct
* Do not truncate numeric data types
* Verify match of Iceberg types with Parquet physical types
* Fix truncation of upper bounds
* Transform asserts to ValueErrors
* Add review suggestions
* Address simple code style review comments
* Fix potential null write
* Apply function name refactoring
* Move pyarrow statistics tests to a new file
* Disable stats computation for nested types
* Modularize the fill_parquet_file_metadata function
* Allow metrics modes to have extra whitespace but not other trailing characters
* Move upper bound truncation logic to another file
* Be defensive with regard to missing row group statistics
* Add tests for structs
* Remove special treatment of UUIDType
* Rely on parquet column path rather than column order

  This commit adds a visitor to compute a mapping from parquet column path to Iceberg field ID.
* Change mood to imperative to appease linter
* Factor out the logic to obtain the current table schema
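
For readers unfamiliar with the metrics modes and bound truncation mentioned above, the following is a minimal sketch of the idea, not the code from this commit. The function names (`parse_metrics_mode`, `truncate_lower_bound`, `truncate_upper_bound`) are hypothetical, and the accepted mode strings (`none`, `counts`, `full`, `truncate(N)`) are assumed from the Iceberg configuration page linked in the message. It only handles string bounds; numeric types would be left untouched.

```python
import re
from typing import Optional, Tuple

# Assumption: surrounding whitespace is tolerated, any other trailing
# characters are rejected. The pattern is illustrative only.
_MODE_RE = re.compile(r"^\s*(?:(none|counts|full)|truncate\((\d+)\))\s*$", re.IGNORECASE)


def parse_metrics_mode(mode: str) -> Tuple[str, Optional[int]]:
    """Parse a metrics mode string into (mode_name, truncate_length)."""
    match = _MODE_RE.match(mode)
    if match is None:
        raise ValueError(f"Unsupported metrics mode: {mode!r}")
    if match.group(2) is not None:
        length = int(match.group(2))
        if length <= 0:
            raise ValueError(f"Truncate length must be positive: {mode!r}")
        return "truncate", length
    return match.group(1).lower(), None


def truncate_lower_bound(value: str, length: int) -> str:
    """A prefix always sorts at or below the full value, so slicing is enough."""
    return value[:length]


def truncate_upper_bound(value: str, length: int) -> Optional[str]:
    """Return a string of at most `length` characters that sorts at or above `value`.

    A plain prefix would sort below the original, so the last character that
    can be incremented is bumped by one code point. Returns None when no
    valid upper bound of that length exists.
    """
    if len(value) <= length:
        return value
    prefix = value[:length]
    for i in range(len(prefix) - 1, -1, -1):
        code = ord(prefix[i])
        if code < 0x10FFFF:
            nxt = code + 1
            if 0xD800 <= nxt <= 0xDFFF:
                nxt = 0xE000  # skip the surrogate range, which is not valid text
            return prefix[:i] + chr(nxt)
    return None
```

With these helpers, `parse_metrics_mode(" truncate(16) ")` yields `("truncate", 16)`, while `"truncate(16)x"` raises a `ValueError`. Truncating `"iceberg"` to three characters gives the lower bound `"ice"` and the upper bound `"icf"`.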