Files written with Snappy codec cannot be read by pyarrow #3134
Additional note: I ran this on an M1 Mac. I need to re-run on an x86 platform.
I was able to write a SNAPPY-compressed Parquet file from Ubuntu 22.04 x64 and read it back in pyarrow successfully. See below.
Server console script:
Command line (assumes a Python venv with pyarrow installed):
The file I wrote on M1 contains one column for each of our supported types (primitives, String, BigInteger and BigDecimal). I wonder if this could be caused by one of the types.
That seems like a very likely hypothesis. It also might be worth seeing if it only happens with dictionary-encoded columns.
@abaranec Can you add your test code to this ticket?
We've narrowed this down to only breaking for string columns where every value is null in that column.
Other types (BigInteger, DateTime, int/char/long) don't trigger this, and if at least one value is non-null, the bug won't happen. This breaks on both M1 and x86.
While trying to verify that Parquet files written by DH can be read by external tools, I discovered that files written with the SNAPPY codec cannot be. I used the ParquetTableReadWriteTest to generate a Parquet file on disk and then attempted to read that file with pyarrow, which resulted in the following error:
>>> pq.read_table('/Users/abaranec/git/deephaven-core/extensions/parquet/table/io.deephaven.parquet.table.ParquetTableReadWriteTest_root/ParquetTest_smallFlatParquet_test.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2872, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2519, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Corrupt snappy compressed data.