
Files written with Snappy codec cannot be read by pyarrow #3134

Closed
abaranec opened this issue Dec 1, 2022 · 6 comments · Fixed by #3141
Labels: bug (Something isn't working), parquet (Related to the Parquet integration), triage
Milestone: Dec 2022

Comments

@abaranec
Contributor

abaranec commented Dec 1, 2022

While trying to verify that Parquet files written by DH can be read by external tools, I discovered that files written with the SNAPPY codec cannot be. I used ParquetTableReadWriteTest to generate a Parquet file on disk and then attempted to read that file with pyarrow, which resulted in the following error:

>>> pq.read_table('/Users/abaranec/git/deephaven-core/extensions/parquet/table/io.deephaven.parquet.table.ParquetTableReadWriteTest_root/ParquetTest_smallFlatParquet_test.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2872, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2519, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Corrupt snappy compressed data.

@abaranec added the bug, triage, and parquet labels on Dec 1, 2022
@abaranec added this to the Dec 2022 milestone on Dec 1, 2022
@abaranec
Contributor Author

abaranec commented Dec 1, 2022

Additional note: I ran this on an M1 Mac. I need to re-run on an x86 platform.

@jcferretti
Member

I was able to write a SNAPPY-compressed Parquet file from Ubuntu 22.04 x64 and read it back in pyarrow successfully. See below.


Server console script:

import deephaven.table_factory as tf
t = tf.empty_table(10).update('A = i')
import deephaven.parquet as pq
pq.write(t, '/data/t.parquet', None, None, 'SNAPPY', None)

Command line (assumes a Python venv with pyarrow installed):

cfs@caicai 23:49:17 ~/py
$ python3
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_table('/home/cfs/dh/oss1/deephaven-core/data/t.parquet')
pyarrow.Table
A: int32
----
A: [[0,1,2,3,4,5,6,7,8,9]]
>>> 
(pyarrow) 
cfs@caicai 23:49:29 ~/py
$ 
(pyarrow) 
cfs@caicai 23:49:30 ~/py
$ /l/parquet-cli/1.12.1/bin/parquet-cli cat /home/cfs/dh/oss1/deephaven-core/data/t.parquet
{"A": 0}
{"A": 1}
{"A": 2}
{"A": 3}
{"A": 4}
{"A": 5}
{"A": 6}
{"A": 7}
{"A": 8}
{"A": 9}
(pyarrow) 
cfs@caicai 23:49:38 ~/py
$ /l/parquet-cli/1.12.1/bin/parquet-cli meta /home/cfs/dh/oss1/deephaven-core/data/t.parquet

File path:  /home/cfs/dh/oss1/deephaven-core/data/t.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
  deephaven: {"version":"0.4.0"}
Schema:
message root {
  optional int32 A (INTEGER(32,true));
}


Row group 0:  count: 10  6.20 B records  start: 4  total: 62 B
--------------------------------------------------------------------------------
   type      encodings count     avg size   nulls   min / max
A  INT32     S   _     10        6.20 B     0       

(pyarrow) 

@abaranec
Contributor Author

abaranec commented Dec 2, 2022

The file I wrote on M1 contains one column for each of our supported types (primitives, String, BigInteger, and BigDecimal). I wonder if the failure is specific to one of those types.

@rcaudy
Member

rcaudy commented Dec 2, 2022

> The file I wrote on M1 contains one column for each of our supported types (primitives, String, BigInteger, and BigDecimal). I wonder if the failure is specific to one of those types.

That seems like a very likely hypothesis. It also might be worth seeing if it only happens with dictionary-encoded columns.
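
One way to test both hypotheses from the pyarrow side is to read the file column by column and inspect each column chunk's metadata. This is an illustrative sketch, not from the thread; it assumes the file path from the original report (shortened here) and uses only standard pyarrow.parquet.ParquetFile APIs (schema_arrow, metadata, read(columns=...)):

import pyarrow.parquet as pq

# Path is the file from the original report; adjust as needed.
pf = pq.ParquetFile('ParquetTest_smallFlatParquet_test.parquet')
meta = pf.metadata

# For each column: show its encodings (dictionary-encoded chunks list
# PLAIN_DICTIONARY or RLE_DICTIONARY), then try decoding that column
# alone so the corrupt one can be isolated.
for i, name in enumerate(pf.schema_arrow.names):
    col = meta.row_group(0).column(i)
    try:
        pf.read(columns=[name])
        status = 'OK'
    except OSError as e:
        status = f'FAILED: {e}'
    print(f'{name}: encodings={col.encodings} codec={col.compression} -> {status}')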

@rcaudy
Member

rcaudy commented Dec 2, 2022

@abaranec Can you add your test code to this ticket?

@niloc132
Member

niloc132 commented Dec 2, 2022

We've narrowed this down: it breaks only for String columns in which every value is null.

from deephaven import empty_table
from deephaven.parquet import write
import pyarrow.parquet

t = empty_table(5).update("Name = (String) null")
write(t, "nullString.parquet", compression_codec_name="SNAPPY")
arrow = pyarrow.parquet.read_table('nullString.parquet')  # OSError: Corrupt snappy compressed data.

Other types (BigInteger, DateTime, int/char/long) don't trigger this, and if at least one value is non-null, the bug doesn't happen.

This breaks on both M1 and x86.
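
For contrast, a minimal sketch of the passing case (assuming the same Deephaven session as above; the ternary inside update is ordinary Deephaven query-language syntax, and the output file name oneString.parquet is illustrative):

from deephaven import empty_table
from deephaven.parquet import write
import pyarrow.parquet

# Same String column, but with a single non-null value; per the
# observation above, pyarrow reads this file back without error.
t = empty_table(5).update("Name = i == 0 ? `x` : (String) null")
write(t, "oneString.parquet", compression_codec_name="SNAPPY")
print(pyarrow.parquet.read_table("oneString.parquet"))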
