
Files written with Snappy codec cannot be read by pyarrow #3134

Closed
abaranec opened this issue Dec 1, 2022 · 6 comments · Fixed by #3141
Labels: bug (Something isn't working), parquet (Related to the Parquet integration), triage
Milestone: Dec 2022

Comments

@abaranec
Contributor

abaranec commented Dec 1, 2022

While trying to verify that Parquet files written by DH can be read by external tools, I discovered that files written with the SNAPPY codec cannot be. I used ParquetTableReadWriteTest to generate a Parquet file on disk and then attempted to read that file with pyarrow, which resulted in the following error:

>>> pq.read_table('/Users/abaranec/git/deephaven-core/extensions/parquet/table/io.deephaven.parquet.table.ParquetTableReadWriteTest_root/ParquetTest_smallFlatParquet_test.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2872, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/abaranec/pyarrow/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2519, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Corrupt snappy compressed data.

@abaranec added the bug, triage, and parquet labels on Dec 1, 2022
@abaranec added this to the Dec 2022 milestone on Dec 1, 2022
@abaranec
Contributor Author

abaranec commented Dec 1, 2022

Additional note: I ran this on an M1 Mac. I need to re-run on an x86 platform.

@jcferretti
Member

I was able to write a SNAPPY-compressed Parquet file from Ubuntu 22.04 x64 and read it back in pyarrow successfully. See below.


Server console script:

import deephaven.table_factory as tf
t = tf.empty_table(10).update('A = i')
import deephaven.parquet as pq
pq.write(t, '/data/t.parquet', None, None, 'SNAPPY', None)

Command line (assumes a Python venv with pyarrow installed):

cfs@caicai 23:49:17 ~/py
$ python3
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_table('/home/cfs/dh/oss1/deephaven-core/data/t.parquet')
pyarrow.Table
A: int32
----
A: [[0,1,2,3,4,5,6,7,8,9]]
>>> 
(pyarrow) 
cfs@caicai 23:49:29 ~/py
$ 
(pyarrow) 
cfs@caicai 23:49:30 ~/py
$ /l/parquet-cli/1.12.1/bin/parquet-cli cat /home/cfs/dh/oss1/deephaven-core/data/t.parquet
{"A": 0}
{"A": 1}
{"A": 2}
{"A": 3}
{"A": 4}
{"A": 5}
{"A": 6}
{"A": 7}
{"A": 8}
{"A": 9}
(pyarrow) 
cfs@caicai 23:49:38 ~/py
$ /l/parquet-cli/1.12.1/bin/parquet-cli meta /home/cfs/dh/oss1/deephaven-core/data/t.parquet

File path:  /home/cfs/dh/oss1/deephaven-core/data/t.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
  deephaven: {"version":"0.4.0"}
Schema:
message root {
  optional int32 A (INTEGER(32,true));
}


Row group 0:  count: 10  6.20 B records  start: 4  total: 62 B
--------------------------------------------------------------------------------
   type      encodings count     avg size   nulls   min / max
A  INT32     S   _     10        6.20 B     0       

(pyarrow) 

@abaranec
Contributor Author

abaranec commented Dec 2, 2022

The file I wrote on M1 contains one column for each of our supported types (primitives, String, BigInteger, and BigDecimal). I wonder if the failure is specific to one of those types.

@rcaudy
Member

rcaudy commented Dec 2, 2022

> The file I wrote on M1 contains one column for each of our supported types (primitives, String, BigInteger, and BigDecimal). I wonder if the failure is specific to one of those types.

That seems like a very likely hypothesis. It also might be worth seeing if it only happens with dictionary-encoded columns.
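
One way to test both hypotheses from the pyarrow side is to read the file column by column and inspect each column chunk's metadata. This is an illustrative sketch, not from the thread; it assumes the file path from the original report (shortened here) and uses only standard pyarrow.parquet.ParquetFile APIs (schema_arrow, metadata, read(columns=...)):

import pyarrow.parquet as pq

# Path is the file from the original report; adjust as needed.
pf = pq.ParquetFile('ParquetTest_smallFlatParquet_test.parquet')
meta = pf.metadata

# For each column: show its encodings (dictionary-encoded chunks list
# PLAIN_DICTIONARY or RLE_DICTIONARY), then try decoding that column
# alone so the corrupt one can be isolated.
for i, name in enumerate(pf.schema_arrow.names):
    col = meta.row_group(0).column(i)
    try:
        pf.read(columns=[name])
        status = 'OK'
    except OSError as e:
        status = f'FAILED: {e}'
    print(f'{name}: encodings={col.encodings} codec={col.compression} -> {status}')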

@rcaudy
Member

rcaudy commented Dec 2, 2022

@abaranec Can you add your test code to this ticket?

@niloc132
Member

niloc132 commented Dec 2, 2022

We've narrowed this down: it breaks only for String columns in which every value is null.

from deephaven import empty_table
from deephaven.parquet import write
import pyarrow.parquet

t = empty_table(5).update("Name = (String) null")
write(t, "nullString.parquet", compression_codec_name="SNAPPY")
arrow = pyarrow.parquet.read_table('nullString.parquet')  # OSError: Corrupt snappy compressed data.

Other types (BigInteger, DateTime, int/char/long) don't trigger this, and if at least one value is non-null, the bug doesn't happen.

This breaks on both M1 and x86.
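
For contrast, a minimal sketch of the passing case (assuming the same Deephaven session as above; the ternary inside update is ordinary Deephaven query-language syntax, and the output file name oneString.parquet is illustrative):

from deephaven import empty_table
from deephaven.parquet import write
import pyarrow.parquet

# Same String column, but with a single non-null value; per the
# observation above, pyarrow reads this file back without error.
t = empty_table(5).update("Name = i == 0 ? `x` : (String) null")
write(t, "oneString.parquet", compression_codec_name="SNAPPY")
print(pyarrow.parquet.read_table("oneString.parquet"))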
