
Some types don't seem to round-trip cleanly through parquet and pandas #3151

niloc132 opened this issue Dec 6, 2022 · 3 comments

niloc132 commented Dec 6, 2022

Description

It is possible that this is not a valid bug, but it is worth investigating. Three expressions in PR #3141 were left commented out because they don't round-trip cleanly through pandas and parquet, likely due to how Python handles these types rather than specifics of Parquet or Deephaven's internal Parquet implementation.

This issue is to follow up on what is going wrong here, to confirm that these behaviors are acceptable, or to document how to work around them.

Basic setup:

import pandas
from deephaven.pandas import to_pandas, to_table

from deephaven import empty_table, dtypes, new_table
from deephaven.column import InputColumn
from deephaven.parquet import write, batch_write, read, delete, ColumnInstruction
from deephaven.table import Table

dh_table = empty_table(20).update(formulas=[
    # add columns here
])

# Deephaven -> parquet -> pandas
write(dh_table, "data_from_dh.parquet")
dataframe = pandas.read_parquet('data_from_dh.parquet', use_nullable_dtypes=True)

# pandas -> Deephaven, converted directly in memory
result_table_from_pandas = to_table(dataframe)

# pandas -> parquet -> Deephaven
dataframe.to_parquet('data_from_pandas.parquet')
result_table_from_parquet = read('data_from_pandas.parquet')
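For reference, the three expressions under investigation (quoted below from this issue) can be plugged into the empty formulas list above to reproduce:

# The three expressions that fail to round-trip cleanly; substitute these
# for the "# add columns here" placeholder in the setup above.
dh_table = empty_table(20).update(formulas=[
    "someTime = DateTime.now() + i",
    "nullBigDecColumn = (java.math.BigDecimal)null",
    "nullBigIntColumn = (java.math.BigInteger)null",
])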

"someTime = DateTime.now() + i"

For this expression, pyarrow raises an error indicating that precision would be lost, in a way that doesn't make sense to me:

java.lang.RuntimeException: Error in Python interpreter:

Type: <class 'pyarrow.lib.ArrowInvalid'>
Value: Casting from timestamp[ns, tz=UTC] to timestamp[us] would lose data: 1670337636914439001
Line: 100
Namespace: pyarrow.lib.check_status
File: pyarrow/error.pxi
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/util/_decorators.py", line 207, in wrapper
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/core/frame.py", line 2685, in to_parquet
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/io/parquet.py", line 423, in to_parquet
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/io/parquet.py", line 195, in write
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2985, in write_table
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 1054, in write_table
  File "pyarrow/_parquet.pyx", line 1772, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

The error acknowledges that the data is nanosecond-resolution, but for some reason pyarrow is attempting to write it to disk as microseconds?
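One plausible explanation (an assumption, not verified against this environment): by default pyarrow targets an older Parquet format version that has no nanosecond timestamp type, so nanosecond data must be cast down to microseconds, and the cast refuses to silently truncate. If that is the cause, either of the following sketches should avoid the error, assuming the installed pandas forwards keyword arguments through to pyarrow.parquet.write_table:

# Sketch 1: target Parquet format version 2.6, which supports
# nanosecond timestamps, so no downcast is needed.
dataframe.to_parquet('data_from_pandas.parquet', version='2.6')

# Sketch 2: explicitly coerce to microseconds and accept the (lossy)
# truncation of sub-microsecond digits.
dataframe.to_parquet(
    'data_from_pandas.parquet',
    coerce_timestamps='us',
    allow_truncated_timestamps=True,
)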

"nullBigDecColumn = (java.math.BigDecimal)null" and "nullBigIntColumn = (java.math.BigInteger)null"

For both of these, when passed DH -> parquet -> pandas -> DH, the resulting column is a primitive int rather than the Java type, likely because there is no data: with no values to inspect, the inferred decimal scale ends up as 0.

Versions

  • Deephaven: 0.18+
  • OS: Linux
  • Browser: N/A
  • Docker: N/A
@niloc132 niloc132 added the bug, triage, python, parquet, and python-server-side labels on Dec 6, 2022

rcaudy commented Dec 6, 2022

My expectation is that DH -> parquet -> DH should be "perfect", meaning the result data should match the input data.
I would not expect the same level of fidelity when passing through pandas; pandas will not recognize DH metadata and consequently won't be able to reconstruct some of our fine-grained type information.

That said, there are two avenues we should explore here:

  1. Are we expressing our BigIntegers and BigDecimals in the best way when written to Parquet? We recently changed to use decimal primitives with this kind of use-case in mind, so hopefully the answer is an unqualified "yes".
  2. Is the pandas->Deephaven conversion step losing precision that we should be able to keep? If all the values are null, I would say we're not losing anything. It may be worth testing a more diverse range of values, along the lines of the sketch below.
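A minimal sketch of such a test, reusing the repro setup from the issue description; these formulas are hypothetical examples for exercising precision/scale inference, not expressions taken from PR #3141:

# Mix of nulls and non-null BigDecimal/BigInteger values at a fixed scale,
# so the writer's inferred precision and scale can be checked on read-back.
dh_table = empty_table(20).update(formulas=[
    "bigDecColumn = i % 5 == 0 ? null : java.math.BigDecimal.valueOf(ii, 2)",
    "bigIntColumn = i % 5 == 0 ? null : java.math.BigInteger.valueOf(ii * 512)",
])
write(dh_table, "data_from_dh.parquet")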


rcaudy commented Feb 27, 2023

I think we may be able to improve this once #3455 is done.

@rcaudy rcaudy added this to the July 2023 milestone on Jul 21, 2023

malhotrashivam commented Aug 8, 2023

Comment from ParquetTableReadWriteTest.java where we have special handling for testing BigDecimal values:

Encoding bigDecimal is tricky -- the writer will try to pick the precision and scale automatically. Because of that, tableTools.assertTableEquals will fail: even though the numbers are identical, the representation may not be, so we have to coerce the expected values to the same precision and scale. We know how it should be doing it, so we can use the same pattern of encoding/decoding with the codec.

We can replicate similar handling in Python testing code as well.
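A rough sketch of what that could look like on the Python side, assuming the expected values are available as decimal.Decimal objects; the helper name and comparison flow here are hypothetical:

from decimal import Decimal

def coerce_scale(value, scale):
    # Rescale a Decimal so its representation matches the precision/scale
    # the Parquet writer chose; None (null) passes through unchanged.
    if value is None:
        return None
    return value.quantize(Decimal(1).scaleb(-scale))

expected = [Decimal("1.5"), None, Decimal("2.25")]
# With scale=2, 1.5 becomes 1.50, so representations compare equal
# even though the writer re-encoded the values.
coerced = [coerce_scale(v, 2) for v in expected]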
