
Some types don't seem to round-trip cleanly through parquet and pandas #3151

niloc132 opened this issue Dec 6, 2022 · 3 comments

niloc132 commented Dec 6, 2022

Description

It is possible that this is not a valid bug, but it is worth investigating. Three expressions in PR #3141 were left commented out because they don't round-trip cleanly through pandas and parquet, likely due to how Python handles these types rather than specifics of Parquet or Deephaven's internal Parquet implementation.

This issue is to follow up on what is going wrong here, to confirm that these behaviors are acceptable, or to document how to work around them.

Basic setup:

import pandas
from deephaven.pandas import to_pandas, to_table

from deephaven import empty_table, dtypes, new_table
from deephaven.column import InputColumn
from deephaven.parquet import write, batch_write, read, delete, ColumnInstruction
from deephaven.table import Table

dh_table = empty_table(20).update(formulas=[
    # add columns here
])

# Deephaven -> parquet -> pandas
write(dh_table, "data_from_dh.parquet")
dataframe = pandas.read_parquet('data_from_dh.parquet', use_nullable_dtypes=True)

# pandas -> Deephaven, converted directly in memory
result_table_from_pandas = to_table(dataframe)

# pandas -> parquet -> Deephaven
dataframe.to_parquet('data_from_pandas.parquet')
result_table_from_parquet = read('data_from_pandas.parquet')
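For reference, the three expressions under investigation (quoted below from this issue) can be plugged into the empty formulas list above to reproduce:

# The three expressions that fail to round-trip cleanly; substitute these
# for the "# add columns here" placeholder in the setup above.
dh_table = empty_table(20).update(formulas=[
    "someTime = DateTime.now() + i",
    "nullBigDecColumn = (java.math.BigDecimal)null",
    "nullBigIntColumn = (java.math.BigInteger)null",
])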

"someTime = DateTime.now() + i"

For this expression, pyarrow raises an error indicating that precision would be lost, in a way that doesn't make sense to me:

java.lang.RuntimeException: Error in Python interpreter:

Type: <class 'pyarrow.lib.ArrowInvalid'>
Value: Casting from timestamp[ns, tz=UTC] to timestamp[us] would lose data: 1670337636914439001
Line: 100
Namespace: pyarrow.lib.check_status
File: pyarrow/error.pxi
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/util/_decorators.py", line 207, in wrapper
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/core/frame.py", line 2685, in to_parquet
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/io/parquet.py", line 423, in to_parquet
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pandas/io/parquet.py", line 195, in write
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 2985, in write_table
  File "/home/colin/.pyenv/versions/3.7.12/lib/python3.7/site-packages/pyarrow/parquet/core.py", line 1054, in write_table
  File "pyarrow/_parquet.pyx", line 1772, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

The error acknowledges that the data is nanosecond-resolution, but for some reason pyarrow is attempting to write it to disk as microseconds?
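One plausible explanation (an assumption, not verified against this environment): by default pyarrow targets an older Parquet format version that has no nanosecond timestamp type, so nanosecond data must be cast down to microseconds, and the cast refuses to silently truncate. If that is the cause, either of the following sketches should avoid the error, assuming the installed pandas forwards keyword arguments through to pyarrow.parquet.write_table:

# Sketch 1: target Parquet format version 2.6, which supports
# nanosecond timestamps, so no downcast is needed.
dataframe.to_parquet('data_from_pandas.parquet', version='2.6')

# Sketch 2: explicitly coerce to microseconds and accept the (lossy)
# truncation of sub-microsecond digits.
dataframe.to_parquet(
    'data_from_pandas.parquet',
    coerce_timestamps='us',
    allow_truncated_timestamps=True,
)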

"nullBigDecColumn = (java.math.BigDecimal)null" and "nullBigIntColumn = (java.math.BigInteger)null"

For both of these, when passed DH -> parquet -> pandas -> DH, the resulting column is a primitive int rather than the Java type, likely because there is no data: with no values to inspect, the inferred decimal scale ends up as 0.

Versions

  • Deephaven: 0.18+
  • OS: Linux
  • Browser: N/A
  • Docker: N/A
@niloc132 niloc132 added the bug, triage, python, parquet, and python-server-side labels on Dec 6, 2022

rcaudy commented Dec 6, 2022

My expectation is that DH -> parquet -> DH should be "perfect", meaning the result data should match the input data.
I would not expect the same level of fidelity when passing through pandas; pandas will not recognize DH metadata and consequently won't be able to reconstruct some of our fine-grained type information.

That said, there are two avenues we should explore here:

  1. Are we expressing our BigIntegers and BigDecimals in the best way when written to Parquet? We recently changed to use decimal primitives with this kind of use-case in mind, so hopefully the answer is an unqualified "yes".
  2. Is the pandas->Deephaven conversion step losing precision that we should be able to keep? If all the values are null, I would say we're not losing anything. It may be worth testing a more diverse range of values, along the lines of the sketch below.
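A minimal sketch of such a test, reusing the repro setup from the issue description; these formulas are hypothetical examples for exercising precision/scale inference, not expressions taken from PR #3141:

# Mix of nulls and non-null BigDecimal/BigInteger values at a fixed scale,
# so the writer's inferred precision and scale can be checked on read-back.
dh_table = empty_table(20).update(formulas=[
    "bigDecColumn = i % 5 == 0 ? null : java.math.BigDecimal.valueOf(ii, 2)",
    "bigIntColumn = i % 5 == 0 ? null : java.math.BigInteger.valueOf(ii * 512)",
])
write(dh_table, "data_from_dh.parquet")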


rcaudy commented Feb 27, 2023

I think we may be able to improve this once #3455 is done.

@rcaudy rcaudy added this to the July 2023 milestone on Jul 21, 2023

malhotrashivam commented Aug 8, 2023

Comment from ParquetTableReadWriteTest.java where we have special handling for testing BigDecimal values:

Encoding bigDecimal is tricky -- the writer will try to pick the precision and scale automatically. Because of that, tableTools.assertTableEquals will fail: even though the numbers are identical, the representation may not be, so we have to coerce the expected values to the same precision and scale. We know how it should be doing it, so we can use the same pattern of encoding/decoding with the codec.

We can replicate similar handling in Python testing code as well.
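A rough sketch of what that could look like on the Python side, assuming the expected values are available as decimal.Decimal objects; the helper name and comparison flow here are hypothetical:

from decimal import Decimal

def coerce_scale(value, scale):
    # Rescale a Decimal so its representation matches the precision/scale
    # the Parquet writer chose; None (null) passes through unchanged.
    if value is None:
        return None
    return value.quantize(Decimal(1).scaleb(-scale))

expected = [Decimal("1.5"), None, Decimal("2.25")]
# With scale=2, 1.5 becomes 1.50, so representations compare equal
# even though the writer re-encoded the values.
coerced = [coerce_scale(v, 2) for v in expected]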
