pyarrow datetime is not supported using altair 5.0 #3050

Closed
djouallah opened this issue May 11, 2023 · 5 comments · Fixed by #3076

@djouallah

I created an Altair chart from a DuckDB dataframe exported as an Arrow table, but I get this error:

TypeError: Object of type datetime is not JSON serializable

When I cast the field to string, everything works fine.

@djouallah djouallah added the bug label May 11, 2023
@mattijn
Contributor

mattijn commented May 11, 2023

Thanks for the report! I can reproduce it with the following:

from datetime import datetime
import pyarrow as pa
import altair as alt

data = {
    'date': [datetime(2004, 8, 1), datetime(2004, 9, 1)],
    'value': [102, 129]
}
pa_table = pa.table(data)

alt.Chart(pa_table).mark_line().encode(
    x='date',
    y='value'
)
TypeError: Object of type datetime is not JSON serializable

Trying to narrow it down a bit further. The pa_table is a pyarrow.Table that contains two columns of type timestamp[us] and int64:

pa_table
pyarrow.Table
date: timestamp[us]
value: int64
----
date: [[2004-08-01 00:00:00.000000,2004-09-01 00:00:00.000000]]
value: [[102,129]]

The serialization of the arrow table is done using the dataframe interchange protocol available in pyarrow.

import pyarrow.interchange as pi

pi_table = pi.from_dataframe(pa_table)
pi_table
pyarrow.Table
date: timestamp[us]
value: int64
----
date: [[2004-08-01 00:00:00.000000,2004-09-01 00:00:00.000000]]
value: [[102,129]]

This object is then converted into a Python list before being inserted as values in the generated Vega-Lite specification:

pi_table.to_pylist()
[{'date': datetime.datetime(2004, 8, 1, 0, 0), 'value': 102},
 {'date': datetime.datetime(2004, 9, 1, 0, 0), 'value': 129}]

As can be seen, each date has become a datetime.datetime object. This is problematic because datetime.datetime is not JSON serializable.

In Altair this happens around here: https://github.com/altair-viz/altair/blob/master/altair/utils/data.py#L168-L170

elif hasattr(data, "__dataframe__"):
    # experimental interchange dataframe support
    pi = import_pyarrow_interchange()
    pa_table = pi.from_dataframe(data)
    return {"values": pa_table.to_pylist()}

Currently we do not run the pi_table through sanitize_dataframe, something that does happen for pandas.DataFrames. In this sanitation process we serialize datetime columns as follows:

elif str(dtype).startswith("datetime"):
    # Convert datetimes to strings. This needs to be a full ISO string
    # with time, which is why we cannot use ``col.astype(str)``.
    # This is because Javascript parses date-only times in UTC, but
    # parses full ISO-8601 dates as local time, and dates in Vega and
    # Vega-Lite are displayed in local time by default.
    # (see https://github.com/altair-viz/altair/issues/1027)
    df[col_name] = (
        df[col_name].apply(lambda x: x.isoformat()).replace("NaT", "")
    )

Not sure if we should sanitize interchange dataframes on the Python side or if it is better to push the pyarrow.Table using Arrow IPC serialization to the JavaScript side and construct valid objects there. Probably easier to include sanitation on the Python side.
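For illustration, a minimal sketch of what sanitation on the Python side could look like, applied to records shaped like the to_pylist() output above. The helper name sanitize_records is hypothetical and not part of Altair; it only handles datetime values and mirrors the isoformat() approach used for pandas.

```python
import json
from datetime import datetime


def sanitize_records(records):
    """Return a copy of records with datetime values as full ISO-8601 strings.

    Hypothetical helper for illustration only; not an Altair function.
    """
    sanitized = []
    for row in records:
        sanitized.append(
            {
                # Only datetime values are converted; everything else passes through.
                key: value.isoformat() if isinstance(value, datetime) else value
                for key, value in row.items()
            }
        )
    return sanitized


# Records shaped like the output of pa_table.to_pylist().
records = [
    {"date": datetime(2004, 8, 1), "value": 102},
    {"date": datetime(2004, 9, 1), "value": 129},
]

values = sanitize_records(records)
# With datetimes replaced by strings, JSON serialization no longer raises TypeError.
spec_data = json.dumps({"values": values})
```

Other Arrow types that map to non-JSON Python objects (durations, decimals, and so on) would need their own handling; this sketch does not cover them.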

@amol-

amol- commented Jun 1, 2023

I think you can also explore the idea of extending the JSON serialiser with support for datetimes. For example that's what we have done in TurboGears ( https://github.com/TurboGears/tg2/blob/development/tg/jsonify.py#L94-L98 ).

That way you don't have to care about sanitising types explicitly because the json dumper will take care of it for you independently from where that value came from.
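As a rough sketch of that idea using only the standard library: a json.JSONEncoder subclass whose default() method turns datetime objects into ISO-8601 strings. The class name DateTimeEncoder is made up for this example, and whether Altair's serialization path allows plugging in a custom encoder is an assumption not confirmed in this thread.

```python
import json
from datetime import datetime


class DateTimeEncoder(json.JSONEncoder):
    """JSON encoder that falls back to ISO-8601 strings for datetime objects.

    Illustrative name; not taken from Altair or TurboGears.
    """

    def default(self, obj):
        # default() is only called for objects json cannot serialize natively.
        if isinstance(obj, datetime):
            return obj.isoformat()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


payload = {"values": [{"date": datetime(2004, 8, 1), "value": 102}]}
encoded = json.dumps(payload, cls=DateTimeEncoder)
decoded = json.loads(encoded)
```

The encoder handles datetimes wherever they appear in the payload, at the cost of every json.dumps call site needing to pass cls=DateTimeEncoder.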

@AlenkaF

AlenkaF commented Jun 1, 2023

If you decide to sanitize interchange dataframes on the Python side, you could use the cast function (as @djouallah mentioned) and cast the timestamp column to string, for example:

>>> from datetime import datetime
>>> import pyarrow as pa
>>> arr = pa.array([datetime(2010, 1, 1), datetime(2015, 1, 1)])
>>> arr
<pyarrow.lib.TimestampArray object at 0x12f37a020>
[
  2010-01-01 00:00:00.000000,
  2015-01-01 00:00:00.000000
]
>>> import pyarrow.compute as pc
>>> pc.cast(arr, pa.string())
<pyarrow.lib.StringArray object at 0x12f37a0e0>
[
  "2010-01-01 00:00:00.000000",
  "2015-01-01 00:00:00.000000"
]

Note: in the case of a duration type, an error should be raised.

@mattijn
Contributor

mattijn commented Jun 7, 2023

Thanks for responding @AlenkaF! We have to cast dates to full ISO-8601 strings in order to render them in local time in Altair, whereas dates cast to string as above will render as UTC in JavaScript.
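To make the difference between the two string forms concrete, here is a small standard-library sketch. It only shows how the strings differ; how each form is parsed depends on the JavaScript runtime and is not demonstrated here. Note that Python's str() omits zero microseconds, unlike the pyarrow cast output shown earlier.

```python
from datetime import datetime

dt = datetime(2010, 1, 1)

# Plain string conversion: date and time separated by a space.
as_str = str(dt)

# Full ISO-8601 form: date and time separated by "T".
as_iso = dt.isoformat()

print(as_str)
print(as_iso)
```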

@mattijn
Contributor

mattijn commented Jun 7, 2023

Thanks also @amol- for raising ideas! Appreciated!
