Pandas 2.0 with pyarrow backend: "TypeError: Cannot interpret 'timestamp[ms][pyarrow]' as a data type" #3127

Closed
sacundim opened this issue Jul 28, 2023 · 13 comments · Fixed by #3128

@sacundim

  • Vega-Altair 5.0.1
  • Pandas 2.0.3
  • PyArrow 12.0.1

Essential outline of what I'm doing:

import pandas as pd

arrow_table = [make an Arrow table]
pandas_df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)
Stack trace from my actual app
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 948, in save
    result = save(**kwds)
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/save.py", line 131, in save
    spec = chart.to_dict()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 838, in to_dict
    copy.data = _prepare_data(original_data, context)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 100, in _prepare_data
    data = _pipe(data, data_transformers.get())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/toolz/functoolz.py", line 628, in pipe
    data = func(data)
           ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/toolz/functoolz.py", line 304, in __call__
    return self._partial(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/data.py", line 19, in default_data_transformer
    return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/toolz/functoolz.py", line 628, in pipe
    data = func(data)
           ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/toolz/functoolz.py", line 304, in __call__
    return self._partial(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/data.py", line 160, in to_values
    data = sanitize_dataframe(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/core.py", line 383, in sanitize_dataframe
    elif np.issubdtype(dtype, np.integer):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/numpy/core/numerictypes.py", line 417, in issubdtype
    arg1 = dtype(arg1).type
           ^^^^^^^^^^^
TypeError: Cannot interpret 'timestamp[ms][pyarrow]' as a data type
@sacundim sacundim added the bug label Jul 28, 2023
sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Jul 28, 2023
@sacundim
Author

I just tried it with this in my app's pyproject.toml and got the same failure:

[tool.poetry.dependencies]
altair = { git = "https://github.com/altair-viz/altair.git", rev = "72a361c" }

@mattijn
Contributor

mattijn commented Jul 28, 2023

Thanks for raising @sacundim! Can you check if the issue you are describing is a duplicate of #3050, recently fixed by PR #3076?

@jonmmease
Contributor

Can you check if the issue you are describing is a duplicate of #3050, recently fixed by PR #3076?

I don't think it is. #3076 was about fixing a sanitization issue for pyarrow / __dataframe__ objects. This issue is about pandas sanitization logic failing for Pandas 2.0 series that are backed by pyarrow. I'll try to take a look soon. This might be another motivation for unifying the sanitization logic to operate on the DataFrame interchange protocol, like we've talked about in the past (#3076 (comment)).

@sacundim
Author

@mattijn The commit I mention in my second comment (72a361c) was the most recent state of the main branch when I made that comment, and later than the merge of #3076.

I was also recently playing with Polars, and I did have type-related problems with Vega-Altair 5.0.1. I saw PR #3076 and got past those problems precisely by using commit 427d5679, the one that merged #3076.

So with commits that have #3076 I have indeed observed this app working with Polars frames but not with Arrow-backed Pandas 2.0 frames.

@sacundim
Author

@jonmmease I'm obviously not nearly as deep into this as you are, but given how central Arrow is becoming to the whole ecosystem, and the various comments I've read saying that support for this dataframe interchange protocol is still experimental, I wonder if just building first-class support for Arrow might be wise.

@jonmmease
Contributor

Hi @sacundim, yeah we totally agree. We're working on moving various parts of Altair from depending directly on the pandas API to working with the combination of the DataFrame Interchange Protocol and pyarrow. One recent example was #3114, where I updated the encoding type inference to rely on the DataFrame interchange protocol when available.

One note is that we'll need to keep the pandas-centric logic around for a while because we still support versions of pandas before the DataFrame Interchange Protocol was introduced, and we want Altair to work in environments like Pyodide where pyarrow isn't available yet.
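
The constraint described above (prefer the interchange protocol plus pyarrow, but keep a pandas-only fallback for older pandas and for environments without pyarrow) can be pictured as a small dispatch step. This is a sketch under stated assumptions: the function names and the returned labels are hypothetical and are not Altair's API.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Report whether pyarrow can be imported in this environment."""
    return importlib.util.find_spec("pyarrow") is not None


def choose_sanitization_path(data, has_pyarrow=None) -> str:
    """Pick a code path for `data` (hypothetical names, not Altair's API).

    "interchange": the object exposes __dataframe__ and pyarrow is available.
    "pandas":      fall back to the pandas-centric logic.
    """
    if has_pyarrow is None:
        has_pyarrow = pyarrow_available()
    if hasattr(data, "__dataframe__") and has_pyarrow:
        return "interchange"
    return "pandas"


class FakeInterchangeFrame:
    """Stand-in for any object implementing the interchange protocol."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("placeholder for illustration only")


print(choose_sanitization_path(FakeInterchangeFrame(), has_pyarrow=True))
print(choose_sanitization_path(FakeInterchangeFrame(), has_pyarrow=False))
```

Passing `has_pyarrow` explicitly makes the Pyodide-style case (protocol present, pyarrow missing) easy to exercise without changing the environment.
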

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Jul 28, 2023
… order to reconfirm that Vega-Altair works just fine there, and the issue that I report for Pandas 2.0 is happening only for Pandas 2.0. vega/altair#3127
@sacundim
Author

Just to make things a bit clearer: (1) I've got two experimental branches in the project in question:

  • polars-try2
  • pandas-2.0

(2) I've just now got both using Vega-Altair from the GitHub main branch:

[tool.poetry.dependencies]
altair = { git = "https://github.com/altair-viz/altair.git", rev = "72a361c" }

(3) The following function in my polars-try2 branch uses PyAthena 3.0.6 to produce pyarrow tables, wraps them into Polars dataframes, and Vega-Altair understands the latter just fine:

import polars as pl
from pyathena.arrow.cursor import ArrowCursor

def execute_polars(athena, sql, params={}):
    with athena.cursor(ArrowCursor) as cursor:
        return pl.from_arrow(cursor.execute(sql, params).as_arrow())

(4) But the following function in my pandas-2.0 branch uses PyAthena 3.0.6 to produce pyarrow tables in exactly the same way, differs only in that it packages them into Pandas 2.0 dataframes, and gets the error that this ticket reports:

import pandas as pd
from pyathena.arrow.cursor import ArrowCursor

def execute_pandas(athena, query, params={}):
    """Execute a query with PyAthena, return the result set as Pandas"""
    with athena.cursor(ArrowCursor) as cursor:
        arrow = cursor.execute(query, params).as_arrow()
        return arrow.to_pandas(types_mapper=pd.ArrowDtype)

@jonmmease
Contributor

@sacundim, could you try out this branch when you have a chance?

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Jul 29, 2023
…sue: vega/altair#3127

Different error:

```
  File "/usr/local/lib/python3.11/site-packages/pandas/core/interchange/utils.py", line 90, in dtype_to_arrow_c_fmt
    raise NotImplementedError(
NotImplementedError: Conversion of timestamp[ms][pyarrow] to Arrow C format string is not implemented.
```
@sacundim
Author

@jonmmease Different error now (full stack trace at bottom):

NotImplementedError: Conversion of timestamp[ms][pyarrow] to Arrow C format string is not implemented.

This might not be the exact same code path on my side as the original failure.

Stack trace
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 1066, in save
    result = save(**kwds)
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/save.py", line 189, in save
    perform_save()
  File "/usr/local/lib/python3.11/site-packages/altair/utils/save.py", line 127, in perform_save
    spec = chart.to_dict(context={"pre_transform": False})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 903, in to_dict
    vegalite_spec = super(TopLevelMixin, copy).to_dict(  # type: ignore[misc]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 807, in to_dict
    result = _todict(
             ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in _todict
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in <dictcomp>
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 338, in _todict
    return [_todict(v, context) for v in obj]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 338, in <listcomp>
    return [_todict(v, context) for v in obj]
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 336, in _todict
    return obj.to_dict(validate=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 2677, in to_dict
    return super().to_dict(
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 903, in to_dict
    vegalite_spec = super(TopLevelMixin, copy).to_dict(  # type: ignore[misc]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 807, in to_dict
    result = _todict(
             ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in _todict
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in <dictcomp>
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 336, in _todict
    return obj.to_dict(validate=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 807, in to_dict
    result = _todict(
             ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in _todict
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 340, in <dictcomp>
    return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/schemapi.py", line 336, in _todict
    return obj.to_dict(validate=False, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/vegalite/v5/schema/channels.py", line 34, in to_dict
    parsed = parse_shorthand(shorthand, data=context.get('data', None))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/core.py", line 596, in parse_shorthand
    attrs["type"] = infer_vegalite_type_for_dfi_column(column)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/altair/utils/core.py", line 648, in infer_vegalite_type_for_dfi_column
    raise e
  File "/usr/local/lib/python3.11/site-packages/altair/utils/core.py", line 638, in infer_vegalite_type_for_dfi_column
    kind = column.dtype[0]
           ^^^^^^^^^^^^
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/usr/local/lib/python3.11/site-packages/pandas/core/interchange/column.py", line 126, in dtype
    return self._dtype_from_pandasdtype(dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/interchange/column.py", line 141, in _dtype_from_pandasdtype
    return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), dtype.byteorder
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/interchange/utils.py", line 90, in dtype_to_arrow_c_fmt
    raise NotImplementedError(
NotImplementedError: Conversion of timestamp[ms][pyarrow] to Arrow C format string is not implemented.

@sacundim
Author

sacundim commented Jul 29, 2023

@jonmmease Aha, got it, the following excerpt of this code leads to the chart failing with the error shown above:

        data_date = alt.Chart(df).mark_text(baseline='middle').encode(
            text=alt.Text('bulletin_date',
                          type='temporal',
                          aggregate='max',
                          timeUnit='yearmonthdate',
                          format='Datos hasta: %A %d de %B, %Y'),
        )

...whereas the following one works just fine:

        data_date = alt.Chart(df).mark_text(baseline='middle').encode(
            text=alt.Text('bulletin_date:T',
                          aggregate='max',
                          timeUnit='yearmonthdate',
                          format='Datos hasta: %A %d de %B, %Y'),
        )
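
A plausible reading of the two excerpts, consistent with the stack trace above: when the shorthand string carries no type suffix, the type is inferred from the data, and that inference is what touches the unsupported dtype. A `:T` suffix makes inference unnecessary. The toy parser below models only that distinction. It is a simplified sketch, not Altair's `parse_shorthand`; the function names and the type-code table are assumptions.

```python
# Single-letter type codes accepted after the colon in a shorthand string.
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "O": "ordinal", "T": "temporal"}


def parse_shorthand_sketch(shorthand, infer_type):
    """Split 'field:CODE' into field and type; infer only when no code is given."""
    field, sep, code = shorthand.rpartition(":")
    if sep and code in TYPE_CODES:
        return {"field": field, "type": TYPE_CODES[code]}
    # No recognized suffix: the whole string is the field name, so fall back
    # to inspecting the data, which is where the reported error surfaced.
    return {"field": shorthand, "type": infer_type(shorthand)}


def failing_inference(field):
    """Stand-in for dtype-based inference that cannot handle the column."""
    raise NotImplementedError(f"cannot infer a type for {field!r}")


# The suffix form never reaches the inference step.
print(parse_shorthand_sketch("bulletin_date:T", failing_inference))
```

Calling `parse_shorthand_sketch("bulletin_date", failing_inference)` instead would propagate the `NotImplementedError`, mirroring the first excerpt.
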

@jonmmease
Contributor

Thanks @sacundim, I see what's going on. Should be fixed on the same branch now.

sacundim pushed a commit to sacundim/covid-19-puerto-rico that referenced this issue Jul 29, 2023
@sacundim
Author

@jonmmease Revision 57a4ad7 works on my end, thanks!

@mattijn
Contributor

mattijn commented Jul 30, 2023

Thanks for reviewing @sacundim! Seems like a nice project you are working on. I can't judge the content, but it is nice to see this type of extended Altair code specification! If you have other suggestions for how the package can be improved, please feel free to open a new issue or discussion.

3 participants