Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Test experimental data manager failing due to new JSON format for ExtensionArray dtypes #44994

Closed
phofl opened this issue Dec 21, 2021 · 7 comments · Fixed by #45001
Closed
Labels
CI Continuous Integration
Milestone

Comments

@phofl
Copy link
Member

phofl commented Dec 21, 2021

First commit on master causing this:

https://github.com/pandas-dev/pandas/runs/4576767389?check_suite_focus=true

This happened on the pr itself too:
#44722

@phofl phofl added the CI Continuous Integration label Dec 21, 2021
@phofl
Copy link
Member Author

phofl commented Dec 21, 2021

cc @pandas-dev/pandas-core

cc @jmg-duarte

@jmg-duarte
Copy link
Contributor

I'll gladly fix it given some guidance, as I have no clue what might have caused it

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 21, 2021

@jmg-duarte Here is how to reproduce:

  1. Set an environment variable PANDAS_DATA_MANAGER to the value array (how you do this depends on your O/S)
  2. pytest pandas/tests/io/json/test_json_table_schema_ext_dtype.py

I was able to reproduce on Windows.

@Dr-Irv Dr-Irv changed the title CI: Test experimental data manager not slow times out CI: Test experimental data manager failing due to new JSON format for ExtensionArray dtypes Dec 21, 2021
@bashtage
Copy link
Contributor

➜ pandas git:(master)✗  pytest pandas/tests/io/json/test_json_table_schema_ext_dtype.py -v
===================================== test session starts ======================================
platform win32 -- Python 3.9.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- C:\Anaconda\envs\pandas-dev\python.exe
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=timedelta(milliseconds=500), suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('C:\\git\\pandas\\.hypothesis\\examples')
rootdir: C:\git\pandas, configfile: pyproject.toml
plugins: hypothesis-6.23.2, forked-1.3.0, xdist-2.3.0
collected 19 items

pandas/tests/io/json/test_json_table_schema_ext_dtype.py::TestTableOrient::test_build_date_series Windows fatal exception: access violation

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 21, 2021

This program causes a failure on master, but I'm wondering if it is something inside the array manager:

import os

os.environ["PANDAS_DATA_MANAGER"] = "array"

import decimal
import pandas as pd
from pandas.tests.extension.decimal.array import DecimalArray

dc = DecimalArray([decimal.Decimal(10)])
s = pd.Series(dc, name="d")
s.index.name = "id"
result = s.to_json(orient="table", date_format="iso")

print(result)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 21, 2021

So the fundamental issue here is that for the JSON serialization to work with the array manager, subclasses of ExtensionArray have to have the method __array__() defined. If I take my example above, and add the following to pandas/tests/extension/decimal/array.py in the appropriate place, that example works:

    def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
        if is_object_dtype(dtype):
            return np.array(list(self), dtype=object)
        return np.asarray(list(self), dtype=dtype)

For any of the built-in ExtensionArray subclasses, we have done this. However, our ExtensionArray interface does not require subclasses of ExtensionArray to implement __array__() .

So I think we have a number of choices

  1. Modify the docs for ExtensionArray indicating that __array__() must be implemented
  2. Modify pandas/_libs/src/ujson/python/objToJSON.c to work with the array manager if __array__() is not defined for an ExtensionArray (See line 715 of that file)
  3. Modify the implementations of the ExtensionArray test classes in pandas/tests/extension/date/array.py and pandas/tests/extension/decimal/array.py to implement __array__()
  4. Consider no longer using our own implementation of json.dumps, and use the standard python implementation. This is probably a bigger change as we'd have to implement our own handler for our objects, but we could do it all in python instead of C.

@jbrockmendel
Copy link
Member

Modify the docs for ExtensionArray indicating that array() must be implemented

This is likely worth doing regardless. Several of the base class EA methods call np.asarray(self) or something similar.

Modify pandas/_libs/src/ujson/python/objToJSON.c to work with the array manager if __array__() is not defined for an ExtensionArray (See line 715 of that file)

-1 on putting more logic into those files.

Block.values_for_json is where the relevant logic lives for BM. Can implement the analogous logic in ArrayManager.column_arrays

@phofl phofl added this to the 1.4 milestone Dec 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants