ENH: allow JSON (de)serialization of ExtensionDtypes #44722

Merged
7 commits merged on Dec 19, 2021
Changes from 2 commits
2 changes: 1 addition & 1 deletion doc/source/development/developer.rst
Original file line number Diff line number Diff line change
@@ -180,7 +180,7 @@ As an example of fully-formed metadata:
'numpy_type': 'int64',
'metadata': None}
],
'pandas_version': '0.20.0',
'pandas_version': '1.4.0',
'creator': {
'library': 'pyarrow',
'version': '0.13.0'
17 changes: 17 additions & 0 deletions doc/source/user_guide/io.rst
@@ -1903,6 +1903,7 @@ with optional parameters:
``index``; dict like {index -> {column -> value}}
``columns``; dict like {column -> {index -> value}}
``values``; just the values array
``table``; adhering to the JSON `Table Schema`_

* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10.
@@ -1919,6 +1920,18 @@ Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datet
json = dfj.to_json()
json

.. note::

   When using ``orient='table'`` along with a user-defined ``ExtensionArray``,
the generated schema will contain an additional ``extDtype`` key in the respective
``fields`` element. This extra key is not standard but does enable JSON roundtrips
for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``).

   The ``extDtype`` key carries the name of the extension. If you have properly
   registered the ``ExtensionDtype``, pandas will use that name to look the dtype
   up in the registry and re-convert the serialized data into your custom dtype.
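The roundtrip described in the note can be sketched with pandas' built-in nullable ``Int64`` dtype, which is already registered in the ``ExtensionDtype`` registry (a minimal sketch assuming pandas >= 1.4; any properly registered extension dtype behaves the same way):

```python
from io import StringIO

import pandas as pd

# "Int64" is a registered ExtensionDtype, so orient="table" writes an
# "extDtype" key into the schema and read_json can recover the dtype.
df = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")})
payload = df.to_json(orient="table")
roundtripped = pd.read_json(StringIO(payload), orient="table")
print(roundtripped["a"].dtype)  # Int64
```

Without the ``extDtype`` key, the column would come back as a plain NumPy dtype.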


Orient options
++++++++++++++

@@ -2477,6 +2490,10 @@ A few notes on the generated table schema:
* For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
then ``level_<i>`` is used.

* When using a ``DataFrame`` containing a ``Series`` backed by a user-defined
  ``ExtensionArray``, the generated JSON will contain an extra ``extDtype``
  key under the respective ``fields`` array element. While this key is not standard,
  it enables roundtripping for custom types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``).
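The extra key can be inspected directly with the public ``build_table_schema`` helper (a sketch assuming pandas >= 1.4, shown with the registered nullable ``Int64`` dtype; the ``type`` value depends on the dtype):

```python
import pandas as pd
from pandas.io.json import build_table_schema

# A Series backed by an ExtensionArray gains an "extDtype" entry
# alongside the standard "name" and "type" keys of its field.
s = pd.Series(pd.array([1, 2], dtype="Int64"), name="a")
schema = build_table_schema(s, index=False, version=False)
print(schema["fields"][0])
```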

``read_json`` also accepts ``orient='table'`` as an argument. This allows for
the preservation of metadata such as dtypes and index names in a
1 change: 0 additions & 1 deletion doc/source/whatsnew/v1.3.5.rst
@@ -29,7 +29,6 @@ Fixed regressions
Bug fixes
~~~~~~~~~
-
-

.. ---------------------------------------------------------------------------

1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -229,6 +229,7 @@ Other enhancements
- :meth:`Series.info` has been added, for compatibility with :meth:`DataFrame.info` (:issue:`5167`)
- Implemented :meth:`IntervalArray.min`, :meth:`IntervalArray.max`, as a result of which ``min`` and ``max`` now work for :class:`IntervalIndex`, :class:`Series` and :class:`DataFrame` with ``IntervalDtype`` (:issue:`44746`)
- :meth:`UInt64Index.map` now retains ``dtype`` where possible (:issue:`44609`)
- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`).
-


7 changes: 6 additions & 1 deletion pandas/core/generic.py
@@ -2444,6 +2444,11 @@ def to_json(
``orient='table'`` contains a 'pandas_version' field under 'schema'.
This stores the version of `pandas` used in the latest revision of the
schema.
When using columns backed by an :class:`ExtensionDtype`, the corresponding
schema fields carry an 'extDtype' entry. This entry stores the
:class:`ExtensionDtype` name and is used to resolve the correct dtype
during deserialization; the values themselves are rebuilt through the
matching :meth:`ExtensionArray._from_sequence` method.

Examples
--------
@@ -2567,7 +2572,7 @@ def to_json(
"primaryKey": [
"index"
],
"pandas_version": "0.20.0"
"pandas_version": "1.4.0"
}},
"data": [
{{
4 changes: 1 addition & 3 deletions pandas/io/json/_json.py
@@ -68,8 +68,6 @@
loads = json.loads
dumps = json.dumps

TABLE_SCHEMA_VERSION = "0.20.0"


# interface to/from
def to_json(
@@ -565,7 +563,7 @@ def read_json(
{{"name":"col 1","type":"string"}},\
{{"name":"col 2","type":"string"}}],\
"primaryKey":["index"],\
"pandas_version":"0.20.0"}},\
"pandas_version":"1.4.0"}},\
"data":[\
{{"index":"row 1","col 1":"a","col 2":"b"}},\
{{"index":"row 2","col 1":"c","col 2":"d"}}]\
14 changes: 12 additions & 2 deletions pandas/io/json/_table_schema.py
@@ -18,11 +18,13 @@
JSONSerializable,
)

from pandas.core.dtypes.base import _registry as registry
from pandas.core.dtypes.common import (
is_bool_dtype,
is_categorical_dtype,
is_datetime64_dtype,
is_datetime64tz_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_numeric_dtype,
is_period_dtype,
@@ -40,6 +42,8 @@

loads = json.loads

TABLE_SCHEMA_VERSION = "1.4.0"


def as_json_table_type(x: DtypeObj) -> str:
"""
@@ -83,6 +87,8 @@ def as_json_table_type(x: DtypeObj) -> str:
return "duration"
elif is_categorical_dtype(x):
return "any"
elif is_extension_array_dtype(x):
return "any"
elif is_string_dtype(x):
return "string"
else:
@@ -134,6 +140,8 @@ def convert_pandas_type_to_json_field(arr):
field["freq"] = dtype.freq.freqstr
elif is_datetime64tz_dtype(dtype):
field["tz"] = dtype.tz.zone
elif is_extension_array_dtype(dtype):
field["extDtype"] = dtype.name
return field


@@ -199,6 +207,8 @@ def convert_json_field_to_pandas_type(field):
return CategoricalDtype(
categories=field["constraints"]["enum"], ordered=field["ordered"]
)
elif "extDtype" in field:
return registry.find(field["extDtype"])
else:
return "object"
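The ``registry.find`` lookup used here can be sketched directly (``_registry`` is a private import, mirrored from the module above; pandas' own nullable dtypes are pre-registered under their ``name``):

```python
import pandas as pd
from pandas.core.dtypes.base import _registry as registry  # private, as imported above

# find() maps a registered dtype name back to an ExtensionDtype instance,
# which is exactly how an "extDtype" schema entry is resolved.
dtype = registry.find("Int64")
print(dtype)  # Int64
```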

@@ -257,7 +267,7 @@ def build_table_schema(
{'name': 'B', 'type': 'string'}, \
{'name': 'C', 'type': 'datetime'}], \
'primaryKey': ['idx'], \
'pandas_version': '0.20.0'}
'pandas_version': '1.4.0'}
"""
if index is True:
data = set_default_names(data)
@@ -291,7 +301,7 @@
schema["primaryKey"] = primary_key

if version:
schema["pandas_version"] = "0.20.0"
schema["pandas_version"] = TABLE_SCHEMA_VERSION
return schema


7 changes: 5 additions & 2 deletions pandas/tests/extension/decimal/array.py
@@ -67,8 +67,11 @@ class DecimalArray(OpsMixin, ExtensionScalarOpsMixin, ExtensionArray):

def __init__(self, values, dtype=None, copy=False, context=None):
for i, val in enumerate(values):
if is_float(val) and np.isnan(val):
values[i] = DecimalDtype.na_value
if is_float(val):
if np.isnan(val):
values[i] = DecimalDtype.na_value
else:
values[i] = DecimalDtype.type(val)
elif not isinstance(val, decimal.Decimal):
raise TypeError("All values must be of type " + str(decimal.Decimal))
values = np.asarray(values, dtype=object)
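The revised constructor loop can be sketched standalone (``coerce`` is a hypothetical helper, not part of pandas; it mirrors the new float-to-``Decimal`` promotion and the NaN handling):

```python
import decimal
import math

def coerce(values, na_value=decimal.Decimal("NaN")):
    # Floats are promoted to Decimal (NaN floats become the na_value);
    # anything that is not a float must already be a Decimal.
    out = []
    for val in values:
        if isinstance(val, float):
            out.append(na_value if math.isnan(val) else decimal.Decimal(val))
        elif not isinstance(val, decimal.Decimal):
            raise TypeError("All values must be of type " + str(decimal.Decimal))
        else:
            out.append(val)
    return out

print(coerce([1.5, decimal.Decimal("2")]))  # [Decimal('1.5'), Decimal('2')]
```

Previously only NaN floats were handled; other floats fell through to the ``TypeError`` branch.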