
JSON table orient not roundtripping extension types #32037

Open
navado opened this issue Feb 16, 2020 · 23 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. IO JSON read_json, to_json, json_normalize

Comments

navado commented Feb 16, 2020

Code Sample, a copy-pastable example if possible

import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4'],
    },
    dtype='string',
)

print(df.info())
for orient in ['table','split','records','values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient)
    print(f'======{orient}======')
    print(rdf.info())

Problem description

The string dtype is not preserved through round-trip serialization to JSON, so dataframes containing strings cannot be reused transparently.

Expected Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

Actual output

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       0 non-null      object
 1   2       0 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Output of pd.show_versions()

{'dirty': False, 'error': None, 'full-revisionid': '29d6b0232aab9576afa896ff5bab0b994760495a', 'version': '1.0.1'}
MarcoGorelli (Member) commented Feb 16, 2020

Thanks @navado

I don't believe DataFrame.to_json was ever intended to store metadata. Have you tried with to_parquet?

>>> df = pd.DataFrame({'a': ['a']}, dtype='string')                         
>>> df                                                                      
   a
0  a

>>> df.info()                                                               
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       1 non-null      string
dtypes: string(1)
memory usage: 136.0 bytes

>>> df.to_parquet('temp.pq')                                                
>>> pd.read_parquet('temp.pq')                                              
   a
0  a

>>> pd.read_parquet('temp.pq').info()                                       
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       1 non-null      string
dtypes: string(1)
memory usage: 136.0 bytes

WillAyd (Member) commented Feb 16, 2020

Have you tried orient="table"? That does store metadata about the dtypes, so it is the best bet for round-tripping.

That said, better inference is generally possible, so I think we would accept a PR to do this.

navado (Author) commented Feb 16, 2020

@WillAyd - yes, it's the first print in the issue.

@MarcoGorelli - to_parquet (and most of the other to_* functions) requires dealing with storage, whereas to_json is easily applicable for preparing a dataframe to send over the network.

In general it's somewhat unexpected that metadata is not stored in such a flexible format as JSON. It's not a compatibility issue (or I'm not aware of one).

Additionally, there are several more issues with round-trip serialization of JSON that I only realized now:

  • The size of the deserialized dataframe differs from the original and varies depending on the orientation (see table and split vs. the rest)
  • Column numbering differs between orientations (see the prints for values vs. the rest)
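A minimal sketch (not from the thread) illustrating the second bullet: with orient='values' only the cell values are serialized, so column labels come back as integer positions. The column names are arbitrary examples, and StringIO is used because newer pandas versions deprecate passing a literal JSON string to read_json.

```python
from io import StringIO

import pandas as pd

# Hypothetical column names, for illustration only.
df = pd.DataFrame({"x": ["v1", "v2"], "y": ["v3", "v4"]})

# orient='values' keeps only the cell values, so neither the column
# labels nor the index survive the round trip.
rdf = pd.read_json(StringIO(df.to_json(orient="values")), orient="values")
print(list(df.columns))   # ['x', 'y']
print(list(rdf.columns))  # [0, 1]
```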

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Feb 16, 2020
WillAyd (Member) commented Feb 16, 2020

Ah, sorry, I misread that. At the very least orient="table" should be able to preserve this, so we would certainly take a PR to fix that.

Not sure about the rest just yet. For most IO right now we don't infer extension types like "string" automatically, but it is something we have talked about as a team.

@jorisvandenbossche

@WillAyd WillAyd changed the title Roundtrip JSON serialization doesn't preserve string type JSON table orient not roundtripping extension types Feb 16, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone Feb 16, 2020
@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 16, 2020
@jorisvandenbossche (Member)

As long as the string dtype is not the default, we can't infer it from a file format that has no type information (like JSON). There has been discussion about adding a keyword to specify that you want the new dtypes, see #29752.

Specifically for the JSON table format: that format indeed includes some type information. However, pandas itself doesn't define the spec; I think we follow https://frictionlessdata.io/specs/table-schema/ (no specialist here).
If that is the case, we still can't distinguish the two cases.
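One way to see the writer's side of this concretely: build_table_schema, the public helper behind orient='table', describes object-dtype strings with table-schema type "string" — the same field type under discussion for the string extension dtype. A sketch:

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Object-dtype strings are described as table-schema type "string",
# which is why the reader maps "string" back to object dtype.
s = pd.Series(["a", "b"], dtype=object, name="s")
fields = build_table_schema(s, index=False)["fields"]
print(fields)  # [{'name': 's', 'type': 'string'}]
```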

WillAyd (Member) commented Feb 16, 2020

Table should work; we actually write that it is a string dtype, so something is just off with the reader.

@jorisvandenbossche (Member)

So nullable boolean and integer might work, indeed (but only if nulls are present; otherwise we still can't know which type to use).

But as explained above, I don't see how we can differentiate the string dtype from an object dtype containing strings. Both end up in the JSON schema as type "string". The types are not pandas-specific but follow the table-schema specification.

navado (Author) commented Feb 17, 2020

Should JSON serialization perhaps be Avro-based, or resemble Avro? That is, solve the issue by writing schema information to the file along with the data?

@jorisvandenbossche (Member)

An Avro interface for pandas might be interesting, but I think that is a different discussion? (Or it would be a new format.)

The current json output formats do not include schema information, except for the table schema. And for that, there is a specification to follow.

navado (Author) commented Feb 17, 2020

@jorisvandenbossche Avro is definitely a different topic, but saving the schema in the serialized JSON should not be a big problem.

@jorisvandenbossche (Member)

Saving schema information in the JSON outputs is something that could certainly be discussed, but I just wanted to point out that this is still something else / bigger than how the current issue is framed here as "table orient not correctly roundtripping extension types".

So if we want to discuss this further, that warrants a separate issue I think.

navado (Author) commented Feb 18, 2020

Or just revert renaming the issue :)

navado (Author) commented Feb 18, 2020

Well, back to the problem.
I just discovered that with orient='table' the schema is written to the JSON, but it seems it is silently ignored inside the DataFrame constructor:

import pandas as pd
df = pd.DataFrame(
        {'A': [1, 2, 3],
         'B': ['a', 'b', 'c'],
         'C': pd.date_range('2016-01-01', freq='d', periods=3),
         }, index=pd.Index(range(3), name='idx'))
json = df.to_json(orient='table')
print(json)

{"schema":{"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"},"data":[{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}
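Reading that payload back makes the loss visible. A sketch of the round trip (StringIO wraps the literal JSON because newer pandas deprecates passing it directly to read_json):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", "b", "c"],
        "C": pd.date_range("2016-01-01", freq="D", periods=3),
    },
    index=pd.Index(range(3), name="idx"),
)
payload = df.to_json(orient="table")

# The schema describes B as "string", but the reader maps that back to
# plain object dtype rather than the string extension dtype.
rdf = pd.read_json(StringIO(payload), orient="table")
print(rdf["B"].dtype)  # object
```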

@jorisvandenbossche (Member)

What do you mean with "being ignored inside the DataFrame constructor"?

navado (Author) commented Feb 18, 2020

No, sorry, I just looked in the wrong place. It's ignored in the case of numpy-based parsing, which as I understand it is deprecated.
So the conversion of string to object is completely intentional.
pandas/io/json/_table_schema.py#L164 says:

def convert_json_field_to_pandas_type(field):
...
    if typ == "string":
        return "object"
...

@jorisvandenbossche (Member)

So the conversion of string to object is completely intentional.
pandas/io/json/_table_schema.py#L164 says:

Yes, because that is the default dtype for strings. As mentioned above, there has been discussion about adding a keyword to opt in (#29752). But currently the schema information saved in the JSON table schema is not detailed enough to know the original pandas dtype and do a faithful roundtrip.

navado (Author) commented Feb 19, 2020

Why not something like this?

----------------------- pandas/io/json/_table_schema.py -----------------------
index 4e42533ca..b9c963c67 100644
@@ -70,6 +70,8 @@ def as_json_table_type(x):
         return "any"
     elif is_string_dtype(x):
         return "string"
+    elif is_object_dtype(x):
+        return "object"
     else:
         return "any"
 
@@ -162,6 +164,8 @@ def convert_json_field_to_pandas_type(field):
     """
     typ = field["type"]
     if typ == "string":
+        return "string"
+    if typ == "object":
         return "object"
     elif typ == "integer":
         return "int64"

@jorisvandenbossche (Member)

That might be a backwards-incompatible change for people using strings in object dtype (now, I almost never use this myself, so I can't really assess the possible impact).

But what might be a better way forward is to add an additional entry to the field description. The spec allows adding more entries (https://frictionlessdata.io/specs/table-schema/#field-descriptors), and we already do this for e.g. categoricals (ordered keyword) and datetimes (tz keyword).
So we could think about storing some extra information that can better preserve the dtype.
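The extension point described here is already visible in the writer. A sketch showing the extra field entries pandas emits for categoricals and tz-aware datetimes (the column names are illustrative):

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", "b"], ordered=True),
        "ts": pd.date_range("2020-01-01", periods=2, tz="UTC"),
    }
)
fields = build_table_schema(df, index=False)["fields"]
for field in fields:
    print(field)
# The "cat" field carries ordered=True (plus an enum constraint) and the
# "ts" field carries tz='UTC', so the same mechanism could in principle
# record enough information to restore an extension dtype.
```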

WillAyd (Member) commented Feb 19, 2020

I think this has been a pretty enlightening discussion. I for one didn't realize that we just wrote "string" in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design like this:

# Writes string for the dtype, but doesn't write strings in JSON
>>> pd.Series([1], dtype=object).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'

And this

# Same as above, but doesn't preserve key anyway
>>> pd.Series([{1: 2}]).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":{"1":2}}]}'

It would be tough to manage backwards compatibility, but I would support using "string" appropriately on the writing side, and perhaps separate metadata for object dtypes, which would probably solve a lot of round-tripping issues.

Dr-Irv (Contributor) commented Dec 1, 2021

At least for some of the built-in extension dtypes, you can use convert_dtypes() to do the conversion:

import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4'],
    },
    dtype='string',
)

print(df.info())
for orient in ['table','split','records','values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient).convert_dtypes()
    print(f'======{orient}======')
    print(rdf.info())

produces

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       0 non-null      Int64
 1   2       0 non-null      Int64
dtypes: Int64(2)
memory usage: 52.0 bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 48.0 bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       2 non-null      string
 1   1       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

@jmg-duarte (Contributor)

@jorisvandenbossche moving part of the discussion here and addressing your last point in #20612 (comment):

Not only am I interested in making this work for dates, I also need it to.
I would like to know what I can do to make this happen.

jorisvandenbossche (Member) commented Dec 1, 2021

Maybe you can create a new issue specifically about supporting "date" for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

Dr-Irv (Contributor) commented Dec 1, 2021

Maybe you can create a new issue specifically about supporting "date" for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

I agree with this. I have some comments/questions in response to #20612 (comment) that would be better handled in a separate issue about dates.
