
JSON table orient not roundtripping extension types #32037

Open
navado opened this issue Feb 16, 2020 · 23 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. IO JSON read_json, to_json, json_normalize

Comments

navado commented Feb 16, 2020

Code Sample, a copy-pastable example if possible

import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4'],
    },
    dtype='string',
)

print(df.info())
for orient in ['table','split','records','values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient)
    print(f'======{orient}======')
    print(rdf.info())

Problem description

The string dtype is not preserved through round-trip serialization to JSON, so dataframes containing strings cannot be reused transparently.

Expected Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

Actual output

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       0 non-null      object
 1   2       0 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Output of pd.show_versions()

{'dirty': False, 'error': None, 'full-revisionid': '29d6b0232aab9576afa896ff5bab0b994760495a', 'version': '1.0.1'}
MarcoGorelli (Member) commented Feb 16, 2020

Thanks @navado

I don't believe DataFrame.to_json was ever intended to store metadata. Have you tried with to_parquet?

>>> df = pd.DataFrame({'a': ['a']}, dtype='string')                         
>>> df                                                                      
   a
0  a

>>> df.info()                                                               
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       1 non-null      string
dtypes: string(1)
memory usage: 136.0 bytes

>>> df.to_parquet('temp.pq')                                                
>>> pd.read_parquet('temp.pq')                                              
   a
0  a

>>> pd.read_parquet('temp.pq').info()                                       
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       1 non-null      string
dtypes: string(1)
memory usage: 136.0 bytes

WillAyd (Member) commented Feb 16, 2020

Have you tried orient="table"? That does store metadata about the dtypes, so it is the best bet for round-tripping.

That said, better inference is generally possible, so I think we would accept a PR to do this.

navado (Author) commented Feb 16, 2020

@WillAyd - yes, it's the first print in the issue.

@MarcoGorelli - to_parquet (and most of the other to_* functions) requires dealing with storage, whereas to_json is easily applicable for preparing a dataframe to send over the network.

In general it's somewhat unexpected that metadata is not stored in such a flexible format as JSON. It's not a compatibility issue (or I'm not aware of one).

Additionally, there are several more issues with round-trip serialization of JSON that I only realized now:

  • The size of the deserialized dataframe differs from the original and varies depending on the orientation (see table and split vs. the rest)
  • Column numbering differs between orientations (see the prints for values vs. the rest)
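A minimal sketch (not from the thread) illustrating the second bullet: with orient='values' only the cell values are serialized, so column labels come back as integer positions. The column names are arbitrary examples, and StringIO is used because newer pandas versions deprecate passing a literal JSON string to read_json.

```python
from io import StringIO

import pandas as pd

# Hypothetical column names, for illustration only.
df = pd.DataFrame({"x": ["v1", "v2"], "y": ["v3", "v4"]})

# orient='values' keeps only the cell values, so neither the column
# labels nor the index survive the round trip.
rdf = pd.read_json(StringIO(df.to_json(orient="values")), orient="values")
print(list(df.columns))   # ['x', 'y']
print(list(rdf.columns))  # [0, 1]
```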

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Feb 16, 2020
WillAyd (Member) commented Feb 16, 2020

Ah, sorry, I misread that. At the very least orient="table" should be able to preserve this, so we would certainly take a PR to fix that.

Not sure about the rest just yet. For most IO right now we don't infer extension types like "string" automatically, but it is something we have talked about as a team.

@jorisvandenbossche

@WillAyd WillAyd changed the title Roundtrip JSON serialization doesn't preserve string type JSON table orient not roundtripping extension types Feb 16, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone Feb 16, 2020
@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 16, 2020
@jorisvandenbossche (Member)

As long as the string dtype is not the default, we can't infer it from a file format that has no type information (like JSON). There has been discussion about adding a keyword to specify that you want the new dtypes, see #29752.

Specifically for the JSON table format: that format indeed includes some type information. However, pandas itself doesn't define the spec; I think we follow https://frictionlessdata.io/specs/table-schema/ (no specialist here).
If that is the case, we still can't distinguish the two cases.
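One way to see the writer's side of this concretely: build_table_schema, the public helper behind orient='table', describes object-dtype strings with table-schema type "string" — the same field type under discussion for the string extension dtype. A sketch:

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Object-dtype strings are described as table-schema type "string",
# which is why the reader maps "string" back to object dtype.
s = pd.Series(["a", "b"], dtype=object, name="s")
fields = build_table_schema(s, index=False)["fields"]
print(fields)  # [{'name': 's', 'type': 'string'}]
```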

WillAyd (Member) commented Feb 16, 2020

Table should work; we actually write that it is a string dtype, so something is just off with the reader.

@jorisvandenbossche (Member)

So nullable boolean and integer might work, indeed (but only if nulls are present; otherwise we still can't know which type to use).

But as explained above, I don't see how we can differentiate the string dtype from an object dtype containing strings. Both end up in the JSON schema as type "string". The types are not pandas-specific but follow the table-schema specification.

navado (Author) commented Feb 17, 2020

Should JSON serialization perhaps be Avro-based, or resemble Avro? That is, solve the issue by writing schema information to the file along with the data?

@jorisvandenbossche (Member)

An Avro interface for pandas might be interesting, but I think that is a different discussion? (Or it would be a new format.)

The current json output formats do not include schema information, except for the table schema. And for that, there is a specification to follow.

navado (Author) commented Feb 17, 2020

@jorisvandenbossche Avro is definitely a different topic, but saving the schema in the serialized JSON should not be a big problem.

@jorisvandenbossche (Member)

Saving schema information in the JSON outputs is something that could certainly be discussed, but I just wanted to point out that this is still something else / bigger than how the current issue is framed here as "table orient not correctly roundtripping extension types".

So if we want to discuss this further, that warrants a separate issue I think.

navado (Author) commented Feb 18, 2020

Or just revert renaming the issue :)

navado (Author) commented Feb 18, 2020

Well, back to the problem.
I just discovered that with orient='table' the schema is written to the JSON, but it seems it is silently ignored inside the DataFrame constructor:

import pandas as pd
df = pd.DataFrame(
        {'A': [1, 2, 3],
         'B': ['a', 'b', 'c'],
         'C': pd.date_range('2016-01-01', freq='d', periods=3),
         }, index=pd.Index(range(3), name='idx'))
json = df.to_json(orient='table')
print(json)

{"schema":{"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"},"data":[{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}
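Reading that payload back makes the loss visible. A sketch of the round trip (StringIO wraps the literal JSON because newer pandas deprecates passing it directly to read_json):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", "b", "c"],
        "C": pd.date_range("2016-01-01", freq="D", periods=3),
    },
    index=pd.Index(range(3), name="idx"),
)
payload = df.to_json(orient="table")

# The schema describes B as "string", but the reader maps that back to
# plain object dtype rather than the string extension dtype.
rdf = pd.read_json(StringIO(payload), orient="table")
print(rdf["B"].dtype)  # object
```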

@jorisvandenbossche (Member)

What do you mean with "being ignored inside the DataFrame constructor"?

navado (Author) commented Feb 18, 2020

No, sorry, I just looked in the wrong place. It's ignored in the case of numpy-based parsing, which as I understand it is deprecated.
So the conversion of string to object is completely intentional.
pandas/io/json/_table_schema.py#L164 says:

def convert_json_field_to_pandas_type(field):
...
    if typ == "string":
        return "object"
...

@jorisvandenbossche (Member)

So the conversion of string to object is completely intentional.
pandas/io/json/_table_schema.py#L164 says:

Yes, because that is the default dtype for strings. As mentioned above, there has been discussion about adding a keyword to opt in (#29752). But currently the schema information saved in the JSON table schema is not detailed enough to know the original pandas dtype and do a faithful roundtrip.

navado (Author) commented Feb 19, 2020

Why not something like this?

----------------------- pandas/io/json/_table_schema.py -----------------------
index 4e42533ca..b9c963c67 100644
@@ -70,6 +70,8 @@ def as_json_table_type(x):
         return "any"
     elif is_string_dtype(x):
         return "string"
+    elif is_object_dtype(x):
+        return "object"
     else:
         return "any"
 
@@ -162,6 +164,8 @@ def convert_json_field_to_pandas_type(field):
     """
     typ = field["type"]
     if typ == "string":
+        return "string"
+    if typ == "object":
         return "object"
     elif typ == "integer":
         return "int64"

@jorisvandenbossche (Member)

That might be a backwards-incompatible change for people using strings in object dtype (now, I almost never use this myself, so I can't really assess the possible impact).

But what might be a better way forward is to add an additional entry to the field description. The spec allows adding more entries (https://frictionlessdata.io/specs/table-schema/#field-descriptors), and we already do this for e.g. categoricals (ordered keyword) and datetimes (tz keyword).
So we could think about storing some extra information that can better preserve the dtype.
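The extension point described here is already visible in the writer. A sketch showing the extra field entries pandas emits for categoricals and tz-aware datetimes (the column names are illustrative):

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", "b"], ordered=True),
        "ts": pd.date_range("2020-01-01", periods=2, tz="UTC"),
    }
)
fields = build_table_schema(df, index=False)["fields"]
for field in fields:
    print(field)
# The "cat" field carries ordered=True (plus an enum constraint) and the
# "ts" field carries tz='UTC', so the same mechanism could in principle
# record enough information to restore an extension dtype.
```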

WillAyd (Member) commented Feb 19, 2020

I think this has been a pretty enlightening discussion. I for one didn't realize that we just wrote "string" in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design like this:

# Writes string for the dtype, but doesn't write strings in JSON
>>> pd.Series([1], dtype=object).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'

And this

# Same as above, but doesn't preserve key anyway
>>> pd.Series([{1: 2}]).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":{"1":2}}]}'

It would be tough to manage backwards compatibility, but I would support using "string" appropriately on the writing side, and perhaps separate metadata for object dtypes, which would probably solve a lot of round-tripping issues.

Dr-Irv (Contributor) commented Dec 1, 2021

At least for some of the built-in extension dtypes, you can use convert_dtypes() to do the conversion:

import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4'],
    },
    dtype='string',
)

print(df.info())
for orient in ['table','split','records','values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient).convert_dtypes()
    print(f'======{orient}======')
    print(rdf.info())

produces

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       0 non-null      Int64
 1   2       0 non-null      Int64
dtypes: Int64(2)
memory usage: 52.0 bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 48.0 bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       2 non-null      string
 1   1       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

@jmg-duarte (Contributor)

@jorisvandenbossche moving part of the discussion here and addressing your last point in #20612 (comment):

Not only am I interested in making this work for dates, I also need it to.
I would like to know what I can do to make this happen.

jorisvandenbossche (Member) commented Dec 1, 2021

Maybe you can create a new issue specifically about supporting "date" for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

Dr-Irv (Contributor) commented Dec 1, 2021

Maybe you can create a new issue specifically about supporting "date" for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

I agree with this. I have some comments/questions in response to #20612 (comment) that would be better handled in a separate issue about dates.
