String dtype: enable in SQL IO + resolve all xfails #60255
```diff
@@ -45,6 +45,8 @@
 from pandas.core.dtypes.common import (
     is_dict_like,
     is_list_like,
+    is_object_dtype,
+    is_string_dtype,
 )
 from pandas.core.dtypes.dtypes import (
     ArrowDtype,
@@ -58,6 +60,7 @@
     Series,
 )
 from pandas.core.arrays import ArrowExtensionArray
+from pandas.core.arrays.string_ import StringDtype
 from pandas.core.base import PandasObject
 import pandas.core.common as com
 from pandas.core.common import maybe_make_list
```
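(Context for readers outside the pandas codebase: `using_string_dtype()`, consulted throughout the hunks below, is the internal switch from `pandas._config` that reports whether the future string dtype is enabled. A minimal sketch, assuming pandas 2.x where it mirrors the `future.infer_string` option:)

```python
import pandas as pd
from pandas._config import using_string_dtype

# using_string_dtype() reflects the future.infer_string option, which
# controls whether object data holding strings gets the string dtype.
with pd.option_context("future.infer_string", True):
    print(using_string_dtype())  # True
print(using_string_dtype())      # False unless enabled globally
```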
```diff
@@ -1316,7 +1319,12 @@ def _harmonize_columns(
             elif dtype_backend == "numpy" and col_type is float:
                 # floats support NA, can always convert!
                 self.frame[col_name] = df_col.astype(col_type)
+            elif (
+                using_string_dtype()
+                and is_string_dtype(col_type)
+                and is_object_dtype(self.frame[col_name])
+            ):
+                self.frame[col_name] = df_col.astype(col_type)
             elif dtype_backend == "numpy" and len(df_col) == df_col.count():
                 # No NA values, can convert ints and bools
                 if col_type is np.dtype("int64") or col_type is bool:
```
**Reviewer:** Is the check for object dtype of the column needed? (Or might the string dtype already have been inferred when constructing the DataFrame?)

**Reviewer:** And, for example, a column of all missing values might get inferred as float by the DataFrame constructor? In that case maybe rather […]

**Author:** It is needed, although admittedly the patterns used here are not very clear. Without the `is_object_dtype` check […]. As an alternative, we might be able to infer the […]

**Reviewer:** Maybe it would be clearer then to use […] (this also makes me wonder if we should call […])

**Author:** I think we need to revert this to using […]. The problem is showing up in the latest CI failure because of the NUMERIC type in postgres, which is akin to the decimal data type in Python. The current SQL code treats that as an `object`.

**Reviewer:** How does this branch get activated in that case of decimals if there is a […]?

**Author:** Because:

```python
>>> from pandas.core.dtypes.common import is_string_dtype
>>> is_string_dtype(object)
True
```

Is that not expected?

**Reviewer:** Ah, yes, I constantly forget that […]. Ideally we have a check that just passes for actual string dtypes, but let's leave that for another time.
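A small sketch (not part of the diff) of why the `is_object_dtype` guard on the actual column matters: `is_string_dtype` is permissive and also passes for plain `object`, which is exactly what `_get_dtype` returns for types like postgres NUMERIC (read back as Python `Decimal` objects). Assumes a pandas version where `StringDtype` accepts `na_value`, as used in the diff.

```python
from decimal import Decimal

import numpy as np
import pandas as pd
from pandas.core.dtypes.common import is_object_dtype, is_string_dtype

# is_string_dtype passes for plain object, not just string dtypes.
assert is_string_dtype(object)

# A NUMERIC column read back from postgres holds Decimal objects, so the
# frame column is object dtype and col_type is `object` as well; the
# astype(col_type) in the branch is then a harmless no-op.
decimals = pd.Series([Decimal("1.5"), None])
assert is_object_dtype(decimals)

# For an actual text column, col_type is the pandas string dtype and the
# astype in the new branch upgrades the object column:
texts = pd.Series(np.array(["a", "b", None], dtype=object))
col_type = pd.StringDtype(na_value=np.nan)
if is_string_dtype(col_type) and is_object_dtype(texts):
    texts = texts.astype(col_type)
print(texts.dtype)  # the pandas string dtype, no longer object
```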
```diff
@@ -1403,6 +1411,7 @@ def _get_dtype(self, sqltype):
             DateTime,
             Float,
             Integer,
+            String,
         )

         if isinstance(sqltype, Float):
```
```diff
@@ -1422,6 +1431,10 @@ def _get_dtype(self, sqltype):
             return date
         elif isinstance(sqltype, Boolean):
             return bool
+        elif isinstance(sqltype, String):
+            if using_string_dtype():
+                return StringDtype(na_value=np.nan)

         return object
```
**Author:** I hard-coded the nan value here for backwards compatibility, but this might become a code smell in the future. Have we been trying to avoid this pattern @jorisvandenbossche, or is it expected for now?

**Reviewer:** Yes, this is the way I also did it elsewhere in code paths that need to construct the default string dtype (hard-code the nan value, and the storage gets determined automatically based on whether pyarrow is installed / based on the pd.options).

**Reviewer:** But, is this change required? It doesn't seem that this is actually used in the one place where […]

**Author:** It is required - I somehow did not commit the part that uses it in the caller. Stay tuned...
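For context, the hard-coded-NaN pattern being discussed looks roughly like this in isolation (a sketch, assuming a pandas version where `StringDtype` accepts `na_value`; output comments are indicative):

```python
import numpy as np
import pandas as pd

# na_value is pinned to np.nan for backwards compatibility with the
# historical object-dtype behavior; the storage ("python" vs "pyarrow")
# is resolved automatically from pd.options / pyarrow availability.
dtype = pd.StringDtype(na_value=np.nan)
print(dtype.storage)  # e.g. "pyarrow" when pyarrow is installed

s = pd.Series(["a", None], dtype=dtype)
print(s[1])           # nan (not pd.NA)
```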
```diff
@@ -2205,7 +2218,7 @@ def read_table(
         elif using_string_dtype():
             from pandas.io._util import arrow_string_types_mapper

-            arrow_string_types_mapper()
+            mapping = arrow_string_types_mapper()
         else:
             mapping = None
```
**Comment:** whoops :)
```diff
@@ -2286,6 +2299,10 @@ def read_query(
             from pandas.io._util import _arrow_dtype_mapping

             mapping = _arrow_dtype_mapping().get
+        elif using_string_dtype():
+            from pandas.io._util import arrow_string_types_mapper
+
+            mapping = arrow_string_types_mapper()
         else:
             mapping = None
```
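Roughly, what both branches produce is a `types_mapper` callable that later gets handed to pyarrow when materializing the result as a DataFrame. A toy stand-in (the hypothetical `toy_string_types_mapper` below, not the actual pandas helper) to show the mechanics:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Map pyarrow string types to the default pandas string dtype; return
# None for everything else so pyarrow keeps its own default conversion.
def toy_string_types_mapper(pa_type):
    if pa_type in (pa.string(), pa.large_string()):
        return pd.StringDtype(na_value=np.nan)
    return None

tbl = pa.table({"name": ["a", None], "n": [1, 2]})
df = tbl.to_pandas(types_mapper=toy_string_types_mapper)
print(df.dtypes)  # "name" gets the string dtype, "n" stays int64
```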
**Author:** I was surprised by the fact that `maybe_infer_to_datetimelike` would change an object dtype to a string dtype; this might be a cause of unexpected behavior elsewhere where it is used.

**Reviewer:** Yeah, that's a very confusing name for what it does (by accident?) in practice nowadays ..

**Reviewer:** (Not necessarily for now, but long term I think we should directly use `maybe_convert_objects` here instead of `maybe_infer_to_datetimelike` as a small wrapper around it.)

**Author:** Sounds good - I opened #60258 to keep track of this.
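For illustration, the inference being discussed can be observed through the public option (a sketch; `maybe_infer_to_datetimelike` itself is an internal helper in `pandas.core.dtypes.cast` that the constructor path exercises here, and on pandas 2.x this assumes pyarrow is available):

```python
import numpy as np
import pandas as pd

arr = np.array(["a", "b", None], dtype=object)

# With the future string dtype enabled, object data holding strings is
# inferred to the string dtype during construction rather than kept as
# object -- the surprise described in the thread above.
with pd.option_context("future.infer_string", True):
    s = pd.Series(arr)
print(s.dtype)  # a pandas string dtype, not object
```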