Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: dtlike.astype(inty) #49715

Merged
merged 6 commits into from
Jan 11, 2023

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@@ -329,6 +329,7 @@ Other API changes
- Default value of ``dtype`` in :func:`get_dummies` is changed to ``bool`` from ``uint8`` (:issue:`45848`)
- :meth:`DataFrame.astype`, :meth:`Series.astype`, and :meth:`DatetimeIndex.astype` casting datetime64 data to any of "datetime64[s]", "datetime64[ms]", "datetime64[us]" will return an object with the given resolution instead of coercing back to "datetime64[ns]" (:issue:`48928`)
- :meth:`DataFrame.astype`, :meth:`Series.astype`, and :meth:`DatetimeIndex.astype` casting timedelta64 data to any of "timedelta64[s]", "timedelta64[ms]", "timedelta64[us]" will return an object with the given resolution instead of coercing to "float64" dtype (:issue:`48963`)
- :meth:`DatetimeIndex.astype`, :meth:`TimedeltaIndex.astype`, :meth:`PeriodIndex.astype` :meth:`Series.astype`, :meth:`DataFrame.astype` with ``datetime64``, ``timedelta64`` or :class:`PeriodDtype` dtypes no longer allow converting to integer dtypes other than "int64", do ``obj.astype('int64', copy=False).astype(dtype)`` instead (:issue:`??`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right? eg you can use copy=False to actually do this? how so

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing copy=False here just avoids doing a double-copy, since the second .astype will always copy

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why an exception is only made for int64? (Assuming TimedeltaIndex.astype(int) is the nanos representation), TimedeltaIndex(["1ns"]) should be able to fit in a smaller int type

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@mroeschke
Copy link
Member

Also a suggestion in a follow up PR: Could you collect all the "dtype conversion" whatsnew notes in the Other API section and give them their own dedicated section like under Other API -> Dtype Consistency/Strictness? I think it would be good to explain to users that pandas is trying to move toward dtype conversion operations that are less lossy and more explicit.

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions Timedelta Timedelta data type Period Period data type Timestamp pd.Timestamp and associated methods labels Nov 16, 2022
@jbrockmendel
Copy link
Member Author

Curious why an exception is only made for int64? (Assuming TimedeltaIndex.astype(int) is the nanos representation), TimedeltaIndex(["1ns"]) should be able to fit in a smaller int type

My main concern is achieving consistency between Index/Series behavior. ATM Series raises for anything other than i8/u8 (see removed code in core.dtypes.astype). So the question is which way to go to achieve consistency. I prefer to go towards more-strict rather than less.

.astype in general means "the same information but in a different format" which does not hold for dtlike->inty. These are semantic gibberish 100% dependent on implementation details. Continuing to allow int64 is a compromise, though in a perfect world I'd tell users to just .view("i8") instead.

xref #45034, #38544, and parts of #22384.

@mroeschke
Copy link
Member

If I understand #45034 correctly, it's generally concluded that it's okay to disallow casting ints except for int64? cc @jorisvandenbossche

@jorisvandenbossche
Copy link
Member

If I understand #45034 correctly, it's generally concluded that it's okay to disallow casting ints except for int64?

That is not my reading of that issue .. I mentioned several times there that I personally don't see why such restriction is needed (while I also said that I don't think this is very important, though)

It's also doing something different than what we actually said would change in the deprecation warnings ("In a future version, this astype will return exactly the specified dtype instead of int64, and will raise if that conversion overflows.")

Continuing to allow int64 is a compromise, though in a perfect world I'd tell users to just .view("i8") instead.

In a perfect world, there would be no view() method on pandas objects (if you ask me ;))

@jbrockmendel
Copy link
Member Author

It's also doing something different than what we actually said would change in the deprecation warnings

yah, i reconsidered and changed my mind about that deprecation bc it is value-dependent behavior and in most cases just a very weird thing to .astype to and likely to be unintentional.

@mroeschke
Copy link
Member

Personally I don't have a strong opinion on which direction to go as this astyping is strange. I could maybe interpret a user trying to do this to try to get an epoch timestamp/total seconds(?)/not sure what for Period.

Agreed that changing this to achieve consistency between Series/Index is important though. I suppose I also like the stricter Series behavior to just signal that this astyping is ill defined

@jbrockmendel
Copy link
Member Author

im now thinking the most likely case where a user would pass something other than int64 is if they pass int which on some builds gets interpreted as int32. In this case they almost certainly wanted int64, and giving them int32 is going to throw them off (and make an unwanted copy).

we should disallow anything other than int64 (which is still more than pyarrow allows)

@jreback
Copy link
Contributor

jreback commented Dec 29, 2022

I am with @jbrockmendel here, I think we should only allow int64, this is super problematic with overflow and especially on windows.

@phofl
Copy link
Member

phofl commented Dec 31, 2022

I think int64 makes the most sense, int32 overflows real fast. I had similar issues to what @jbrockmendel mentioned while building a windows service in the past, this was super annoying. Only allowing int64 increases usability imo

This was referenced Jan 5, 2023
@jbrockmendel
Copy link
Member Author

@MarcoGorelli thoughts here?

@MarcoGorelli
Copy link
Member

@MarcoGorelli thoughts here?

No objections (no strong opinion either though, this isn't a topic I've delved into yet)

@mroeschke mroeschke added this to the 2.0 milestone Jan 9, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as is. Although it's slightly different from the original deprecation message the exception at least gives an alternative to convert to another int type

@mroeschke mroeschke merged commit 279138c into pandas-dev:main Jan 11, 2023
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the api-astype-dtlike-to-ints branch January 12, 2023 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Astype Dtype Conversions Unexpected or buggy dtype conversions Period Period data type Timedelta Timedelta data type Timestamp pd.Timestamp and associated methods
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants