Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: Standardize values coercion during Block initialization #19492

Closed
TomAugspurger opened this issue Feb 1, 2018 · 10 comments · Fixed by #37009
Closed

CLN: Standardize values coercion during Block initialization #19492

TomAugspurger opened this issue Feb 1, 2018 · 10 comments · Fixed by #37009
Labels
Clean Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 1, 2018

Splitting off from #19268 (which is starting to give me unicorns fairly often)

After that, all blocks eventually end up calling Block.__init__ (aside from ScalarBlock, which we'll ignore).

Most of the subclasses init methods are

  1. maybe coerce values to the expected dtype
  2. call super().__init__

It'd be nice to put 1 into favor of a _maybe_coerce_values method defined by each block, and remove all the subclasses init methods.

The one sticking point will be DatetimeTZBlock.__init__, which accepts a dtype parameter that no other block does.

def __init__(self, values, placement, ndim=2, dtype=None):
if not isinstance(values, self._holder):
values = self._holder(values)
if dtype is not None:
if isinstance(dtype, compat.string_types):
dtype = DatetimeTZDtype.construct_from_string(dtype)
values = values._shallow_copy(tz=dtype.tz)

We could maybe push that onto the callers, and then clean things up.

@TomAugspurger TomAugspurger added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation Difficulty Intermediate Clean labels Feb 1, 2018
@jbrockmendel
Copy link
Member

It'd be nice to put 1 into favor of a _maybe_coerce_values method defined by each block, and remove all the subclasses init methods.

+1

The one sticking point will be DatetimeTZBlock.init, which accepts a dtype parameter that no other block does.

If dtype were added to Block.__init__ then Block.__new__ could be defined more or less as internals.make_block is now.

Side-note bc I don't want to pile on extraneous suggestions in #19268: the recent addition of _maybe_validate_ndim is nice. Would it be cleaner to include the if ndim is None: ndim = values.ndim inside the method and have it return ndim, so __init__ just starts with self.ndim = self._maybe_validate_ndim(ndim, values)? Pretty much an aesthetic consideration.

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Feb 1, 2018
@jorisvandenbossche
Copy link
Member

We also need to take into account that for 'real' extension blocks (not ones we internally still adapt a little bit), can never do such "maybe coerce" logic, as they infer the _holder from the passed values.

It would probably be cleaner to make sure for our internal extension blocks we also only pass it correct data (so no "maybe coerce" is needed), but that might require a lot of cleaning up internally / many difficult to find cases.

If dtype were added to Block.init then Block.new could be defined more or less as internals.make_block is now.

I would then rather don't add dtype everywhere, and make sure we use make_block if dtype is needed, and the dtype is processed inside make_block and Block gets then the correct values type.

@TomAugspurger
Copy link
Contributor Author

Re extension blocks (internal or external), I'm finding it might be helpful to define an ExtensionBlock._maybe_coerce_values that just unwraps Series/Indexes. Currently this is logic is in form_blocks, which isn't the best place.

@jbrockmendel
Copy link
Member

I would then rather don't add dtype everywhere

@jorisvandenbossche I don't want to de-rail the thread here, so we can discuss this in #19444 if necessary.

The motivation is that to make Frame/Series/Index arithmetic/comparison ops consistent (#18824) they will need to share a single implementation. Focusing on the datetime/timedelta[/period] cases, the idea is to adapt the index implementations since they perform type/timezone checks most carefully/correctly.

The sticking point is that they call Index/DatetimeIndex/TimedeltaIndex, which we will need to replace with something more generic if we're mixing the implementations into the blocks. The solution I've been kicking around is to define something like:

class Index(...):
    @property
    def _base_constructor(self):
        return Index

class Block(...):
    @property
    def _base_constructor(self):
        return Block

and replace e.g. occurrences of return TimedeltaIndex(values, ...) with return self._base_constructor(values, ..., dtype='timedelta64[ns]'). (I guess Block._base_constructor could return internals.make_block just as easily...)

@jorisvandenbossche
Copy link
Member

I guess Block._base_constructor could return internals.make_block just as easily...

That what I wanted to response when reading up to there :-)

@jbrockmendel
Copy link
Member

@TomAugspurger closeable? if not, actionable?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 7, 2019

I think it can still be worthwhile to clean up coercing logic in the Block constructors (but don't know if something already happened on that front since the above discussion).

@jbrockmendel
Copy link
Member

(but don't know if something already happened on that front since the above discussion).

Thats the point, I think this has been pushed about as far as it can go (until 2D EAs!)

@jorisvandenbossche
Copy link
Member

Can you explain what you mean? What has been changed since the discussion above? (link to PR?)

@TomAugspurger
Copy link
Contributor Author

I think this can be closed when all Block subclasses can remove their init and just reuse Block.__init__.

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants