BUG: combine_first does not retain dtypes with Timestamp DataFrames #38145

Merged 23 commits into pandas-dev:master on Nov 30, 2020

Conversation

@arw2019 (Member) commented Nov 29, 2020

Picking up #35514
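
(For context, a minimal sketch, my own and not taken from the PR or the linked issue, of the kind of reproduction behind this fix: before the patch, combine_first could hand back an object column instead of keeping datetime64[ns].)

    import pandas as pd

    # Hypothetical reproduction (illustrative only, not from the PR/issue):
    # df1 has a missing value that combine_first fills from df2.
    df1 = pd.DataFrame({"a": [pd.Timestamp("2020-01-01"), pd.NaT]})
    df2 = pd.DataFrame({"a": [pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-03")]})

    result = df1.combine_first(df2)
    # Before this PR the column could come back as object;
    # with the fix it should stay datetime64[ns].
    print(result.dtypes)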

@jbrockmendel (Member) commented:

Why does combine_first use expressions.where instead of mgr.where? Casting should be handled correctly there.

@arw2019 (Member, Author) commented Nov 29, 2020

> Why does combine_first use expressions.where instead of mgr.where? Casting should be handled correctly there.

That line hasn't been touched since #17744, will look

@jbrockmendel (Member) commented:

Now that I have an actual keyboard: this would likely also require fixing Block.where #38073

@arw2019 (Member, Author) commented Nov 29, 2020

> Now that I have an actual keyboard: this would likely also require fixing Block.where #38073

OK, in that case this would sit on top of that.

I'm struggling a little with the pattern. At the moment I have:

    def combine_first(self, other: DataFrame) -> DataFrame:
        def combiner(x, y):
            mask = extract_array(isna(x))

            x_values = extract_array(x, extract_numpy=True)
            y_values = extract_array(y, extract_numpy=True)

            # If the column y in other DataFrame is not in first DataFrame,
            # just return y_values.
            if y.name not in self.columns:
                return y_values

            _where = self._mgr.where(
                y_values, mask, align=True, errors="raise", try_cast=True, axis=1
            )
            # extract array from _where

        return self.combine(other, combiner, overwrite=False)

Is this in the right ballpark?

@jreback added the Bug, Dtype Conversions, and Reshaping labels Nov 29, 2020
@jreback added this to the 1.2 milestone Nov 29, 2020
@jreback (Contributor) left a review comment:


> Now that I have an actual keyboard: this would likely also require fixing Block.where #38073

@jbrockmendel is the current patch not good for 1.2? (agreed it could use some refactoring).

"scalar1, scalar2",
[
(datetime(2020, 1, 1), datetime(2020, 1, 2)),
(pd.Period("2020-01-01", "D"), pd.Period("2020-01-02", "D")),
@jreback (Contributor) commented on this excerpt:

can add an Interval example here as well

@arw2019 (Member, Author):

Done
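
(For illustration, a rough sketch of what the extended parametrization might look like; the Interval entry, the test name, and the test body are my guesses, not the actual code added in this PR.)

    import pytest
    import pandas as pd
    from datetime import datetime

    @pytest.mark.parametrize(
        "scalar1, scalar2",
        [
            (datetime(2020, 1, 1), datetime(2020, 1, 2)),
            (pd.Period("2020-01-01", "D"), pd.Period("2020-01-02", "D")),
            # hypothetical Interval case, per the review suggestion above
            (pd.Interval(left=0, right=1), pd.Interval(left=0, right=2)),
        ],
    )
    def test_combine_first_retains_dtype(scalar1, scalar2):
        # After this PR, combine_first should keep the column's original dtype.
        df1 = pd.DataFrame({"a": [scalar1, scalar2]})
        df2 = pd.DataFrame({"a": [scalar2, scalar1]})
        result = df1.combine_first(df2)
        assert result["a"].dtype == df1["a"].dtype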

@jbrockmendel (Member) commented:

> is the current patch not good for 1.2? (agreed it could use some refactoring).

This looks good for now

@jreback (Contributor) commented Nov 29, 2020

@arw2019 ok small test comment. ping on green.

@arw2019 closed this Nov 30, 2020
@arw2019 reopened this Nov 30, 2020
@jreback merged commit aad85ad into pandas-dev:master Nov 30, 2020
@jreback (Contributor) commented Nov 30, 2020

thanks @arw2019

@jbrockmendel (Member) commented:

@arw2019 do you have time to try to make this go through self._mgr.where? (no hurry)

@arw2019 (Member, Author) commented Dec 1, 2020

> @arw2019 do you have time to try to make this go through self._mgr.where? (no hurry)

Yes, in the works.

Successfully merging this pull request may close these issues.

combine_first returns unexpected results for timestamp dataframes