Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge nonstring columns #46879

Closed
wants to merge 26 commits into from

Conversation

ericman93
Copy link

@ericman93 ericman93 commented Apr 26, 2022

@pep8speaks
Copy link

pep8speaks commented Apr 26, 2022

Hello @ericman93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-13 09:36:13 UTC

@WillAyd
Copy link
Member

WillAyd commented Apr 26, 2022

Thanks for the PR. Is there an open issue for this?

@ericman93
Copy link
Author

@WillAyd not that I could find

@mroeschke
Copy link
Member

Thanks for the PR. As mentioned, for changes like this, it's best that this PR as associated with an existing issue since it's unclear the motivation and context for this change.

@ericman93
Copy link
Author

@mroeschke here is the issue #46885

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kind of -0.5 on this. i understand the motivation but this is non-standard case and i don't see a good reason to support it.

@WillAyd
Copy link
Member

WillAyd commented Apr 28, 2022

@ericman93 reading through your original issue does passing suffixes=None not do what you want?

@ericman93
Copy link
Author

ericman93 commented Apr 28, 2022

@WillAyd it is, but passing suffixes=None with overlap columns will raise an exception

    if not lsuffix and not rsuffix:
        raise ValueError(f"columns overlap but no suffix specified: {to_rename}")

@attack68
Copy link
Contributor

I don't think this is best way of doing this, since it might mess with a lot of types e.g. complex. However, one generic change might be to implement:

if x in to_rename and suffix is not None:
    try:
        return x + suffix
    except TypeError:
        return f"{x}{suffix}"

This would then preserve types such as string or int, and if you wanted you could overload your own class so that the combination of the suffix through addition was defined:

class Column:
    def __init__(self, name):
        self.name = name
    def __add__(self, arg):
        return Column(name=self.name+arg.name)

For example if you index was tuples say (ix1, ix2) you could define a suffix (1,) and the addition would result in a tuple, (ix1, ix2, 1).

@ericman93
Copy link
Author

@attack68 I think that trying x + suffix will have weird and unwanted behavior with interger columns

@attack68
Copy link
Contributor

why?

@ericman93
Copy link
Author

ha never mind, you are right
I thought about a scenario where the suffix might be int, so + will return the math value of the addition
but suffix is string only
I'll change the code to your suggestion

@attack68
Copy link
Contributor

even when non string suffixes are given and it might add ints that are weird, one could always input a string suffix and this reverts to past behaviour anyway. there might be cases where intger addition could be useful too

@ericman93
Copy link
Author

don't you think having try-catch will make things slower?
all numeric columns for example will raise an exception and will have worse performance
maybe passing callable as a suffix instead of renamer is a better approach?

@attack68
Copy link
Contributor

you can measure this yourself, but it suffices to say the try/except on an int and string catching a type error is 3x slower, but the operation only costs 500ns more. Even if you had 1 million columns this would only be worth 0.5seconds performance degradation. And if you had 1million columns you could supply suffixes with an appropriate format to not raise a TypeError (and so avoid any degradation at all) - or do this in a much more efficient way altogether.

@ericman93
Copy link
Author

@attack68 fixed according to your suggestion

doc/source/whatsnew/v1.5.0.rst Outdated Show resolved Hide resolved
@simonjayhawkins
Copy link
Member

@WillAyd it is, but passing suffixes=None with overlap columns will raise an exception

    if not lsuffix and not rsuffix:
        raise ValueError(f"columns overlap but no suffix specified: {to_rename}")

IMO allowing None to allow duplicate columns gives more user control than the changes here as stated in the issue #46885 (comment) ...

Regards the duplication, IMO its ok to have 2 identical columns and let the user decide how to handle it by his own

However, for even more user control, it may be worth considering allowing a renamer function or dictionary to be passed to suffixes instead.

otherwise agree with #46879 (review)

@ericman93
Copy link
Author

@simonjayhawkins removing the validation for none suffix is the easiest way to go, but I think I'll change Suffixes to have a tuple of Optional[Union[Callable, str]]
what do you think?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. pls move the note and ping on green.

@@ -796,6 +796,8 @@ Reshaping
- Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`)
- Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`)
- Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`)
- Bug in :func:`merge` and :meth:`DataFrame.merge` now allows passing ``None`` or ``(None, None)`` for ``suffixes`` argument, keeping column labels unchanged in the resulting :class:`DataFrame` potentially with duplicate column labels (:issue:`46885`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move these to other enhancements section (change the text as these are not bugs per se, rather a change in api)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to enhancements

@jreback jreback added this to the 1.5 milestone May 20, 2022
@@ -796,6 +796,8 @@ Reshaping
- Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`)
- Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`)
- Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`)
- Bug in :func:`merge` and :meth:`DataFrame.merge` now allows passing ``None`` or ``(None, None)`` for ``suffixes`` argument, keeping column labels unchanged in the resulting :class:`DataFrame` potentially with duplicate column labels (:issue:`46885`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.

@ericman93
Copy link
Author

@jreback its all green
but I want to see what @simonjayhawkins has to say about the default value of the suffixes in join

@ericman93
Copy link
Author

@jreback do we still want to wait for @simonjayhawkins response or should we push it?

@jreback
Copy link
Contributor

jreback commented Jun 6, 2022

if u rebase we can look again

@ericman93
Copy link
Author

@jreback done

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally -1 on this. It adds even more complexity and more things that can fail to a part of pandas that is already tricky, that it's supporting non-string columns, and allowing duplicate columns.

I agree that automatically casting columns to strings to add suffixes is not a good practice. But instead of adding this extra complexity, I'd prefer to simply raise an exception if non-string duplicate column names are going to be generated as a result of an operation. So the user can decide what's best for their case, and write their code accordingly.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Jul 22, 2022

I'm personally -1 on this. It adds even more complexity and more things that can fail to a part of pandas that is already tricky, that it's supporting non-string columns, and allowing duplicate columns.

I think that was generally the reason for not doing this before.

non-string columns and duplicate columns are legitimate supported features of pandas and so I think they should be supported and work globally with all pandas methods.

But to achieve this, it is probably not easily achieved (with discussion) in a single PR (with suggested follow ups) and without considering the behavior of other pandas methods.

So achieve this goal, a PDEP or two maybe required, say "Supporting non-string column labels globally and consistently" and "Supporting duplicate column labels globally and consistently" where extensive background material is collated to describe the current behavior (good an bad) and to ensure that a universally consistent approach is adopted to, say, the api, index sorting, label mangling, exception raising and more.

I agree that automatically casting columns to strings to add suffixes is not a good practice. But instead of adding this extra complexity, I'd prefer to simply raise an exception if non-string duplicate column names are going to be generated as a result of an operation. So the user can decide what's best for their case, and write their code accordingly.

This is basically the status quo, so am happy to +1 this for now (the quoted comment, not the PR, sorry for adding potential confusion).

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I think the discussion we should have is whether we want to continue to support duplicated and non-string column names. And this is unrelated to this PR.

I may make a PDEP proposal to discuss it when the PDEP framework is ready. Would be good to know what are the use cases for duplicate and non-string column names. If there are other ways to deal with them, and if it's worth the extra complexity.

Still not a big fan of what's introduced here, but feel free to move forward with this PR.

@simonjayhawkins
Copy link
Member

Still not a big fan of what's introduced here, but feel free to move forward with this PR.

the +1 was for your comment, not the PR, sorry for adding potential confusion.

@bashtage
Copy link
Contributor

I've recently been working pandas-stubs and have discovered that the claimed Hashable | None behavior of columns is very flakey, at least as far as the log indexer goes. I was trying to test a large range of hashable types beyond the usual (str, int, etc), and tried frozenset. It was very difficult to use this object

a = frozenset(["a","a"])
ab = frozenset(["a","a","b"])
import pandas as pd
df = pd.DataFrame({a:[1],ab:[3.0]})
df[a] # OK
df.loc[:, a]
KeyError: "None of [Index(['b', 'a'], dtype='object')] are in the [columns]"

The non-robustness of operations makes me wonder is encouraging more arbitrary types (that are hashable) is a good idea.

@bashtage
Copy link
Contributor

I may make a PDEP proposal to discuss it when the PDEP framework is ready. Would be good to know what are the use cases for duplicate and non-string column names.

For duplicates one case is repeated measurements from some labeled entity. For example, suppore you get 10 measurements of type "a" and 7 of type "b" where each has a set of characteristics. If there is not other index information, then the natural set of columns is ["a"]*10 + ["b"]*7. Of course, one could use a MultiIndex of the form ("a", 0), ("a", 1), ..., ("b", 0), but this seems like overkill.

As for non-string column names. surely integer column names are both useful and common. I would also think dates have obvious utility.

@bashtage
Copy link
Contributor

This is basically the status quo, so am happy to +1 this for now (the quoted comment, not the PR, sorry for adding potential confusion).

I also agree that raising and kicking it back to the user to address is the cleanest approach.

@datapythonista
Copy link
Member

Thanks @bashtage, very useful info. I think it may still be worth going deeper into those use cases. What you say makes perfect sense, but I wonder what the whole pipeline could look like. A dataframe with ten A columns and seven B columns can make sense as you say. But what would a user do with it? Use columns one at a time? Then maybe adding suffixes to disambiguate is better. Or would be user group them and compute an average? In that case, maybe supporting an option to stack/unstack when duplicate columns exist in the source would be more useful to the user than having duplicate column names.

We may reach the conclusion that allowing duplicates as we do now ia the best we can do. But since there is an obvious trade off, and a lot of added complexity, it may be useful to do an exercise of trying to better understand specific use cases, and what would be the best in every case.

@jbrockmendel
Copy link
Member

non-string columns and duplicate columns are legitimate supported features of pandas and so I think they should be supported and work globally with all pandas methods.

Agreed.

@bashtage
Copy link
Contributor

Thanks @bashtage, very useful info. I think it may still be worth going deeper into those use cases. What you say makes perfect sense, but I wonder what the whole pipeline could look like. A dataframe with ten A columns and seven B columns can make sense as you say. But what would a user do with it? Use columns one at a time? Then maybe adding suffixes to disambiguate is better. Or would be user group them and compute an average

Grouping them to perform statistical analysis based on the properties of the group. For example, repeated timsers measurements in a series of experiments where the only time that matters is time since start of the experiment run. Different treatments would get different labels, but the time series would be identical otherwise. The natural form for this data would be run label in columns, and relative time in rows (to me, at least).

Another place where duplicated arise is then dropping levels from MultiIndex data. Again, this is sometimes done to group by for statistical analysis within groups.

@mroeschke mroeschke removed this from the 1.5 milestone Aug 1, 2022
@simonjayhawkins simonjayhawkins added Needs Discussion Requires discussion from core team before further action Closing Candidate May be closeable, needs more eyeballs labels Aug 2, 2022
@mroeschke
Copy link
Member

Thanks for your effort so far @ericman93, but appears there's not full buy in for this feature yet and looks like it should be discussed further in the linked issue, so closing for now. We can base a future PR off this one if there's more buy in.

@mroeschke mroeschke closed this Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants