Merge nonstring columns #46879

ericman93 · 2022-04-26T20:05:57Z

closes BUG: column labels converted to string in merge #46885
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pep8speaks · 2022-04-26T20:06:00Z

Hello @ericman93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-13 09:36:13 UTC

WillAyd · 2022-04-26T23:16:06Z

Thanks for the PR. Is there an open issue for this?

ericman93 · 2022-04-27T08:21:07Z

@WillAyd not that I could find

mroeschke · 2022-04-27T19:03:25Z

Thanks for the PR. As mentioned, for changes like this, it's best that the PR is associated with an existing issue, since the motivation and context for this change are unclear.

ericman93 · 2022-04-27T20:46:35Z

@mroeschke here is the issue #46885

jreback

i'm kind of -0.5 on this. i understand the motivation but this is a non-standard case and i don't see a good reason to support it.

WillAyd · 2022-04-28T02:01:04Z

@ericman93 reading through your original issue, does passing suffixes=None not do what you want?

ericman93 · 2022-04-28T05:33:45Z

@WillAyd it does, but passing suffixes=None with overlapping columns will raise an exception:

    if not lsuffix and not rsuffix:
        raise ValueError(f"columns overlap but no suffix specified: {to_rename}")

attack68 · 2022-04-28T16:44:45Z

I don't think this is the best way of doing this, since it might mess with a lot of types, e.g. complex. However, one generic change might be to implement:

    if x in to_rename and suffix is not None:
        try:
            return x + suffix
        except TypeError:
            return f"{x}{suffix}"

This would then preserve types such as string or int, and if you wanted you could overload your own class so that the combination of the suffix through addition was defined:

    class Column:
        def __init__(self, name):
            self.name = name

        def __add__(self, arg):
            # assumes the suffix passed in is itself a Column
            return Column(name=self.name + arg.name)

For example, if your index was tuples, say (ix1, ix2), you could define a suffix (1,) and the addition would result in a tuple, (ix1, ix2, 1).
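To make the suggestion above concrete, here is a minimal sketch of such a renamer in plain Python. The function name `add_suffix` and the sample labels are made up for illustration; this is not the code from the PR or from pandas itself.

```python
def add_suffix(label, suffix):
    """Hypothetical helper: append `suffix` with `+`, falling back to
    string formatting when the two types cannot be added."""
    if suffix is None:
        return label
    try:
        return label + suffix
    except TypeError:
        return f"{label}{suffix}"


# str + str keeps a string label
print(add_suffix("a", "_x"))
# int + str raises TypeError, so the label falls back to a string
print(add_suffix(0, "_x"))
# tuple + tuple keeps a tuple label
print(add_suffix(("ix1", "ix2"), (1,)))
# int + int is plain arithmetic, which may or may not be what is wanted
print(add_suffix(0, 1))
```

The last call shows the case raised later in the thread: with a non-string suffix, `+` on integers is arithmetic rather than label mangling.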

doc/source/whatsnew/v1.5.0.rst

ericman93 · 2022-04-28T17:40:16Z

@attack68 I think that trying x + suffix will have weird and unwanted behavior with integer columns

attack68 · 2022-04-28T17:42:39Z

why?

ericman93 · 2022-04-28T17:47:13Z

ha never mind, you are right
I thought about a scenario where the suffix might be int, so + will return the math value of the addition
but suffix is string only
I'll change the code to your suggestion

attack68 · 2022-04-28T17:49:43Z

even when non-string suffixes are given and it might add ints in weird ways, one could always input a string suffix and this reverts to past behaviour anyway. there might be cases where integer addition could be useful too

ericman93 · 2022-04-28T17:58:04Z

don't you think having try-catch will make things slower?
all numeric columns for example will raise an exception and will have worse performance
maybe passing callable as a suffix instead of renamer is a better approach?

attack68 · 2022-04-29T16:19:39Z

you can measure this yourself, but suffice it to say the try/except on an int and string catching a TypeError is 3x slower, but the operation only costs 500ns more. Even if you had 1 million columns this would only be worth 0.5 seconds of performance degradation. And if you had 1 million columns you could supply suffixes with an appropriate format to not raise a TypeError (and so avoid any degradation at all) - or do this in a much more efficient way altogether.
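One way to take that measurement is sketched below with the standard-library `timeit` module. The function names `with_try` and `always_format` are hypothetical, and the absolute numbers will vary by machine and Python version, so no expected timings are shown.

```python
import timeit


def with_try(label, suffix):
    # try `+` first; fall back to string formatting on TypeError
    try:
        return label + suffix
    except TypeError:
        return f"{label}{suffix}"


def always_format(label, suffix):
    # string formatting only, with no exception handling involved
    return f"{label}{suffix}"


n = 100_000
for func in (with_try, always_format):
    # an int label with a str suffix exercises the TypeError path
    seconds = timeit.timeit(lambda: func(0, "_x"), number=n)
    print(f"{func.__name__}: {seconds / n * 1e9:.0f} ns per call")
```

Multiplying the per-call difference by the number of overlapping columns gives the total overhead, which is the arithmetic used in the comment above (500 ns times 1 million columns is 0.5 seconds).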

ericman93 · 2022-05-02T19:43:58Z

@attack68 fixed according to your suggestion

doc/source/whatsnew/v1.5.0.rst

simonjayhawkins · 2022-05-04T11:14:32Z

@WillAyd it does, but passing suffixes=None with overlapping columns will raise an exception:
    if not lsuffix and not rsuffix:
        raise ValueError(f"columns overlap but no suffix specified: {to_rename}")

IMO allowing None to allow duplicate columns gives more user control than the changes here as stated in the issue #46885 (comment) ...

Regards the duplication, IMO its ok to have 2 identical columns and let the user decide how to handle it by his own

However, for even more user control, it may be worth considering allowing a renamer function or dictionary to be passed to suffixes instead.

otherwise agree with #46879 (review)

ericman93 · 2022-05-04T11:51:33Z

@simonjayhawkins removing the validation for none suffix is the easiest way to go, but I think I'll change Suffixes to have a tuple of Optional[Union[Callable, str]]
what do you think?
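A rough sketch of how a suffix typed as `Optional[Union[Callable, str]]` could be resolved is shown below. The names `Suffix`, `apply_suffix` and `rename_overlap` are hypothetical and the logic is only an illustration of the proposal, not the PR's implementation.

```python
from typing import Callable, Hashable, Optional, Union

# Hypothetical alias for a single element of the `suffixes` tuple
Suffix = Optional[Union[Callable[[Hashable], Hashable], str]]


def apply_suffix(label: Hashable, suffix: Suffix) -> Hashable:
    """None keeps the label, a callable maps it, a string is appended."""
    if suffix is None:
        return label
    if callable(suffix):
        return suffix(label)
    return f"{label}{suffix}"


def rename_overlap(labels, overlap, suffix: Suffix):
    # only labels present on both sides are renamed
    return [apply_suffix(x, suffix) if x in overlap else x for x in labels]


left = rename_overlap([0, 1, "key"], {0}, lambda x: (x, "left"))
right = rename_overlap([0, 2, "key"], {0}, "_right")
print(left)   # [(0, 'left'), 1, 'key']
print(right)  # ['0_right', 2, 'key']
```

With a callable the caller decides the resulting label type; with `None` the label is left alone, which is where duplicate labels can appear.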

jreback

looks good. pls move the note and ping on green.

jreback · 2022-05-20T12:48:13Z

doc/source/whatsnew/v1.5.0.rst

@@ -796,6 +796,8 @@ Reshaping
 - Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`)
 - Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`)
 - Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`)
+- Bug in :func:`merge` and :meth:`DataFrame.merge` now allows passing ``None`` or ``(None, None)`` for ``suffixes`` argument, keeping column labels unchanged in the resulting :class:`DataFrame` potentially with duplicate column labels (:issue:`46885`)


let's move these to other enhancements section (change the text as these are not bugs per se, rather a change in api)

you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.

moved to enhancements

jreback · 2022-05-20T12:50:09Z

you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.

ericman93 · 2022-05-20T17:47:05Z

@jreback it's all green
but I want to see what @simonjayhawkins has to say about the default value of the suffixes in join

ericman93 · 2022-06-06T21:10:09Z

@jreback do we still want to wait for @simonjayhawkins response or should we push it?

jreback · 2022-06-06T21:11:07Z

if u rebase we can look again

ericman93 · 2022-06-07T06:46:02Z

@jreback done

github-actions · 2022-07-14T00:06:33Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

datapythonista

I'm personally -1 on this. It adds even more complexity and more things that can fail to a part of pandas that is already tricky, that it's supporting non-string columns, and allowing duplicate columns.

I agree that automatically casting columns to strings to add suffixes is not a good practice. But instead of adding this extra complexity, I'd prefer to simply raise an exception if non-string duplicate column names are going to be generated as a result of an operation. So the user can decide what's best for their case, and write their code accordingly.

simonjayhawkins · 2022-07-22T12:05:04Z

I'm personally -1 on this. It adds even more complexity and more things that can fail to a part of pandas that is already tricky, that it's supporting non-string columns, and allowing duplicate columns.

I think that was generally the reason for not doing this before.

non-string columns and duplicate columns are legitimate supported features of pandas and so I think they should be supported and work globally with all pandas methods.

But this is probably not easily achieved (with discussion) in a single PR (with suggested follow-ups) and without considering the behavior of other pandas methods.

So to achieve this goal, a PDEP or two may be required, say "Supporting non-string column labels globally and consistently" and "Supporting duplicate column labels globally and consistently", where extensive background material is collated to describe the current behavior (good and bad) and to ensure that a universally consistent approach is adopted to, say, the api, index sorting, label mangling, exception raising and more.

I agree that automatically casting columns to strings to add suffixes is not a good practice. But instead of adding this extra complexity, I'd prefer to simply raise an exception if non-string duplicate column names are going to be generated as a result of an operation. So the user can decide what's best for their case, and write their code accordingly.

This is basically the status quo, so am happy to +1 this for now (the quoted comment, not the PR, sorry for adding potential confusion).

datapythonista

Fair enough. I think the discussion we should have is whether we want to continue to support duplicated and non-string column names. And this is unrelated to this PR.

I may make a PDEP proposal to discuss it when the PDEP framework is ready. Would be good to know what are the use cases for duplicate and non-string column names. If there are other ways to deal with them, and if it's worth the extra complexity.

Still not a big fan of what's introduced here, but feel free to move forward with this PR.

simonjayhawkins · 2022-07-22T12:59:07Z

Still not a big fan of what's introduced here, but feel free to move forward with this PR.

the +1 was for your comment, not the PR, sorry for adding potential confusion.

bashtage · 2022-07-22T14:12:35Z

I've recently been working on pandas-stubs and have discovered that the claimed Hashable | None behavior of columns is very flaky, at least as far as the loc indexer goes. I was trying to test a large range of hashable types beyond the usual (str, int, etc), and tried frozenset. It was very difficult to use this object:

    import pandas as pd

    a = frozenset(["a", "a"])
    ab = frozenset(["a", "a", "b"])
    df = pd.DataFrame({a: [1], ab: [3.0]})
    df[a]  # OK
    df.loc[:, a]

KeyError: "None of [Index(['b', 'a'], dtype='object')] are in the [columns]"

The non-robustness of operations makes me wonder if encouraging more arbitrary types (that are hashable) is a good idea.

bashtage · 2022-07-22T14:17:02Z

I may make a PDEP proposal to discuss it when the PDEP framework is ready. Would be good to know what are the use cases for duplicate and non-string column names.

For duplicates, one case is repeated measurements from some labeled entity. For example, suppose you get 10 measurements of type "a" and 7 of type "b" where each has a set of characteristics. If there is no other index information, then the natural set of columns is ["a"]*10 + ["b"]*7. Of course, one could use a MultiIndex of the form ("a", 0), ("a", 1), ..., ("b", 0), but this seems like overkill.

As for non-string column names, surely integer column names are both useful and common. I would also think dates have obvious utility.
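The two labelling schemes from the repeated-measurements case above can be compared with a small plain-Python sketch. It only builds the label lists (no DataFrame), and the variable names are made up for illustration.

```python
from collections import Counter

# flat labels with duplicates, as in the repeated-measurements case
flat = ["a"] * 10 + ["b"] * 7

# tuple labels of the form (label, running index), the MultiIndex-style option
seen = Counter()
paired = []
for label in flat:
    paired.append((label, seen[label]))
    seen[label] += 1

print(len(flat), len(set(flat)))      # 17 2
print(paired[0], paired[9], paired[10], paired[-1])
# ('a', 0) ('a', 9) ('b', 0) ('b', 6)
```

The flat form keeps the group label directly usable for grouping, while the tuple form makes every label unique at the cost of an extra level.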

bashtage · 2022-07-22T14:18:31Z

This is basically the status quo, so am happy to +1 this for now (the quoted comment, not the PR, sorry for adding potential confusion).

I also agree that raising and kicking it back to the user to address is the cleanest approach.

datapythonista · 2022-07-22T16:46:10Z

Thanks @bashtage, very useful info. I think it may still be worth going deeper into those use cases. What you say makes perfect sense, but I wonder what the whole pipeline could look like. A dataframe with ten A columns and seven B columns can make sense as you say. But what would a user do with it? Use columns one at a time? Then maybe adding suffixes to disambiguate is better. Or would the user group them and compute an average? In that case, maybe supporting an option to stack/unstack when duplicate columns exist in the source would be more useful to the user than having duplicate column names.

We may reach the conclusion that allowing duplicates as we do now is the best we can do. But since there is an obvious trade-off, and a lot of added complexity, it may be useful to do an exercise of trying to better understand specific use cases, and what would be the best in every case.

jbrockmendel · 2022-07-23T17:22:38Z

non-string columns and duplicate columns are legitimate supported features of pandas and so I think they should be supported and work globally with all pandas methods.

Agreed.

bashtage · 2022-07-24T23:45:29Z

Thanks @bashtage, very useful info. I think it may still be worth going deeper into those use cases. What you say makes perfect sense, but I wonder what the whole pipeline could look like. A dataframe with ten A columns and seven B columns can make sense as you say. But what would a user do with it? Use columns one at a time? Then maybe adding suffixes to disambiguate is better. Or would the user group them and compute an average

Grouping them to perform statistical analysis based on the properties of the group. For example, repeated time series measurements in a series of experiments where the only time that matters is time since the start of the experiment run. Different treatments would get different labels, but the time series would be identical otherwise. The natural form for this data would be run label in columns, and relative time in rows (to me, at least).

Another place where duplicates arise is when dropping levels from MultiIndex data. Again, this is sometimes done to group by for statistical analysis within groups.

mroeschke · 2022-08-09T01:29:22Z

Thanks for your effort so far @ericman93, but it appears there's not full buy-in for this feature yet and it looks like it should be discussed further in the linked issue, so closing for now. We can base a future PR off this one if there's more buy-in.

ericman93 added 3 commits April 25, 2022 13:37

merge non string columns

4c7722e

fix test

cb22a1e

add new line

7342a72

ericman93 added 3 commits April 26, 2022 23:06

fix lint

ac929ea

fix black

5f3f90e

fix black for real

496419d

ericman93 mentioned this pull request Apr 27, 2022

BUG: column labels converted to string in merge #46885

jreback requested changes Apr 27, 2022

attack68 reviewed Apr 28, 2022

pr comments

4af2f89

ericman93 added 2 commits May 2, 2022 20:10

__add__ for adding suffix

9fef0f3

fix flake

6aa110e

simonjayhawkins requested changes May 4, 2022

ericman93 added 2 commits May 20, 2022 11:53

none isn't aloweed

ce72b94

merge from main

5522ad6

jreback requested changes May 20, 2022

jreback added this to the 1.5 milestone May 20, 2022

jreback requested changes May 20, 2022

move to enhancements

f62a084

merge from main

4b1df82

merge from main

461d5b6

github-actions bot added the Stale label Jul 14, 2022

datapythonista removed the Stale label Jul 22, 2022

datapythonista requested changes Jul 22, 2022

datapythonista reviewed Jul 22, 2022

mroeschke removed this from the 1.5 milestone Aug 1, 2022

simonjayhawkins added the Needs Discussion and Closing Candidate labels Aug 2, 2022

mroeschke closed this Aug 9, 2022
