Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-sort column order when concat vertical and vertical_relaxed #9891

Closed
stevenlis opened this issue Jul 14, 2023 · 19 comments · Fixed by #11597
Closed

Auto-sort column order when concat vertical and vertical_relaxed #9891

stevenlis opened this issue Jul 14, 2023 · 19 comments · Fixed by #11597
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@stevenlis
Copy link

Problem description

As of 0.18.7, Polars will return an error if columns' order does not align with one another when concat vertically.

import polars as pl

df1 = pl.DataFrame({'a': [1, 2], 'b': [2, 1]})
df2 = pl.DataFrame({'b': [1, 2], 'a': [None, 1]})
pl.concat([df1, df2], how='vertical')
# ShapeError: unable to vstack, column names don't match: "a" and "b"
pl.concat([df1, df2], how='vertical_relaxed')
# ComputeError: schema names differ: got a, expected b

It would be great if polars could sort the columns itself.

@stevenlis stevenlis added the enhancement New feature or an improvement of an existing feature label Jul 14, 2023
@stevenlis
Copy link
Author

It might be worth mentioning this requirement in the documentation in the meantime.

@avimallu
Copy link
Contributor

This might help you from previous discussions: #7866 (comment)

@ritchie46
Copy link
Member

It would be great if polars could sort the columns itself.

And a great source of silent bugs.

If you want them sorted, why not do it yourself?

columns = df1.columns
pl.concat([df.select(columns) for df in [df1, df2]])

This has been asked before and this remains my answer, as this is super solvable as the snippet above shows.

@stinodego
Copy link
Member

I will clarify this in the docs and then close these issues as not planned.

@stinodego stinodego self-assigned this Jul 15, 2023
@magarick
Copy link
Contributor

I'm happy to do this unless you have a reason not to want it. It's much more conducive to how a lot of people work to have this as an option. I like the way data.table does it where everything is opt-in and the default is very strict. Having to do 3 operations (get the columns, sort the columns for each one, concat) is just a slight bit of cognitive burden that we don't need.

@ritchie46
Copy link
Member

Opt-in sorting seems ok. What I am wary of, is cases where schema width differs.

@magarick if you want to add opt-in sorting (or using the first df as guiding schema), can you add that logic on the rust side?

@magarick
Copy link
Contributor

Yeah, happy to do it.
The way that data.table handles it, which I like, is that by default it matches on the names and it will yell at you if they aren't all equal. You can opt in to filling columns not present in all tables will nulls if you want, but that's off by default, as it should be. Here, I assume you're also being strict on column types, which maybe should also have an option to cast up.

But to be very clear, the default is that all names and types have to match. In that case I think sorting is fine by default, but I'm ok being extra strict and requiring a perfect match with no flags set.

@ritchie46
Copy link
Member

I assume you're also being strict on column types, which maybe should also have an option to cast up.

Yes, that's what strategy vertical_relaxed

@magarick hold your horses!

I realize we already have this. 🙈

Strategy diagonal will solve this.

We could have a diagonal_relaxed as well. That one would then also upcast columns to their supertype.

@stinodego stinodego removed their assignment Jul 15, 2023
@stinodego
Copy link
Member

So we won't change the vertical and vertical_relaxed options, but an implementation for diagonal_relaxed is welcome.

@stinodego stinodego added the accepted Ready for implementation label Jul 15, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 15, 2023
@stevenlis
Copy link
Author

It's getting a bit messy. If the doc allows, I hope we could put a chart like the following to help users to find their desired how

Data Type Match Column Order Match
vertical Y Y
vertical_relaxed N Y
diagonal Y N
diagonal_relaxed N N

@stinodego
Copy link
Member

Maybe we should have a separate boolean relaxed parameter?

@ritchie46
Copy link
Member

ritchie46 commented Jul 16, 2023

Maybe we should have a separate boolean relaxed parameter?

I don't like arguments influencing each other. Especially if you have cases where the relaxed isn't supported.

Currently that is the case with diagonal. What id we get a new strategy for which relaxed isn't supported?

Having arguments as sum types (an enum) of strategies keeps the arguments orthogonal.

@ritchie46
Copy link
Member

It's getting a bit messy. If the doc allows, I hope we could put a chart like the following to help users to find their desired how
Data Type Match Column Order Match
vertical Y Y
vertical_relaxed N Y
diagonal Y N
diagonal_relaxed N N

Not that diagonal allows for missing columns as well.

@stevenlis
Copy link
Author

stevenlis commented Jul 16, 2023

If I understand you correctly, maybe something like this?

                           Required Match
Data Type Column Order DataFrame Width
vertical Y Y Y
vertical_relaxed N Y Y
diagonal N Y N (outer join)
diagonal_relaxed N N N (outer join)

@magarick
Copy link
Contributor

Looking at this more, I think a larger rework could be better.

  1. Combining by row (vertical) and by column (horizontal) in one function seems weird to me. They're different operations and have potentially different cost implications (if I'm not mistaken you can tack on equal length series for free).
  2. Strategy names: This is a great example where you have full names and no abbreviation but I have no idea what things mean. It's possible "relaxed" and "diagonal" are terms of art somewhere I'm not familiar with (database land). Regardless, I think it's clearer to have them as two parameters, one for whether you want to fill missing columns, and one for upcasting.
  3. Horizontal/columnwise behavior could use some fleshing out, most obviously by giving more options for handling series of unequal length.

I'm happy to work on this if y'all are alright with something more in depth.

@ritchie46
Copy link
Member

"diagonal" does more than the schema order. It grows in a diagonal matter. New columns are added if they don't exist. So you grow horizontal and vertically -> diagonal.

I think that makes sense. :)

"relaxed" is mean to be the opposite of "strict". Open to a better name of that opposite.

@stevenlis
Copy link
Author

@ritchie46 This diagonal_relaxed appears to be the only missing feature at the moment and occasionally confuses me. 😅

@alexander-beedie
Copy link
Collaborator

This diagonal_relaxed appears to be the only missing feature at the moment and occasionally confuses me. 😅

@stevenlis: "diagonal_relaxed" will be available as a first-class strategy option in the upcoming 0.19.8 ;)

@stevenlis
Copy link
Author

@alexander-beedie Thanks! I hope you could also consider adding the table above to the document because it's becoming a bit complicated, or perhaps finding another way to differentiate them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
6 participants