Allow join on different types if upcast is safe #15338

Closed
CaselIT opened this issue Mar 27, 2024 · 7 comments · Fixed by #20332
Labels
A-ops Area: operations accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@CaselIT
Contributor

CaselIT commented Mar 27, 2024

Description

It would be nice to allow joining on keys of different data types when the upcast is safe, for example i{8,16,32} -> i64, u{8,16,32} -> u64, etc.

Example:

import polars as pl

dfI32 = pl.DataFrame({'a': [1,2,3], 'b': list('abc')}).cast({'a': pl.Int32})
dfI64 = pl.DataFrame({'a': [1,2,3], 'c': list('def')})
dfI64.join(dfI32, on='a') # a would keep the Int64 type
dfI32.join(dfI64, on='a') # a would become Int64 (or this could depend on the join type)

Currently, both of these raise an exception like:

ComputeError: datatypes of join keys don't match - `a`: i64 on left does not match `a`: i32 on right
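
The upcast rule proposed in the description can be sketched in plain Python. This is only an illustration: `safe_join_dtype` is a hypothetical helper, not part of the Polars API, and the handling of mixed signed/unsigned keys is an assumption that goes beyond what the issue asks for.

```python
import re
from typing import Optional, Tuple


def _parse(dtype: str) -> Tuple[bool, int]:
    """Split a short dtype name such as 'i32' into (is_signed, bit_width)."""
    m = re.fullmatch(r"([iu])(8|16|32|64)", dtype)
    if m is None:
        raise ValueError(f"not an integer dtype: {dtype!r}")
    return m.group(1) == "i", int(m.group(2))


def safe_join_dtype(left: str, right: str) -> Optional[str]:
    """Return the narrowest integer dtype that holds every value of both
    inputs, or None when no 64-bit-or-smaller integer type can."""
    l_signed, l_bits = _parse(left)
    r_signed, r_bits = _parse(right)
    if l_signed == r_signed:
        # Same signedness: the wider of the two types is always safe.
        return ("i" if l_signed else "u") + str(max(l_bits, r_bits))
    # Mixed signedness (assumption): a signed type must be strictly wider
    # than the unsigned side to hold all of its values.
    s_bits = l_bits if l_signed else r_bits
    u_bits = r_bits if l_signed else l_bits
    bits = max(s_bits, 2 * u_bits)
    return f"i{bits}" if bits <= 64 else None


print(safe_join_dtype("i32", "i64"))  # the pair from the example above
```

Under this sketch the i32/i64 pair from the example resolves to i64, while a pair such as u64/i64 has no safe common integer type and would still have to error.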
@CaselIT CaselIT added the enhancement New feature or an improvement of an existing feature label Mar 27, 2024
@stinodego stinodego added the A-ops Area: operations label Mar 27, 2024
@CaselIT
Contributor Author

CaselIT commented Apr 5, 2024

This could also be a keyword argument that defaults to False, so that Polars by itself does no type conversion, but users can opt into safe upcasts if they want to.

@tfiasco

tfiasco commented Dec 11, 2024

Is there any progress on this feature? In our company, many data types are uncertain, which has become an obstacle for me in promoting Polars internally.

In Pandas, this operation succeeds without error:

import pandas as pd

df1 = pd.DataFrame({'a': list(range(5))})
df2 = df1.copy()
df2['a'] = df1['a'].astype("UInt8")
result = pd.merge(df1, df2, on='a')

Pandas handles the type mismatch by implicitly casting the join keys, allowing the merge to succeed.

@ritchie46
Member

ritchie46 commented Dec 12, 2024

We will look into this. Indeed, between integers and integers it is safe (floats not so much).
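
One likely reason floats are less safe, shown as a small plain-Python sketch (an interpretation, not something stated in the thread): a 64-bit float has a 53-bit significand, so distinct 64-bit integer keys can collapse to the same value once cast. The helper name `collide_as_f64` is made up for illustration.

```python
def collide_as_f64(x: int, y: int) -> bool:
    """True when two distinct integers become equal after conversion to a
    64-bit float (Python's float is an IEEE 754 double)."""
    return x != y and float(x) == float(y)


a = 2**53        # largest power of two below which every integer is exact
b = 2**53 + 1    # not representable as a double, rounds to 2**53

print(collide_as_f64(a, b))  # two different i64 keys, one f64 value
print(collide_as_f64(1, 2))  # small integers stay distinct
```

A join that silently cast such integer keys to floats could therefore match rows whose original keys differ, which an integer-to-wider-integer upcast can never do.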

@nameexhaustion nameexhaustion self-assigned this Dec 13, 2024
@coastalwhite
Collaborator

Is there maybe a way to instead have an align_types expression? I think that would be more scalable and explicit than doing this.

@ritchie46
Member

Is there maybe a way to instead have an align_types expression?

That would need to be called beforehand and return a struct. (If it were done in the separate expressions, it would be way too magic for my liking.) I think this is fine, as we type-resolve more operations.

@landisrm

Yes, please can we have this! I'm having a tough time mixing pl.read_database calls with pulls using duckdb; one returns my id columns in i32 and the other in i64

@leoliu0

leoliu0 commented Dec 14, 2024

This would be a great feature. In my case, I wrote a function to cast all integer columns to i64 as an initial step to circumvent this issue before performing any further operations.

9 participants