Join cardinality validation shall not raise on multiple null values (that never produce matches) #19624

Closed
sibbiii opened this issue Nov 4, 2024 · 1 comment · Fixed by #19698
Labels: A-ops (Area: operations), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium)

sibbiii commented Nov 4, 2024

Description

According to the documentation, the validate parameter of polars.DataFrame.join checks whether
join keys are unique in the left/right/both datasets.

Consequently,

import polars as pl

left = pl.DataFrame({"a": [1, 2, None, None]})  # unique key, plus some missing data
right = pl.DataFrame({"a": [1, 1, 2, 2]})  # non-unique key

_ = left.join(right, how="left", on="a", join_nulls=False, validate="1:m")

fails with polars.exceptions.ComputeError: join keys did not fulfill 1:m validation,
but this is clearly a one-to-many join.

Note: nulls are not joined (join_nulls=False),
so the left dataframe's keys are unique once the nulls are set aside.

Use case

Real-world datasets often use null values for missing data (the example above is greatly simplified). On such datasets, cardinality validation cannot be used on a left join, because repeated nulls in the key column are enough to make it fail.

I googled a bit and tried to find similar issues but could not find one. I am also not sure if my question here makes sense (the current behaviour is as documented), but in my opinion the example above is clearly a one-to-many join with some missing data.

Expected behaviour

Exclude null values from the uniqueness check if join_nulls=False (null values will not produce matches).
The example above shall therefore not raise a ComputeError.

(Note: I am not sure whether the current behaviour exists only because pandas behaves this way.
The original issue that added join cardinality validation is #9263.)

@barak1412
Contributor

@orlp Could you assign this to me?
