join fails because of an uninstructed cast from int to array[int, x] on 1.14.0 #19763

Closed
2 tasks done
TNieuwdorp opened this issue Nov 13, 2024 · 9 comments · Fixed by #19776 or #19860
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars regression Issue introduced by a new release

Comments

@TNieuwdorp
Contributor

TNieuwdorp commented Nov 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np

dtype = [("id", "<i4")]
data = np.array([[(1,), (2,)]], dtype=dtype)

df = pl.LazyFrame(data).explode(pl.col("*"))

other_df = pl.LazyFrame({"node": [1]})

result = df.join(other_df, left_on="id", right_on="node").collect()

print(result)

Log output

No response

Issue description

When joining two LazyFrames, one of which was constructed from a NumPy structured array and then exploded, the join fails. The schema reflects the change made by the explode, but the join still goes wrong, apparently treating the column as array[int, x].

The state of the two LazyFrames before the join:
image

The error of the join:
image

This error also occurs when optimizations are turned off:
image

Explicitly casting the columns to int32 before the operation seems to work:
image

Expected behavior

I expect the data type to not be an array, and the join to succeed.

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            macOS-15.0.1-x86_64-i386-64bit
Python:              3.12.7 (main, Oct  1 2024, 02:05:46) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2023.12.2
gevent               <not installed>
great_tables         0.10.0
matplotlib           3.8.4
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              14.0.2
pydantic             1.10.15
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           <not installed>
@TNieuwdorp TNieuwdorp added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Nov 13, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Nov 14, 2024

This should have been fixed by #19753, and we have made a new patch release available; take a look at 1.13.1? 👍

@TNieuwdorp TNieuwdorp changed the title join fails because of an uninstructed cast from int to array[int, x] on 1.13.0 join fails because of an uninstructed cast from int to array[int, x] on 1.13.1 Nov 14, 2024
@TNieuwdorp
Contributor Author

@alexander-beedie Unfortunately that patch doesn't fix the problem, and it still occurs exactly as described.

@alexander-beedie
Collaborator

Hmm, surprising! Are you able to create a small reproducible test case that you can paste into the issue report?

@nameexhaustion
Collaborator

nameexhaustion commented Nov 14, 2024

Did you by any chance happen to have exploded the pgm_id column at some point?

I think I can produce the same error message with this -

q = pl.LazyFrame().select(
    pl.lit(pl.Series([[1, 1], [2, 2]], dtype=pl.Array(pl.Int64, 2)))
    .explode()
    .alias("k")
)

q = q.join(pl.LazyFrame({"k": [1, 2]}), on="k")

print(q.collect())

@nameexhaustion nameexhaustion self-assigned this Nov 14, 2024
@nameexhaustion nameexhaustion added regression Issue introduced by a new release and removed needs triage Awaiting prioritization by a maintainer labels Nov 14, 2024
@TNieuwdorp
Contributor Author

@nameexhaustion Let me know if you get stuck on this; I might be able to dig a bit deeper into our code to figure out the source of the data and the order of operations leading up to this.

@TNieuwdorp
Contributor Author

> Did you by any chance happen to have exploded the pgm_id column at some point?
>
> I think I can produce the same error message with this -
>
> q = pl.LazyFrame().select(
>     pl.lit(pl.Series([[1, 1], [2, 2]], dtype=pl.Array(pl.Int64, 2)))
>     .explode()
>     .alias("k")
> )
>
> q = q.join(pl.LazyFrame({"k": [1, 2]}), on="k")
>
> print(q.collect())

Checking...

@TNieuwdorp
Contributor Author

Yes, explode() is applied to the data!
image

The original data comes from a structured numpy array (although since you reproduced it without that, it might not be relevant)
image

@TNieuwdorp TNieuwdorp changed the title join fails because of an uninstructed cast from int to array[int, x] on 1.13.1 join fails because of an uninstructed cast from int to array[int, x] on 1.14.0 Nov 18, 2024
@TNieuwdorp
Contributor Author

TNieuwdorp commented Nov 18, 2024

@nameexhaustion Unfortunately the error persists on 1.14.0 and it seems to be caused by something else...
I'm working on getting you guys an MRE

@c-peters c-peters added the accepted Ready for implementation label Nov 18, 2024
@c-peters c-peters added this to Backlog Nov 18, 2024
@c-peters c-peters moved this to Done in Backlog Nov 18, 2024
@TNieuwdorp
Contributor Author

@ritchie46 @nameexhaustion I've updated the original issue with an MRE
