first() returns wrong result on empty Dataframe with struct column #15727

Closed
2 tasks done
barak1412 opened this issue Apr 17, 2024 · 7 comments
Labels: A-dtype-struct (Area: struct data type), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@barak1412
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl


df = pl.DataFrame({
    'col1': []
}, schema={'col1': pl.Struct([
    pl.Field("n", dtype=pl.Int64),
    pl.Field("o", dtype=pl.Float64),
    pl.Field("l", dtype=pl.Utf8)
])})

print(df.select(pl.first('col1')))

Log output

No response

Issue description

With the example above, we get:

shape: (1, 1)
┌──────────────────┐
│ col1             │
│ ---              │
│ struct[3]        │
╞══════════════════╡
│ {null,null,null} │
└──────────────────┘

instead of an empty DataFrame.

Expected behavior

Expected output:

shape: (0, 1)
┌──────────────────┐
│ col1             │
│ ---              │
│ struct[3]        │
╞══════════════════╡
└──────────────────┘

Installed versions

--------Version info---------
Polars:               0.20.21
Index type:           UInt32
Platform:             Windows-10-10.0.19041-SP0
Python:               3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.0
connectorx:           0.3.3a1
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.5.1
nest_asyncio:         1.5.1
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              13.0.0
pydantic:             1.8.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.32
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@barak1412 added the bug, needs triage, and python labels Apr 17, 2024
@cmdlineluser
Contributor

Can reproduce.

Also happens with .last()

>>> df.select(pl.col('col1').last())
shape: (1, 1)
┌──────────────────┐
│ col1             │
│ ---              │
│ struct[3]        │
╞══════════════════╡
│ {null,null,null} │
└──────────────────┘

Doesn't happen with .slice()

>>> df.select(pl.col('col1').slice(0, 1))
shape: (0, 1)
┌───────────┐
│ col1      │
│ ---       │
│ struct[3] │
╞═══════════╡
└───────────┘

@orlp
Collaborator

orlp commented Apr 17, 2024

I'll fix this tomorrow.

@orlp self-assigned this Apr 17, 2024
@orlp added the P-medium and A-dtype-struct labels and removed the needs triage label Apr 17, 2024
@github-project-automation (bot) moved this to Ready in Backlog Apr 17, 2024
@orlp
Collaborator

orlp commented Apr 17, 2024

Wait a second, this is actually currently 'correct', until we add outer nullability for structs. .first() should return null if there's nothing to select:

>>> pl.DataFrame({"x": []}, schema={"x": pl.Int64}).select(pl.col.x.last())
shape: (1, 1)
┌──────┐
│ x    │
│ ---  │
│ i64  │
╞══════╡
│ null │
└──────┘

For structs this currently means returning an all-null row, since structs currently lack outer nullability.
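
As a rough illustration of that reasoning, here is a plain-Python model (not Polars internals; `first_scalar` and `first_struct` are hypothetical names made up for this sketch). It assumes an aggregation must always produce exactly one value, and that a struct with no outer validity can only express "missing" by making every field null.

```python
from typing import Any, Dict, List, Optional


def first_scalar(values: List[Optional[int]]) -> Optional[int]:
    # An aggregation yields exactly one value; with no input it is null (None).
    return values[0] if values else None


def first_struct(rows: List[Dict[str, Any]], fields: List[str]) -> Dict[str, Any]:
    # Assumption: no outer nullability, so a "null struct" has to be
    # represented as a struct whose fields are all null.
    return rows[0] if rows else {name: None for name in fields}


print(first_scalar([]))
print(first_struct([], ["n", "o", "l"]))
```

In this model the scalar case gives a single `None`, while the struct case gives a single struct whose fields are all `None`, mirroring the `{null,null,null}` row reported above.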

I will close this issue as a duplicate of #3462.

@orlp closed this as not planned Apr 17, 2024
@github-project-automation (bot) moved this from Ready to Done in Backlog Apr 17, 2024
@cmdlineluser
Contributor

Ah - I thought the issue was that it returned 1 row for a frame of height 0.

But that happens regardless of the dtype.

df = pl.DataFrame({"foo": []}).cast(pl.String)
# shape: (0, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ str │
# ╞═════╡
# └─────┘

df.select(pl.first("foo"))
# shape: (1, 1)
# ┌──────┐
# │ foo  │
# │ ---  │
# │ str  │
# ╞══════╡
# │ null │
# └──────┘

@barak1412
Contributor Author

@cmdlineluser

Yes, I understood that from @orlp's answers.
However, it still seems pretty odd to me to get one row with a null value.

@cmdlineluser
Contributor

Yes, the term that seems to be used for this is that it introduces a "phantom row".

I'm not sure if this is expected or not - or if it has been previously discussed.

As the title/example was about structs, it seemed like you may have been talking about the other problem (struct validity).

Perhaps you could create an issue without mentioning structs to discuss the first() behaviour.

@barak1412
Contributor Author

@cmdlineluser you are right.
The overall behaviour of first() should be discussed separately, independent of the Struct type.
