Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a validity buffer for Struct columns #3462

Closed
cbilot opened this issue May 21, 2022 · 7 comments
Closed

Introduce a validity buffer for Struct columns #3462

cbilot opened this issue May 21, 2022 · 7 comments
Labels
A-dtype-struct Area: struct data type accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium python Related to Python Polars

Comments

@cbilot
Copy link

cbilot commented May 21, 2022

polars 0.13.37+ (compiled through edc4dbd)
python 3.10.4
Linux Mint 20.3

Describe your bug.

I would argue that the following two lists are not the same:

1. [{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]
2. [{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would also argue that a Series based on these two lists should not be the same. However, Polars currently conflates these two ideas. Either list produces the same Series.

None as an element in a Series

Perhaps the easiest way to see this is to perform a round trip: from List of dictionaries --> Series of struct --> List of dictionaries.

import polars as pl
_values = [
    {"a": 1, "b": 10},
    None,
    {"a": 3, "b": 30},
]
>>> _values
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        {null,null}
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would argue that the round-trip should have produced:

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        null
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

Structs with all fields set to None

By contrast, a round-trip for a dictionary with all values set to None preserves the values.

_values = [
    {"a": None, "b": None},
]
>>> _values
[{'a': None, 'b': None}]

>>> pl.Series(values=_values)
shape: (1,)
Series: '' [struct[2]]
[
        {null,null}
]

>>> pl.Series(values=_values).to_list()
[{'a': None, 'b': None}]

is_null returns true for a struct element with all fields set to None

Polars appears to treat a struct with all fields set to None as a None element in a Series.

_values = [
    {"a": None, "b": None},
]
pl.Series(values=_values).is_null()
shape: (1,)
Series: 'a' [bool]
[
        true
]

I would argue that the result should have been False: a struct with all fields set to None is not the same as None as an element of a Series.

Setting a Series element to None.

Polars currently does not allow setting a Series element to None for a Series[struct].

s = pl.Series(values=[{"a": 1, "b": 10}])
>>> s
shape: (1,)
Series: '' [struct[2]]
[
        {1,10}
]

>>> s.set_at_idx(0, None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/projects/polars/py-polars/polars/internals/series.py", line 2287, in set_at_idx
    raise ValueError(
ValueError: could not find the FFI function needed to set at idx for series <builtins.PySeries object at 0x7f27fa753780>

Other Notes

The above may also be applicable to Lists[struct] (as opposed to Series[struct]).

@cbilot cbilot added the bug Something isn't working label May 21, 2022
@cbilot
Copy link
Author

cbilot commented May 21, 2022

I should probably add ... I myself am not convinced that these two Series elements are (necessarily) different in all problem domains:

  1. None
  2. {'survival': None, 'recovery_percent': None}

But might they be in some problem domains or data models? For example, the first may represent that an experiment was not performed (or not applicable); the second may represent that an experiment was performed, but that the information is not yet available.

Regardless, if the goal is to support highly recursive structures from various sources and various problem domains, Polars may need the ability to maintain the distinction between the two concepts: a Series element of None (as is done with Series of other data types) versus a struct with all fields set to None.

@ritchie46
Copy link
Member

I need to think about this one a bit. This will needs some refactors on the way we have designed the Struct logical type. When I designed it I decided that we simply use the fields null values and don't set a validity on the Struct itself.

I think we must enforce that idea. And this

[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

should fail as it should be constructed as follows:

[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

Then we could later decide if we need validity on structs as well.

@0x26res
Copy link

0x26res commented Mar 9, 2023

Hey, I have similar issues.

I would like to point out that the behaviour of polars diverges from arrow on this point, and @cbilot suggestion would make the two converge.

So in arrow:

  • StructArray holds a mask array (of bool) that specify if the struct is null. As such you can have a non null struct with all its fields set to null
  • When a struct entry is null, it's nested fields don't have to be null them self. You can set them to any arbitrary values. They would be ignored for the most part, and won't be taken into consideration when comparing 2 struct array.
left = pa.StructArray.from_arrays(
    [pa.array(['Valid', 'Ignored'])],
    names=['field1'],
    mask=pa.array([False, True])
)

right = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    names=['field1'],
    mask=pa.array([False, True])
)
assert left == right
  • This may feel weird but it is useful when you take nullability in consideration. You may have a nullable top level struct whose child are not nullable. In which case you'd want to set the non-nullable child fields of null top level struct to default values, and the nullability constraint would still be valid.
with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', ''])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)

Vs this, which is not valid because the field1 is not nullable:

with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)

So when going from arrow to polar and back some information is lost because of these semantic differences
Here's an example illustrating it:

import pyarrow as pa
import pyarrow.compute as pc
import polars as pl

field1_values = pa.array(['1', ''])
field2_values = pa.array([1, 0])


struct_array = pa.StructArray.from_arrays(
    [field1_values, field2_values],
    names=['field1', 'field2'],
    mask=pa.array([False, True])
)

table = pa.table({"col1": struct_array})

assert table['col1'].combine_chunks().field(0).to_pylist() == ['1', ""], "raw nested value is Preserved, ignoring mask"
assert table['col1'].combine_chunks().field(1).to_pylist() == [1, 0],  "raw nested value is Preserved, ignoring mask"
assert pc.struct_field(table['col1'], 0).to_pylist() == ['1', None], "struct_field overlay the mask on the raw value"
assert pc.struct_field(table['col1'], 1).to_pylist() == [1, None], "struct_field overlay the mask on the raw value"

df = pl.from_arrow(table)

assert df['col1'].is_null().to_list() == [False, True] == table['col1'].is_null().to_pylist(), "As expected"

table_from_polars = df.to_arrow()

assert table_from_polars['col1'].combine_chunks().field(0).to_pylist() == ['1', None], "Should be [1, '']"
assert table_from_polars['col1'].combine_chunks().field(1).to_pylist() == [1, None], "Should be [1, 0]"
assert table_from_polars['col1'].is_null().to_pylist() == [False, False], "Should be [True, False]"

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-dtype-struct Area: struct data type P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Jan 13, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 18, 2024
@stinodego stinodego added this to the 1.0.0 milestone Jan 18, 2024
@stinodego stinodego added the python Related to Python Polars label Jan 18, 2024
@stinodego stinodego added P-medium Priority: medium and removed P-high Priority: high labels Mar 26, 2024
@stinodego stinodego removed this from the 1.0.0 milestone Mar 26, 2024
@stinodego
Copy link
Member

stinodego commented Mar 26, 2024

We could solve this by adding a validity buffer to Struct columns to fix this issue. However, this is not a simple change at all. We're not sure if that is the way to go yet.

It currently does not have priority for us to implement this.

@stinodego stinodego added P-low Priority: low and removed P-medium Priority: medium labels Mar 26, 2024
@stinodego stinodego changed the title Series[struct]: null Series element versus a struct element with all fields == null Introduce a validity buffer for Struct columns Apr 8, 2024
@stinodego stinodego added accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium and removed bug Something isn't working P-low Priority: low labels May 25, 2024
@NickCrews
Copy link

I just ran into this when implementing a PR in Ibis

Not just arrow, but other backends such as duckdb, flink, and bigquery all have nullable structs. Polars is definitely the odd one out here. This change would make polars consistent with the rest of the ecosystem. I probably can't do the brunt of the work here, but happy to give a review! Thank you!

@coastalwhite
Copy link
Collaborator

Just as a note for later. If we end up implementing this, we should test the parquet writer/reader to also make sure that this roundtrips.

@coastalwhite
Copy link
Collaborator

This is now implemented. Closing.

@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dtype-struct Area: struct data type accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium python Related to Python Polars
Projects
Archived in project
Development

No branches or pull requests

6 participants