Introduce a validity buffer for `Struct` columns #3462

cbilot · 2022-05-21T21:16:41Z

polars 0.13.37+ (compiled through edc4dbd)
python 3.10.4
Linux Mint 20.3

Describe your bug.

I would argue that the following two lists are not the same:

1. [{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]
2. [{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would also argue that a Series based on these two lists should not be the same. However, Polars currently conflates these two ideas. Either list produces the same Series.

None as an element in a Series

Perhaps the easiest way to see this is to perform a round trip: from List of dictionaries --> Series of struct --> List of dictionaries.

import polars as pl
_values = [
    {"a": 1, "b": 10},
    None,
    {"a": 3, "b": 30},
]

>>> _values
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        {null,null}
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would argue that the round-trip should have produced:

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        null
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

Structs with all fields set to None

By contrast, a round-trip for a dictionary with all values set to None preserves the values.

_values = [
    {"a": None, "b": None},
]

>>> _values
[{'a': None, 'b': None}]

>>> pl.Series(values=_values)
shape: (1,)
Series: '' [struct[2]]
[
        {null,null}
]

>>> pl.Series(values=_values).to_list()
[{'a': None, 'b': None}]

`is_null` returns true for a struct element with all fields set to None

Polars appears to treat a struct with all fields set to None as a None element in a Series.

_values = [
    {"a": None, "b": None},
]
pl.Series(values=_values).is_null()

shape: (1,)
Series: 'a' [bool]
[
        true
]

I would argue that the result should have been False: a struct with all fields set to None is not the same as None as an element of a Series.

Setting a Series element to None.

Polars currently does not allow setting a Series element to None for a Series[struct].

s = pl.Series(values=[{"a": 1, "b": 10}])

>>> s
shape: (1,)
Series: '' [struct[2]]
[
        {1,10}
]

>>> s.set_at_idx(0, None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/projects/polars/py-polars/polars/internals/series.py", line 2287, in set_at_idx
    raise ValueError(
ValueError: could not find the FFI function needed to set at idx for series <builtins.PySeries object at 0x7f27fa753780>

Other Notes

The above may also be applicable to Lists[struct] (as opposed to Series[struct]).

The text was updated successfully, but these errors were encountered:

cbilot · 2022-05-21T21:38:52Z

I should probably add ... I myself am not convinced that these two Series elements are (necessarily) different in all problem domains:

None
{'survival': None, 'recovery_percent': None}

But might they be in some problem domains or data models? For example, the first may represent that an experiment was not performed (or not applicable); the second may represent that an experiment was performed, but that the information is not yet available.

Regardless, if the goal is to support highly recursive structures from various sources and various problem domains, Polars may need the ability to maintain the distinction between the two concepts: a Series element of None (as is done with Series of other data types) versus a struct with all fields set to None.

ritchie46 · 2022-05-22T06:53:09Z

I need to think about this one a bit. This will needs some refactors on the way we have designed the Struct logical type. When I designed it I decided that we simply use the fields null values and don't set a validity on the Struct itself.

I think we must enforce that idea. And this

[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

should fail as it should be constructed as follows:

[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

Then we could later decide if we need validity on structs as well.

0x26res · 2023-03-09T11:48:03Z

Hey, I have similar issues.

I would like to point out that the behaviour of polars diverges from arrow on this point, and @cbilot suggestion would make the two converge.

So in arrow:

StructArray holds a mask array (of bool) that specify if the struct is null. As such you can have a non null struct with all its fields set to null
When a struct entry is null, it's nested fields don't have to be null them self. You can set them to any arbitrary values. They would be ignored for the most part, and won't be taken into consideration when comparing 2 struct array.

left = pa.StructArray.from_arrays(
    [pa.array(['Valid', 'Ignored'])],
    names=['field1'],
    mask=pa.array([False, True])
)

right = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    names=['field1'],
    mask=pa.array([False, True])
)
assert left == right

This may feel weird but it is useful when you take nullability in consideration. You may have a nullable top level struct whose child are not nullable. In which case you'd want to set the non-nullable child fields of null top level struct to default values, and the nullability constraint would still be valid.

with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', ''])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)

Vs this, which is not valid because the field1 is not nullable:

with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)

So when going from arrow to polar and back some information is lost because of these semantic differences
Here's an example illustrating it:

import pyarrow as pa
import pyarrow.compute as pc
import polars as pl

field1_values = pa.array(['1', ''])
field2_values = pa.array([1, 0])


struct_array = pa.StructArray.from_arrays(
    [field1_values, field2_values],
    names=['field1', 'field2'],
    mask=pa.array([False, True])
)

table = pa.table({"col1": struct_array})

assert table['col1'].combine_chunks().field(0).to_pylist() == ['1', ""], "raw nested value is Preserved, ignoring mask"
assert table['col1'].combine_chunks().field(1).to_pylist() == [1, 0],  "raw nested value is Preserved, ignoring mask"
assert pc.struct_field(table['col1'], 0).to_pylist() == ['1', None], "struct_field overlay the mask on the raw value"
assert pc.struct_field(table['col1'], 1).to_pylist() == [1, None], "struct_field overlay the mask on the raw value"

df = pl.from_arrow(table)

assert df['col1'].is_null().to_list() == [False, True] == table['col1'].is_null().to_pylist(), "As expected"

table_from_polars = df.to_arrow()

assert table_from_polars['col1'].combine_chunks().field(0).to_pylist() == ['1', None], "Should be [1, '']"
assert table_from_polars['col1'].combine_chunks().field(1).to_pylist() == [1, None], "Should be [1, 0]"
assert table_from_polars['col1'].is_null().to_pylist() == [False, False], "Should be [True, False]"

stinodego · 2024-03-26T09:36:13Z

We could solve this by adding a validity buffer to Struct columns to fix this issue. However, this is not a simple change at all. We're not sure if that is the way to go yet.

It currently does not have priority for us to implement this.

NickCrews · 2024-06-04T01:12:50Z

I just ran into this when implementing a PR in Ibis

Not just arrow, but other backends such as duckdb, flink, and bigquery all have nullable structs. Polars is definitely the odd one out here. This change would make polars consistent with the rest of the ecosystem. I probably can't do the brunt of the work here, but happy to give a review! Thank you!

coastalwhite · 2024-06-10T10:22:30Z

Just as a note for later. If we end up implementing this, we should test the parquet writer/reader to also make sure that this roundtrips.

coastalwhite · 2024-10-30T14:51:07Z

This is now implemented. Closing.

cbilot added the bug Something isn't working label May 21, 2022

stinodego added needs triage Awaiting prioritization by a maintainer A-dtype-struct Area: struct data type P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Jan 13, 2024

github-project-automation bot added this to Backlog Jan 18, 2024

github-project-automation bot moved this to Ready in Backlog Jan 18, 2024

stinodego added this to the 1.0.0 milestone Jan 18, 2024

stinodego added the python Related to Python Polars label Jan 18, 2024

stinodego added P-medium Priority: medium and removed P-high Priority: high labels Mar 26, 2024

stinodego removed this from the 1.0.0 milestone Mar 26, 2024

stinodego added P-low Priority: low and removed P-medium Priority: medium labels Mar 26, 2024

stinodego changed the title ~~Series[struct]: null Series element versus a struct element with all fields == null~~ Introduce a validity buffer for Struct columns Apr 8, 2024

reswqa mentioned this issue Apr 9, 2024

Regression 0.20.15->0.20.16: ComputeError: conversion from null to struct[100] failed in column 'literal' for 0 out of 1 values #15476

Closed

2 tasks

orlp mentioned this issue Apr 16, 2024

Unexpected interaction between pl.when and pl.struct #15703

Closed

2 tasks

stinodego mentioned this issue Apr 17, 2024

Support empty structs #9216

Closed

orlp mentioned this issue Apr 17, 2024

first() returns wrong result on empty Dataframe with struct column #15727

Closed

2 tasks

stinodego added accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium and removed bug Something isn't working P-low Priority: low labels May 25, 2024

stinodego mentioned this issue Jun 3, 2024

pl.lit(None, dtype=pl.Struct({"a": pl.Int64()})) gives {'a': None}, not None #16696

Closed

2 tasks

cjackal mentioned this issue Jun 22, 2024

Feature Request: add to df.write_ndjson(json_columns: list[str]), for columns to be decoded and written out as JSON #17054

Closed

dycw mentioned this issue Aug 9, 2024

Nullability of Struct series #18119

Closed

2 tasks

coastalwhite closed this as completed Oct 30, 2024

github-project-automation bot moved this from Ready to Done in Backlog Oct 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce a validity buffer for `Struct` columns #3462

Introduce a validity buffer for `Struct` columns #3462

cbilot commented May 21, 2022 •

edited

Loading

cbilot commented May 21, 2022 •

edited

Loading

ritchie46 commented May 22, 2022

0x26res commented Mar 9, 2023

stinodego commented Mar 26, 2024 •

edited

Loading

NickCrews commented Jun 4, 2024

coastalwhite commented Jun 10, 2024

coastalwhite commented Oct 30, 2024

Introduce a validity buffer for Struct columns #3462

Introduce a validity buffer for Struct columns #3462

Comments

cbilot commented May 21, 2022 • edited Loading

Describe your bug.

None as an element in a Series

Structs with all fields set to None

is_null returns true for a struct element with all fields set to None

Setting a Series element to None.

Other Notes

cbilot commented May 21, 2022 • edited Loading

ritchie46 commented May 22, 2022

0x26res commented Mar 9, 2023

stinodego commented Mar 26, 2024 • edited Loading

NickCrews commented Jun 4, 2024

coastalwhite commented Jun 10, 2024

coastalwhite commented Oct 30, 2024

Introduce a validity buffer for `Struct` columns #3462

Introduce a validity buffer for `Struct` columns #3462

cbilot commented May 21, 2022 •

edited

Loading

`is_null` returns true for a struct element with all fields set to None

cbilot commented May 21, 2022 •

edited

Loading

stinodego commented Mar 26, 2024 •

edited

Loading