-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a validity buffer for Struct
columns
#3462
Comments
I should probably add ... I myself am not convinced that these two Series elements are (necessarily) different in all problem domains:
But might they be in some problem domains or data models? For example, the first may represent that an experiment was not performed (or not applicable); the second may represent that an experiment was performed, but that the information is not yet available. Regardless, if the goal is to support highly recursive structures from various sources and various problem domains, Polars may need the ability to maintain the distinction between the two concepts: a Series element of None (as is done with Series of other data types) versus a struct with all fields set to None. |
I need to think about this one a bit. This will needs some refactors on the way we have designed the I think we must enforce that idea. And this [{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}] should fail as it should be constructed as follows: [{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}] Then we could later decide if we need validity on structs as well. |
Hey, I have similar issues. I would like to point out that the behaviour of polars diverges from arrow on this point, and @cbilot suggestion would make the two converge. So in arrow:
left = pa.StructArray.from_arrays(
[pa.array(['Valid', 'Ignored'])],
names=['field1'],
mask=pa.array([False, True])
)
right = pa.StructArray.from_arrays(
[pa.array(['Valid', None])],
names=['field1'],
mask=pa.array([False, True])
)
assert left == right
with_nullable = pa.StructArray.from_arrays(
[pa.array(['Valid', ''])],
fields=[pa.field('field1', pa.string(), nullable=False)],
mask=pa.array([False, True])
) Vs this, which is not valid because the field1 is not nullable:
So when going from arrow to polar and back some information is lost because of these semantic differences import pyarrow as pa
import pyarrow.compute as pc
import polars as pl
field1_values = pa.array(['1', ''])
field2_values = pa.array([1, 0])
struct_array = pa.StructArray.from_arrays(
[field1_values, field2_values],
names=['field1', 'field2'],
mask=pa.array([False, True])
)
table = pa.table({"col1": struct_array})
assert table['col1'].combine_chunks().field(0).to_pylist() == ['1', ""], "raw nested value is Preserved, ignoring mask"
assert table['col1'].combine_chunks().field(1).to_pylist() == [1, 0], "raw nested value is Preserved, ignoring mask"
assert pc.struct_field(table['col1'], 0).to_pylist() == ['1', None], "struct_field overlay the mask on the raw value"
assert pc.struct_field(table['col1'], 1).to_pylist() == [1, None], "struct_field overlay the mask on the raw value"
df = pl.from_arrow(table)
assert df['col1'].is_null().to_list() == [False, True] == table['col1'].is_null().to_pylist(), "As expected"
table_from_polars = df.to_arrow()
assert table_from_polars['col1'].combine_chunks().field(0).to_pylist() == ['1', None], "Should be [1, '']"
assert table_from_polars['col1'].combine_chunks().field(1).to_pylist() == [1, None], "Should be [1, 0]"
assert table_from_polars['col1'].is_null().to_pylist() == [False, False], "Should be [True, False]" |
We could solve this by adding a validity buffer to Struct columns to fix this issue. However, this is not a simple change at all. We're not sure if that is the way to go yet. It currently does not have priority for us to implement this. |
Struct
columns
I just ran into this when implementing a PR in Ibis Not just arrow, but other backends such as duckdb, flink, and bigquery all have nullable structs. Polars is definitely the odd one out here. This change would make polars consistent with the rest of the ecosystem. I probably can't do the brunt of the work here, but happy to give a review! Thank you! |
Just as a note for later. If we end up implementing this, we should test the parquet writer/reader to also make sure that this roundtrips. |
This is now implemented. Closing. |
polars 0.13.37+ (compiled through edc4dbd)
python 3.10.4
Linux Mint 20.3
Describe your bug.
I would argue that the following two lists are not the same:
I would also argue that a Series based on these two lists should not be the same. However, Polars currently conflates these two ideas. Either list produces the same Series.
None as an element in a Series
Perhaps the easiest way to see this is to perform a round trip: from List of dictionaries --> Series of struct --> List of dictionaries.
I would argue that the round-trip should have produced:
Structs with all fields set to None
By contrast, a round-trip for a dictionary with all values set to None preserves the values.
is_null
returns true for a struct element with all fields set to NonePolars appears to treat a struct with all fields set to None as a
None
element in a Series.I would argue that the result should have been
False
: a struct with all fields set to None is not the same asNone
as an element of a Series.Setting a Series element to None.
Polars currently does not allow setting a Series element to
None
for a Series[struct].Other Notes
The above may also be applicable to Lists[struct] (as opposed to Series[struct]).
The text was updated successfully, but these errors were encountered: