-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add array function to polars expressions #19079
Conversation
Here are some of the issues and semantics I am uncertain about:
I think those are the main questions I had about the design and the other parts are more polishing the code since I am not all that expert in the existing code base. Also, if we are adding |
"-C", "link-arg=-L.../pyenv.git/versions/3.10.15/lib", | ||
"-C", "link-arg=-lpython3.10", | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI. This is temporary developer scaffolding so I can run and debug via cargo test
directly. See e.g. https://stackoverflow.com/questions/78204333/how-to-run-rust-library-unit-tests-with-maturin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be removed. This shouldn't be commited.
Remove dead code in as_datatype.py Fix minor syntax in array.rs
I guess the other way to write this would be via a call to
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, looks like a great start. It's good to get these open questions out there.
I would vote for up casting when given mixed types, following the same logic of eg int64+uint32?
@@ -502,6 +502,30 @@ def concat_list(exprs: IntoExpr | Iterable[IntoExpr], *more_exprs: IntoExpr) -> | |||
return wrap_expr(plr.concat_list(exprs)) | |||
|
|||
|
|||
def array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this API be simply array(vals: Iterable[IntoExpr], *, datatype:...)? Then it is the same semantics of pythons builtin list(), and it is unambiguous as to whether or not we remove a level of nesting if we pass in multiple list expressions (we don't).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NickCrews, I think I agree with you on this point but I'm a bit fuzzy on the endpoint. That is, I think it would help to exemplify to understand exactly what syntax would be allowed. Let's say we have a signature of array(vals: Iterable[IntoExpr], datatype:...)
- here I'm dropping the second argument of *
since I think we want just the one named argument.
So we can have expressions like
pl.array(['a', 'b', 'c'])
for a set of columns
and expressions like pl.array([f"A_lag_{i}" for i in range(3)][::-1])
(from the concat_list docs for a rolling window).
But where I start to get fuzzy is things like column globs. I think it is valuable to have syntax like
pl.array(pl.all())
The all
function just returns a single expression (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html#polars.all) so this may not be compatible with the idea of taking an iterable. (Although I am not clear on this point since there is an interaction with the python bindings and whether "wildcard expansion" is applied).
Anyway, I think there are a few permutations here and I think it would be helpful to have some concrete examples and how they interact with the expression syntax.
from polars.testing import assert_series_equal | ||
|
||
|
||
def test_array() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would eventually love tests for
- an empty list with no datatype errors
- an empty list with a datatype works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I need to do some investigation on whether polars supports 0 length arrays since this is what I would expect out of an empty list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickCrews so, from what I can tell, there does not seem to be the concept of an empty Array in polars. That is to say, the rows in an array must have a positive length.
So, for example, we can have empty lists:
s = pl.Series("A", [[], []], dtype = pl.List(pl.Int64))
print(s)
shape: (2,)
Series: 'A' [list[i64]]
[
[]
[]
]
If we try to convert this to an array we get an error:
a = s.list.to_array(0)
Gives the error: polars.exceptions.ComputeError: not all elements have the specified width 0
Similarly, we can't directly construct an empty array. It seems the closest we can get is:
s = pl.Series("A", [[None], [None]], dtype = pl.Array(pl.Int64, 1))
print(s)
shape: (2,)
Series: 'A' [array[i64, 1]]
[
[null]
[null]
]
For now, if we call this function with an empty array like
df.select(pl.array([], dtype='{"DtypeColumn":["Float64"]}').alias("z"))["z"]
what we get is
polars.exceptions.ComputeError: 'array' needs one or more expressions
At any rate, right now it is not clear what we should return for an empty array other than an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The empty array was added in #18940 (1.9.0 release)
s.list.to_array(0)
# shape: (2,)
# Series: 'A' [array[i64, 0]]
# [
# []
# []
# ]
I think this is a high priority for me.
Can you elaborate why you think this is needed? I think we should try to avoid this, and I think if we are careful with our API (ie being more strict with what we accept, so it's not ambiguous how we will interpret it) then this is avoidable. ie the flexible API of array(arg_or_args: IntoExpr | Iterable[IntoExpr], *more_args: IntoExpr) is dangerous, we either need to go full variadic with array(*args: IntoExpr) or full monadic with array(args: Iterable[IntoExpr]) |
I think it would be reasonable to give up the iterable syntax, (given that we cannot have empty arrays), and only support the variadic syntax of |
"-C", "link-arg=-L.../pyenv.git/versions/3.10.15/lib", | ||
"-C", "link-arg=-lpython3.10", | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be removed. This shouldn't be commited.
@@ -41,8 +41,9 @@ rayon = { workspace = true } | |||
recursive = { workspace = true } | |||
regex = { workspace = true, optional = true } | |||
serde = { workspace = true, features = ["rc"], optional = true } | |||
serde_json = { workspace = true, optional = true } | |||
serde_json = { workspace = true, optional = false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't add a new ArrayVec dependency. And serde_json
should not be a required dependency here.
Adding dependencies is not something we want ideally.
|
||
pub fn array_from_expr<E: AsRef<[IE]>, IE: Into<Expr> + Clone>( | ||
s: E, | ||
dtype_str: &str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get a string here?
pub struct ArrayKwargs { | ||
// Not sure how to get a serializable DataType here | ||
// For prototype, use fixed size string | ||
pub dtype_expr: ArrayString<256>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have a string here?
Closing this for now. I think the spec and how this should behave is too unclear. For now, |
Thank you @corwinjoy for the effort though! this prior art is going to be invaluable for when someone does get to solving this. Thanks @ritchie46 for the feedback too. If you ever want to spend more time firming up the API/semantics should actually look like, I would be willing to write up a concrete spec/rationale for people to comment on. |
Intent:
Add a function to create a new array column from a set of input columns (similar to concat_list) as discussed in #18090 and #8510.
This is a draft PR for discussion and to see what such a function would look like. The exact semantics have not been fully fleshed out.