Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add array function to polars expressions #19079

Closed
wants to merge 10 commits into from

Conversation

corwinjoy
Copy link
Contributor

@corwinjoy corwinjoy commented Oct 3, 2024

Intent:
Add a function to create a new array column from a set of input columns (similar to concat_list) as discussed in #18090 and #8510.

This is a draft PR for discussion and to see what such a function would look like. The exact semantics have not been fully fleshed out.

@corwinjoy
Copy link
Contributor Author

Here are some of the issues and semantics I am uncertain about:

  1. I believe the array function should take an optional DataType in order to create an array of a particular target type. I have done this in an awkward way, and could use help here on how to pass a DataType down to an expression. (It seems like I can use Wrap<DataType> at the top level but not down an the expression level). When a target type is specified, it will try to cast/convert columns to that target type.
  2. If a target type is not specified, right now I use the first column as the target type. Another option could be to use try_get_supertype but this is a very broad upcast and I'm not convinced this function should apply beyond primitive types.
  3. Right now, this function only works on numeric types. Probably String types should be added but I'm not quite sure how to create a String Array since the builders seem to only work on numeric types.
  4. Should this function work on Array and List columns? If so, it should probably flatten lists by one level as done by concat_list and discussed in Split off list constructor logic from pl.concat_list into pl.list #8510. That is array(array(pl.all())) = array(pl.all()). In other words, in previous discussions, it was mentioned that this function should be idempotent.

I think those are the main questions I had about the design and the other parts are more polishing the code since I am not all that expert in the existing code base. Also, if we are adding array perhaps a new list function should be created at the same time in line with #8510 (comment) ?

"-C", "link-arg=-L.../pyenv.git/versions/3.10.15/lib",
"-C", "link-arg=-lpython3.10",
]

Copy link
Contributor Author

@corwinjoy corwinjoy Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI. This is temporary developer scaffolding so I can run and debug via cargo test directly. See e.g. https://stackoverflow.com/questions/78204333/how-to-run-rust-library-unit-tests-with-maturin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed. This shouldn't be commited.

Remove dead code in as_datatype.py
Fix minor syntax in array.rs
@corwinjoy
Copy link
Contributor Author

corwinjoy commented Oct 3, 2024

I guess the other way to write this would be via a call to concat_list. But then, you give up a lot of opportunities for type checking (e.g. you may not want to glue together ragged rows). I think part of the impetus for a new function here would be to have clearer and more explicit behavior when constructing arrays. To be explicit, below is an example of creating an array via concat_list. As mentioned, the problems I see with just reusing concat_list are efficiency and possibly unwanted behavior like wrapping rows.

df = pl.DataFrame(
    [
        pl.Series("f1", [1, 2]),
        pl.Series("f2", [3, None]),
    ],
    schema={
        "f1": pl.Int32,
        "f2": pl.Float64,
    },
)
result = print(df.with_columns(arr=pl.concat_list(pl.all()).list.to_array(2)))


shape: (2, 3)
┌─────┬──────┬───────────────┐
│ f1  ┆ f2   ┆ arr           │
│ --- ┆ ---  ┆ ---           │
│ i32 ┆ f64  ┆ array[f64, 2] │
╞═════╪══════╪═══════════════╡
│ 1   ┆ 3.0  ┆ [1.0, 3.0]    │
│ 2   ┆ null ┆ [2.0, null]   │
└─────┴──────┴───────────────┘

Copy link

@NickCrews NickCrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, looks like a great start. It's good to get these open questions out there.

I would vote for up casting when given mixed types, following the same logic of eg int64+uint32?

@@ -502,6 +502,30 @@ def concat_list(exprs: IntoExpr | Iterable[IntoExpr], *more_exprs: IntoExpr) ->
return wrap_expr(plr.concat_list(exprs))


def array(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this API be simply array(vals: Iterable[IntoExpr], *, datatype:...)? Then it is the same semantics of pythons builtin list(), and it is unambiguous as to whether or not we remove a level of nesting if we pass in multiple list expressions (we don't).

Copy link
Contributor Author

@corwinjoy corwinjoy Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NickCrews, I think I agree with you on this point but I'm a bit fuzzy on the endpoint. That is, I think it would help to exemplify to understand exactly what syntax would be allowed. Let's say we have a signature of array(vals: Iterable[IntoExpr], datatype:...) - here I'm dropping the second argument of * since I think we want just the one named argument.
So we can have expressions like
pl.array(['a', 'b', 'c']) for a set of columns
and expressions like pl.array([f"A_lag_{i}" for i in range(3)][::-1]) (from the concat_list docs for a rolling window).

But where I start to get fuzzy is things like column globs. I think it is valuable to have syntax like
pl.array(pl.all())
The all function just returns a single expression (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html#polars.all) so this may not be compatible with the idea of taking an iterable. (Although I am not clear on this point since there is an interaction with the python bindings and whether "wildcard expansion" is applied).
Anyway, I think there are a few permutations here and I think it would be helpful to have some concrete examples and how they interact with the expression syntax.

from polars.testing import assert_series_equal


def test_array() -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would eventually love tests for

  • an empty list with no datatype errors
  • an empty list with a datatype works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I need to do some investigation on whether polars supports 0 length arrays since this is what I would expect out of an empty list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NickCrews so, from what I can tell, there does not seem to be the concept of an empty Array in polars. That is to say, the rows in an array must have a positive length.
So, for example, we can have empty lists:

s = pl.Series("A", [[], []], dtype = pl.List(pl.Int64))
print(s)

shape: (2,)
Series: 'A' [list[i64]]
[
	[]
	[]
]

If we try to convert this to an array we get an error:

a = s.list.to_array(0)

Gives the error: polars.exceptions.ComputeError: not all elements have the specified width 0

Similarly, we can't directly construct an empty array. It seems the closest we can get is:

s = pl.Series("A", [[None], [None]], dtype = pl.Array(pl.Int64, 1))
print(s)

shape: (2,)
Series: 'A' [array[i64, 1]]
[
	[null]
	[null]
]

For now, if we call this function with an empty array like
df.select(pl.array([], dtype='{"DtypeColumn":["Float64"]}').alias("z"))["z"]
what we get is
polars.exceptions.ComputeError: 'array' needs one or more expressions

At any rate, right now it is not clear what we should return for an empty array other than an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty array was added in #18940 (1.9.0 release)

s.list.to_array(0)
# shape: (2,)
# Series: 'A' [array[i64, 0]]
# [
# 	[]
# 	[]
# ]

@NickCrews
Copy link

Should this function work on Array and List columns?

I think this is a high priority for me.

If so, it should probably flatten lists by one level as done by concat_list

Can you elaborate why you think this is needed? I think we should try to avoid this, and I think if we are careful with our API (ie being more strict with what we accept, so it's not ambiguous how we will interpret it) then this is avoidable. ie the flexible API of array(arg_or_args: IntoExpr | Iterable[IntoExpr], *more_args: IntoExpr) is dangerous, we either need to go full variadic with array(*args: IntoExpr) or full monadic with array(args: Iterable[IntoExpr])

@corwinjoy
Copy link
Contributor Author

Should this function work on Array and List columns?

I think this is a high priority for me.

If so, it should probably flatten lists by one level as done by concat_list

Can you elaborate why you think this is needed? I think we should try to avoid this, and I think if we are careful with our API (ie being more strict with what we accept, so it's not ambiguous how we will interpret it) then this is avoidable. ie the flexible API of array(arg_or_args: IntoExpr | Iterable[IntoExpr], *more_args: IntoExpr) is dangerous, we either need to go full variadic with array(*args: IntoExpr) or full monadic with array(args: Iterable[IntoExpr])

I think it would be reasonable to give up the iterable syntax, (given that we cannot have empty arrays), and only support the variadic syntax of array(*args: IntoExpr). The reason for choosing this route would be to support glob expansions expressions like pl.all().

"-C", "link-arg=-L.../pyenv.git/versions/3.10.15/lib",
"-C", "link-arg=-lpython3.10",
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed. This shouldn't be commited.

@@ -41,8 +41,9 @@ rayon = { workspace = true }
recursive = { workspace = true }
regex = { workspace = true, optional = true }
serde = { workspace = true, features = ["rc"], optional = true }
serde_json = { workspace = true, optional = true }
serde_json = { workspace = true, optional = false }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't add a new ArrayVec dependency. And serde_json should not be a required dependency here.

Adding dependencies is not something we want ideally.


pub fn array_from_expr<E: AsRef<[IE]>, IE: Into<Expr> + Clone>(
s: E,
dtype_str: &str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we get a string here?

pub struct ArrayKwargs {
// Not sure how to get a serializable DataType here
// For prototype, use fixed size string
pub dtype_expr: ArrayString<256>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have a string here?

@corwinjoy
Copy link
Contributor Author

Closing this for now. I think the spec and how this should behave is too unclear. For now, concat_list(..).list.to_array(..) is workable and I think this can be revised when the Polars team has the chance to firm up what these functions should look like.

@corwinjoy corwinjoy closed this Nov 1, 2024
@NickCrews
Copy link

Thank you @corwinjoy for the effort though! this prior art is going to be invaluable for when someone does get to solving this. Thanks @ritchie46 for the feedback too. If you ever want to spend more time firming up the API/semantics should actually look like, I would be willing to write up a concrete spec/rationale for people to comment on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants