Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add linear_space #20678

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Jan 12, 2025

First pass at #5255.

Update: linear space with temporals is now implemented, along with an f32 path, and stronger testing.

import polars as pl
from datetime import date, datetime, time, timedelta

pl.linear_space(date(2025, 1, 1), date(2025, 2, 1), 3, eager=True)
# shape: (3,)
# Series: 'literal' [datetime[ms]]
# [
#         2025-01-01 00:00:00
#         2025-01-16 12:00:00
#         2025-02-01 00:00:00
# ]

pl.linear_space(pl.lit(0, pl.Float32), pl.lit(1, pl.Float32), 5, eager=True)
# shape: (5,)
# Series: 'literal' [f32]
# [
#         0.0
#         0.25
#         0.5
#         0.75
#         1.0
# ]

I kept the name linear_space since that's a more canonical polars name, although linspace is far more recognizable. The usage is fairly simple, and the closed is one of both (default), left, right, or none. In each case, the number of samples asked for is provided.

import polars as pl

pl.linear_space(0, 5, 3, closed="both", eager=True)
# shape: (3,)
# Series: 'literal' [f64]
# [
#         0.0
#         2.5
#         5.0
# ]

pl.linear_space(0, 1, 3, closed="left", eager=True)
# shape: (3,)
# Series: 'literal' [f64]
# [
#         0.0
#         0.333333
#         0.666667
# ]

>>> pl.linear_space(0, 1, 5, closed="none", eager=True)
# shape: (5,)
# Series: 'literal' [f64]
# [
#         0.166667
#         0.333333
#         0.5
#         0.666667
#         0.833333
# ]

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 12, 2025

#[derive(Copy, Clone, Debug, Hash, Eq, PartialEq, Default, IntoStaticStr)]
Copy link
Contributor Author

@mcrumiller mcrumiller Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved ClosedInterval into linear_space since it's feature-gated here and I wanted to re-use it. I'm not sure if there's a better place for it.

@mcrumiller mcrumiller marked this pull request as ready for review January 13, 2025 03:26
Copy link

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 93.64162% with 11 lines in your changes missing coverage. Please review.

Project coverage is 79.71%. Comparing base (1cd72ff) to head (8d6f682).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-ops/src/series/ops/linear_space.rs 84.37% 10 Missing ⚠️
...tes/polars-plan/src/dsl/function_expr/range/mod.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20678      +/-   ##
==========================================
- Coverage   79.92%   79.71%   -0.21%     
==========================================
  Files        1559     1564       +5     
  Lines      221181   222103     +922     
  Branches     2530     2531       +1     
==========================================
+ Hits       176774   177056     +282     
- Misses      43825    44465     +640     
  Partials      582      582              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mcrumiller
Copy link
Contributor Author

I don't think closed = "none" should exist.

I wasn't sure about that one myself. Will update.

@orlp
Copy link
Collaborator

orlp commented Jan 13, 2025

@mcrumiller I changed my mind right after and deleted my comment, but I guess I wasn't fast enough. I think it's alright to have - equidistant points over the range, not including either start or end.

I'm not sure about the name "none" though. Maybe it should be "neither"? Do we have precedence for a closed argument with "none" anywhere else?

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 13, 2025

I'm not sure about the name "none" though. Maybe it should be "neither"? Do we have precedence for a closed argument with "none" anywhere else?

I was thinking about this as well. A better-suited parameter name might be something like interval_type where we would specify closed-left, closed-right, closed, or open. The parameter name "closed" is essentially a question "how is the interval closed?" and the answer none is saying "it's not". It's a bit awkward.

I grabbed the interval enumeration from is_between. So in the interest of continuity it might be best to leave the not-great "closed" option, but I don't have a strong preference either way.

@orlp
Copy link
Collaborator

orlp commented Jan 14, 2025

Let's keep using the established closed terminology then.

@mcrumiller
Copy link
Contributor Author

Sure. Do you have any comments on improvements? Should we include a return 'dtype' parameter like in numpy's implementation?

@orlp
Copy link
Collaborator

orlp commented Jan 14, 2025

Rust side mostly looks good to me. I don't think we need a dtype parameter at this time.

However I think there's still a couple things missing around dtypes:

  • If the input types are all f32, I think the output should probably also be f32.

  • If the input types are all dates/times, I think the output should also be dates/times.

  • I think we should reject inputs which are not strictly numeric, strictly date or strictly time. You should not be able to specify a start date but end time for example.

@mcrumiller mcrumiller marked this pull request as draft January 14, 2025 12:51
@mcrumiller mcrumiller marked this pull request as ready for review January 14, 2025 21:18
@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 14, 2025

@orlp, latest round of updates addressing your suggestions:

  • If the input types are all f32, I think the output should probably also be f32.

I've added separate floating implementations (I originally went for generics but it felt messy to me, so I just made two separate implementations). Included new tests.

  • If the input types are all dates/times, I think the output should also be dates/times.

I've added a temporal path. start/end points with dates produces a Datetime("ms") (similar to mean/median) and all other temporal types require the same temporal type. If the start/end data types do not match exactly (except for numeric, which are fine to mix and match) (even the time precision units are required to match), we raise.

  • I think we should reject inputs which are not strictly numeric, strictly date or strictly time. You should not be able to specify a start date but end time for example.

Done--only numeric/temporal work. Again, the start/end must match, and anything outside of these dtypes throws an Exception.

closed: ClosedInterval = "both",
eager: bool = False,
) -> Expr | Series:
"""Linearly-spaced elements."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the documentation could be a bit better, perhaps with an example for start = 0, end = 1, num_samples = 4 for each of the closed options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep forgot docstrings.

@mcrumiller mcrumiller marked this pull request as draft January 15, 2025 13:01
@mcrumiller
Copy link
Contributor Author

@orlp I think we may want num_samples to be expressionable, as a somewhat obvious use case is to generate something like pl.linspace(0, 1, pl.len()), e.g. use the height of the frame as the number of samples. Do you agree?

@orlp
Copy link
Collaborator

orlp commented Jan 15, 2025

@mcrumiller Yes, I agree, but it must resolve to a single value, similar to this error:

>>> df = pl.DataFrame({"x": [1, 2, 3]})
>>> df.select(pl.int_range(0, pl.col.x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/orlp/.localpython/lib/python3.11/site-packages/polars/dataframe/frame.py", line 9323, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/orlp/.localpython/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2057, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: `end` must contain exactly one value, got 3 values

@mcrumiller mcrumiller marked this pull request as ready for review January 16, 2025 00:26
@mcrumiller
Copy link
Contributor Author

@orlp updated num_samples to be an expression that only allows one value, added some tests as well, and updated the docstrings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants