-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for median
#1212
base: main
Are you sure you want to change the base?
Conversation
Hey @AlessandroMiola thanks for the effort! This already looks very promising.
I think that's good enough as long as we document that results may slightly differ between backends because of the difference of underlying algorithms used
This may need some investigation
This is much appreciated 😁 I have a couple of additional considerations:
|
Thanks for your help @FBruzzesi! :) I'll try to address all of your comments! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, looks really good!
regarding
Polars on string series returns None,
Polars does indeed raise here for Expr:
In [10]: df
Out[10]:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ f │
│ 2 ┆ a │
│ 3 ┆ x │
└─────┴─────┘
In [11]: df.select(pl.median('b'))
---------------------------------------------------------------------------
InvalidOperationError Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.select(pl.median('b'))
File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9010, in DataFrame.select(self, *exprs, **named_exprs)
8910 def select(
8911 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
8912 ) -> DataFrame:
8913 """
8914 Select columns from this DataFrame.
8915
(...)
9008 └──────────────┘
9009 """
-> 9010 return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2050, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
2048 # Only for testing purposes
2049 callback = _kwargs.get("post_opt_callback", callback)
-> 2050 return wrap_df(ldf.collect(callback))
InvalidOperationError: `median` operation not supported for dtype `str`
I find it a bit odd that Series.median
doesn't raise for Polars, and think it'd be ok for us to just always raise
A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is?
Please correct me if I'm completely off-road! Thanks |
Hey @AlessandroMiola , I would personally opt for the second option: you could create a So maybe first option 😂 |
ca18cf7
to
c49127e
Compare
Hey @AlessandroMiola , I think this is fairly ready for review, however in #1224 , |
Hi @FBruzzesi, thanks for having a look. Besides what you mention, I still have uncommitted changes related to:
This is indeed pretty straightforward (as all backends seem to natively do it) and potentially ready for being committed and pushed.
As above.
Here, I see your point and I indeed had some issues in figuring out how to deal with
On this, I haven't started yet :| Later today and tomorrow I should have time, so I'll hopefully update you and/or push something. |
c762c6d
to
6129a4e
Compare
80d0e74
to
80ecc0f
Compare
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
WIP: I would need to better dig into details for
pyarrow
anddask
. Already opened the PR to (also) get advices and/or guidance:)My understanding per now (might be missing bits, though) is the following:
pyarrow
nordask
do implement a "proper"median
; otoh, they respectively implementapproximate_median
andmedian_approximate
(dask
'smedian_approximate
fails tests whennpartitions=2
, though)quantile
(s) would come with the following issues: forpyarrow
, the issue of having aninterpolation
parameter which would not quite fit thenw.median()
signature; for dask, the issue of having to hardcode a"linear"
interpolation strategy and having toxfail
tests as well when encounteringdask_lazy_p2_constructor
.Also, need to revise docstrings so as to comply with #1000 for
median
.