Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for median #1212

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

AlessandroMiola
Copy link
Contributor

@AlessandroMiola AlessandroMiola commented Oct 17, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

WIP: I would need to better dig into details for pyarrow and dask. Already opened the PR to (also) get advices and/or guidance:)

My understanding per now (might be missing bits, though) is the following:

  • neither pyarrow nor dask do implement a "proper" median; otoh, they respectively implement approximate_median and median_approximate (dask's median_approximate fails tests when npartitions=2, though)
  • relying on already implemented quantile(s) would come with the following issues: for pyarrow, the issue of having an interpolation parameter which would not quite fit the nw.median() signature; for dask, the issue of having to hardcode a "linear" interpolation strategy and having to xfail tests as well when encountering dask_lazy_p2_constructor.

Also, need to revise docstrings so as to comply with #1000 for median.

@github-actions github-actions bot added the enhancement New feature or request label Oct 17, 2024
@FBruzzesi
Copy link
Member

FBruzzesi commented Oct 18, 2024

Hey @AlessandroMiola thanks for the effort! This already looks very promising.

neither pyarrow nor dask do implement a "proper" median; otoh, they respectively implement approximate_median and median_approximate

I think that's good enough as long as we document that results may slightly differ between backends because of the difference of underlying algorithms used

dask's median_approximate fails tests when npartitions=2, though

This may need some investigation

Also, need to revise docstrings so as to comply with #1000 for median.

This is much appreciated 😁

I have a couple of additional considerations:

  • Polars on string series returns None, pandas raises an error, I don't know about others - should we check that the input is numeric and limit the functionality to that? cc @MarcoGorelli
  • Can we check that all algorithms ignore nulls/nans?

@AlessandroMiola
Copy link
Contributor Author

Thanks for your help @FBruzzesi! :) I'll try to address all of your comments!

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, looks really good!

regarding

Polars on string series returns None,

Polars does indeed raise here for Expr:

In [10]: df
Out[10]:
shape: (3, 2)
┌─────┬─────┐
│ ab   │
│ ------ │
│ i64str │
╞═════╪═════╡
│ 1f   │
│ 2a   │
│ 3x   │
└─────┴─────┘

In [11]: df.select(pl.median('b'))
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.select(pl.median('b'))

File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9010, in DataFrame.select(self, *exprs, **named_exprs)
   8910 def select(
   8911     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   8912 ) -> DataFrame:
   8913     """
   8914     Select columns from this DataFrame.
   8915
   (...)
   9008     └──────────────┘
   9009     """
-> 9010     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2050, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2048 # Only for testing purposes
   2049 callback = _kwargs.get("post_opt_callback", callback)
-> 2050 return wrap_df(ldf.collect(callback))

InvalidOperationError: `median` operation not supported for dtype `str`

I find it a bit odd that Series.median doesn't raise for Polars, and think it'd be ok for us to just always raise

narwhals/_polars/namespace.py Outdated Show resolved Hide resolved
@AlessandroMiola
Copy link
Contributor Author

I find it a bit odd that Series.median doesn't raise for Polars, and think it'd be ok for us to just always raise

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is?
I would have gone the first way, but I'm all ears! :D

data = {
    "a": [3, 8, 2, None],
    "b": [5, 5, None, 7],
    "z": [7.0, 8, 9, None],
    "s": ["f", "a", "x", "x"],
}

@pytest.mark.parametrize(
    "expr", [nw.col("s").median(), nw.median("s")]
)
def test_median_expr_raises_on_str(constructor: Constructor, expr: nw.Expr) -> None:
    df = nw.from_native(constructor(data))
    with pytest.raises(
        TypeError,
        match="`median` operation not supported for non-numeric input type."
    ):
        df.select(expr)

@pytest.mark.parametrize(("col"), [("s")])
def test_median_series_raises_on_str(
    constructor_eager: Any,
    col: str,
) -> None:
    series = nw.from_native(constructor_eager(data), eager_only=True)[col]
    with pytest.raises(
        TypeError,
        match="`median` operation not supported for non-numeric input type."
    ):
        series.median()

Please correct me if I'm completely off-road! Thanks

@FBruzzesi
Copy link
Member

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is? I would have gone the first way, but I'm all ears! :D

Hey @AlessandroMiola , I would personally opt for the second option: you could create a InvalidOperationError class in _exceptions.py and raise that one. For series it would be even possible to do so in narwhals.Series so that the type checking and raise does not get duplicated. For narwhals.Expr I would not know how to do it directly though, since datatype is not know until computation in polars.

So maybe first option 😂

@FBruzzesi
Copy link
Member

Hey @AlessandroMiola , I think this is fairly ready for review, however in #1224 , compare_dicts was renamed to assert_equal_data and that's where the error in CI comes from.

@AlessandroMiola
Copy link
Contributor Author

Hey @AlessandroMiola , I think this is fairly ready for review, however in #1224 , compare_dicts was renamed to assert_equal_data and that's where the error in CI comes from.

Hi @FBruzzesi, thanks for having a look. Besides what you mention, I still have uncommitted changes related to:

  • Can we check that all algorithms ignore nulls/nans?

This is indeed pretty straightforward (as all backends seem to natively do it) and potentially ready for being committed and pushed.

I think that's good enough as long as we document that results may slightly differ between backends because of the difference of underlying algorithms used

As above.

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is? I would have gone the first way, but I'm all ears! :D

Hey @AlessandroMiola , I would personally opt for the second option: you could create a InvalidOperationError class in _exceptions.py and raise that one. For series it would be even possible to do so in narwhals.Series so that the type checking and raise does not get duplicated. For narwhals.Expr I would not know how to do it directly though, since datatype is not know until computation in polars.

Here, I see your point and I indeed had some issues in figuring out how to deal with Expr at the "narwhals-level" (if even possible), but I would need to revise everything

dask's median_approximate fails tests when npartitions=2, though

This may need some investigation

On this, I haven't started yet :|

Later today and tomorrow I should have time, so I'll hopefully update you and/or push something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Expr.median / Series.median
3 participants