feat: skew #1173

CarloLepelaars · 2024-10-14T15:51:04Z

This PR adds skew to Narwhals. Support is added for Polars, Pandas-like, Arrow and Dask.

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

MarcoGorelli

Awesome effort, thanks @CarloLepelaars , good to have you as contributor! Looks like there's a doctest failure

CarloLepelaars · 2024-10-14T16:29:58Z

Thanks for the kind words! Doctest should be fixed now.

MarcoGorelli

thanks for updating, just left some comments (i'm a little tired today though so sorry if my comments don't make sense 😅 )

narwhals/_arrow/series.py

narwhals/_pandas_like/series.py

narwhals/expr.py

MarcoGorelli · 2024-10-14T17:15:13Z

btw, if you wanted to just fix a typo somewhere in a separate pr (or, say, take #1170), then once you're already a contributor, CI will always run automatically without me having to approve and run - just bringing this up in case it makes it easier for you

FBruzzesi

Hey @CarloLepelaars, thanks for the PR!

I left a few comments - the main challenge seems to be how different the pandas and Polars native implementations are. However, Polars provides the formula it uses for the computation, so it should be possible to reproduce that with native methods or with the series/expr methods that are already implemented in Narwhals :)

narwhals/_arrow/namespace.py

FBruzzesi · 2024-10-14T18:15:44Z

narwhals/_arrow/series.py

@@ -298,6 +299,17 @@ def std(self, ddof: int = 1) -> int:

        return pc.stddev(self._native_series, ddof=ddof)  # type: ignore[no-any-return]

+    def skew(self) -> float:


Although it would end up returning a pyarrow scalar, I think we should keep the implementation on native methods, or reuse methods that are already implemented, such as the elementary operations.

narwhals/_pandas_like/namespace.py

narwhals/_pandas_like/series.py

narwhals/_polars/namespace.py

narwhals/expr.py

narwhals/_pandas_like/series.py

FBruzzesi · 2024-10-14T18:21:53Z

narwhals/series.py

@@ -519,6 +519,40 @@ def mean(self) -> Any:
        """
        return self._compliant_series.mean()

+    def skew(self) -> Any:


Same as Expr.skew, polars exposes a bias parameter

See conversation in narwhals/expr.py

CarloLepelaars · 2024-10-14T20:03:46Z

> Hey @CarloLepelaars, thanks for the PR!
>
> I left a few comments - the main challenge seems to be how different the pandas and Polars native implementations are. However, Polars provides the formula it uses for the computation, so it should be possible to reproduce that with native methods or with the series/expr methods that are already implemented in Narwhals :)

This is indeed challenging @FBruzzesi. I've made it so every backend returns the biased population skewness, but we can potentially include an option for the unbiased skewness.
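For reference, the biased population skewness mentioned here is usually written as the Fisher-Pearson coefficient g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments divided by n. The following is a minimal pure-Python sketch of that formula, not the code from this PR; the function name `biased_skew` and the choice to return nan for empty or zero-variance input are assumptions made for illustration.

```python
import math
from typing import Sequence


def biased_skew(values: Sequence[float]) -> float:
    """Biased (population) skewness: g1 = m3 / m2 ** 1.5."""
    n = len(values)
    if n == 0:
        return math.nan  # assumption: no data means no defined skewness
    mean = sum(values) / n
    # Second and third central moments, both divided by n (not n - 1).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return math.nan  # assumption: zero variance makes the ratio undefined
    return m3 / m2**1.5


print(biased_skew([2, 10]))       # symmetric data, third moment is zero
print(biased_skew([0, 0, 0, 4]))  # long right tail, positive skew
```

Each backend can express the same three steps (mean, central moments, ratio) with its own native operations.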

CarloLepelaars · 2024-10-17T18:31:08Z

Hmm, any idea what this last error for Marimo Python 3.12 is about? This is the only workflow breaking.

FAILED tests/_plugins/ui/_impl/tables/test_narwhals.py::TestNarwhalsTableManagerFactory::test_complex_data_field_types - TypeError: write() argument must be str, not dict

FBruzzesi

Hey @CarloLepelaars thanks for adjusting! This looks better now!

I left a comment for the pyarrow case, and I have two other considerations:

Should we account for the len(ser) < 3 case and return 0?
It may be worth checking that the numbers are the same even when nulls are present

narwhals/_arrow/series.py

narwhals/series.py

CarloLepelaars · 2024-10-18T13:04:07Z

> Should we account for the len(ser) < 3 case and return 0?

Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

I thought that Pandas uses the SciPy implementation of skew under the hood, but apparently they are different?

>>> import pandas as pd
>>> import polars as pl
>>> from scipy.stats import skew
>>> sample_data = [2, 10]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 2 elements:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 2 elements:
SciPy:  0.000000
Pandas: nan
Polars: 0.000000
# ----------------------------------------------
>>> sample_data = [2]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 1 element:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 1 element:
SciPy:  nan
Pandas: nan
Polars: nan
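One possible explanation for the two-element difference, offered as an assumption rather than a statement about either library's internals: `scipy.stats.skew` defaults to `bias=True`, while pandas documents `Series.skew` as an unbiased estimate. A common unbiased form is the adjusted Fisher-Pearson coefficient, which multiplies the biased value by sqrt(n(n-1)) / (n-2) and is therefore undefined below three observations. The sketch below illustrates that formula in pure Python; `adjusted_skew` is a hypothetical name and the nan returns are choices made for this sketch.

```python
import math
from typing import Sequence


def adjusted_skew(values: Sequence[float]) -> float:
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1)) / (n-2) * g1."""
    n = len(values)
    if n < 3:
        # The (n - 2) denominator leaves fewer than 3 points undefined.
        return math.nan
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return math.nan  # assumption: treat zero variance as undefined here
    g1 = m3 / m2**1.5  # biased (population) skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


print(adjusted_skew([2, 10]))       # nan, only two observations
print(adjusted_skew([0, 0, 0, 4]))  # larger than the biased value for the same data
```

Under this reading, the biased formula gives 0 for two distinct values because the third moment vanishes, while the adjusted formula has no value at all for n = 2.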

> It may be worth checking that the numbers are the same even when nulls are present

Good one! Can add a case in unary_test.py that has nulls.

FBruzzesi · 2024-10-18T13:33:07Z

> Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

Yes, we are trying to stick with polars api and behavior, so let's manually force that if needed!

> Good one! Can add a case in unary_test.py that has nulls.

That would be great - if it is too much though, we can also make it in a follow up PR

CarloLepelaars · 2024-10-18T15:28:34Z

@FBruzzesi

I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

FBruzzesi · 2024-10-18T21:12:44Z

> I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

Thanks for addressing the cases, the CI failure seems unrelated.

However I am still not quite sure that we are matching polars behavior. When counting number of elements for the base cases, we should ignore null values, then (pseudo code):

if n_not_nulls==0:
    return None   # same as pl.Series([]).skew() and pl.Series([None]).skew()
elif n_not_nulls==1:
    return float("nan")  # same as pl.Series([1]).skew() and pl.Series([1, None]).skew()
elif n_not_nulls==2:
    return 0.0  # same as pl.Series([1, 2]).skew() and pl.Series([1, 2, None]).skew()
else:
    return <compute_skew>
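A runnable version of the pseudo code above could look like the following sketch. It uses plain Python lists with `None` standing in for nulls; the name `skew_ignoring_nulls`, the biased formula in the general branch, and the nan return for zero variance are assumptions for illustration, not the code that landed in the PR.

```python
import math
from typing import Optional, Sequence


def skew_ignoring_nulls(values: Sequence[Optional[float]]) -> Optional[float]:
    """Biased skewness with the base cases decided on the non-null count."""
    non_null = [x for x in values if x is not None]
    n = len(non_null)
    if n == 0:
        return None  # empty or all-null input
    if n == 1:
        return math.nan  # a single observation
    if n == 2:
        return 0.0  # two observations
    mean = sum(non_null) / n
    m2 = sum((x - mean) ** 2 for x in non_null) / n
    m3 = sum((x - mean) ** 3 for x in non_null) / n
    if m2 == 0:
        return math.nan  # assumption: zero variance is treated as undefined
    return m3 / m2**1.5


print(skew_ignoring_nulls([None]))           # None
print(skew_ignoring_nulls([1, None]))        # nan
print(skew_ignoring_nulls([1, 2, None]))     # 0.0
print(skew_ignoring_nulls([1, 2, 3, None]))  # 0.0
```

Filtering nulls first means the same branches apply whether a short series is short by construction or only because most of its entries are null.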

CarloLepelaars · 2024-10-23T13:18:43Z

Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

FBruzzesi · 2024-10-24T07:23:02Z

Hey @CarloLepelaars, thanks for adjusting. CI is failing because in #1224, compare_dicts was renamed to assert_equal_data.

> Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

Regarding dask, I am not able to try it now, but it could definitely be a tricky one to get right! I am ok with marking it as xfail in tests for now.

Implement skew for Arrow, Pandas-like and Polars

90d9742

CarloLepelaars changed the title ~~Skewness~~ feat: skew Oct 14, 2024

CarloLepelaars changed the title ~~feat: skew~~ feat: skew Oct 14, 2024

github-actions bot added the enhancement New feature or request label Oct 14, 2024

MarcoGorelli reviewed Oct 14, 2024

Fix doctests

c82fec1

MarcoGorelli reviewed Oct 14, 2024

FBruzzesi reviewed Oct 14, 2024

CarloLepelaars added 2 commits October 14, 2024 21:43

Remove skew in namespace. Remove n > 3 requirement. Fix expr doc

e118e4d

Use biases population skewness

2530f81

CarloLepelaars added 5 commits October 15, 2024 18:09

Add pyarrow example for skew Expr

fc37529

Merge branch 'main' into feat/skew

be2f503

Fix: Add a_skew to schema

02fdb4c

Use native operation for PandasLikeSeries skew. Dask skew expr

895be9c

Use native pyarrow operations for skew

a3b71bc

Merge branch 'main' into feat/skew

9ed06d7

FBruzzesi reviewed Oct 17, 2024

Simplify arrow skew. non-trivial example for series.skew.

4ff077d

unary_test with nan data. 2 element and 1 element unary tests

11efd49

Fix doctest for Series skew

26a64f8

Make skew nan policy consistent with Polars

2014036
