Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: skew #1173

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

feat: skew #1173

wants to merge 14 commits into from

Conversation

CarloLepelaars
Copy link
Contributor

@CarloLepelaars CarloLepelaars commented Oct 14, 2024

This PR adds skew to Narwhals. Support is added for Polars, Pandas-like, Arrow and Dask.

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@CarloLepelaars CarloLepelaars changed the title Skewness feat: skew Oct 14, 2024
@CarloLepelaars CarloLepelaars changed the title feat: skew feat: skew Oct 14, 2024
@github-actions github-actions bot added the enhancement New feature or request label Oct 14, 2024
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome effort, thanks @CarloLepelaars , good to have you as contributor! Looks like there's a doctest failure

@CarloLepelaars
Copy link
Contributor Author

Thanks for the kind words! Doctest should be fixed now.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating, just left some comments (i'm a little tired today though so sorry if my comments don't make sense 😅 )

narwhals/_arrow/series.py Outdated Show resolved Hide resolved
narwhals/_pandas_like/series.py Outdated Show resolved Hide resolved
narwhals/expr.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member

btw, if you wanted to just fix a typo somewhere in a separate pr (or, say, take #1170), then once you're already a contributor, CI will always run automatically without me having to approve and run - just bringing this up in case it makes it easier for you

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @CarloLepelaars, thanks for the PR!

I left a few comments - the main challenge seems to be how different implementations are between pandas and polars native methods. However polars provide the formula it uses for the computation. It should be possible to reproduce that with native methods or using the series/expr methods that are already implemented in narwhals :)

narwhals/_arrow/namespace.py Outdated Show resolved Hide resolved
@@ -298,6 +299,17 @@ def std(self, ddof: int = 1) -> int:

return pc.stddev(self._native_series, ddof=ddof) # type: ignore[no-any-return]

def skew(self) -> float:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it would end up returning a pyarrow scalar, I think we should keep the implementation with native methods, or you can reuse methods implemented, such as all elementary operations

narwhals/_pandas_like/namespace.py Outdated Show resolved Hide resolved
narwhals/_pandas_like/series.py Show resolved Hide resolved
narwhals/_polars/namespace.py Outdated Show resolved Hide resolved
narwhals/expr.py Outdated Show resolved Hide resolved
narwhals/expr.py Show resolved Hide resolved
narwhals/_pandas_like/series.py Outdated Show resolved Hide resolved
@@ -519,6 +519,40 @@ def mean(self) -> Any:
"""
return self._compliant_series.mean()

def skew(self) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as Expr.skew, polars exposes a bias parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See conversation in narwhals/expr.py

@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 14, 2024

Hey @CarloLepelaars, thanks for the PR!

I left a few comments - the main challenge seems to be how different implementations are between pandas and polars native methods. However polars provide the formula it uses for the computation. It should be possible to reproduce that with native methods or using the series/expr methods that are already implemented in narwhals :)

This is indeed challenging @FBruzzesi. I've made it so every backend returns the biased population skewness, but we can potentially include an option for the unbiased skewness.

@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 17, 2024

Hmm, any idea what this last error for Marimo Python 3.12 is about? This is the only workflow breaking.

FAILED tests/_plugins/ui/_impl/tables/test_narwhals.py::TestNarwhalsTableManagerFactory::test_complex_data_field_types - TypeError: write() argument must be str, not dict

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @CarloLepelaars thanks for adjusting! This looks better now!

I left a comment for the pyarrow case, and I have other two considerations:

  • Should we account for the len(ser) < 3 case and return 0?
  • It may be worth checking that the numbers are same even when nulls are present

narwhals/_arrow/series.py Outdated Show resolved Hide resolved
narwhals/series.py Outdated Show resolved Hide resolved
@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 18, 2024

  • Should we account for the len(ser) < 3 case and return 0?

Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

I thought that Pandas uses the SciPy implementation of skew under the hood, but apparently they are different?

>>> sample_data = [2, 10]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 2 elements:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 2 elements:
SciPy:  0.000000
Pandas: nan
Polars: 0.000000
# ----------------------------------------------
>>> sample_data = [2]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 2 elements:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 1 element:
SciPy:  nan
Pandas: nan
Polars: nan
  • It may be worth checking that the numbers are same even when nulls are present

Good one! Can add a case in unary_test.py that has nulls.

@FBruzzesi
Copy link
Member

FBruzzesi commented Oct 18, 2024

Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

Yes, we are trying to stick with polars api and behavior, so let's manually force that if needed!

Good one! Can add a case in unary_test.py that has nulls.

That would be great - if it is too much though, we can also make it in a follow up PR

@CarloLepelaars
Copy link
Contributor Author

@FBruzzesi

I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

@FBruzzesi
Copy link
Member

FBruzzesi commented Oct 18, 2024

I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

Thanks for addressing the cases, the CI failure seems unrelated.

However I am still not quite sure that we are matching polars behavior. When counting number of elements for the base cases, we should ignore null values, then (pseudo code):

if n_not_nulls==0:
    return None   # same as pl.Series([]).skew() and pl.Series([None]).skew()
elif n_not_nulls==1:
    return float("nan")  # same as pl.Series([1]).skew() and pl.Series([1, None]).skew()
elif n_not_nulls==2:
    return 0.0  # same as pl.Series([1, 2]).skew() and pl.Series([1, 2, None]).skew()
else:
    return <compute_skew>

@CarloLepelaars
Copy link
Contributor Author

Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

@FBruzzesi
Copy link
Member

Hey @CarloLepelaars, thanks for adjusting. CI is failing because in #1224 , compare_dicts was renamed to assert_equal_data.

Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

Regarding dask, I am not able to try it now, bif could definitly be a tricky one to get right! I am ok with marking it as xfail in tests for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants