Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Silent dropping of nuisance columns in agg_list_like #43741

Merged
merged 11 commits into from
Sep 29, 2021
3 changes: 3 additions & 0 deletions doc/source/user_guide/basics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1045,6 +1045,9 @@ not noted for a particular column will be ``NaN``:
Mixed dtypes
++++++++++++

.. deprecated:: 1.4.0
Attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any porition of the columns or operations provided fail, the call to ``.agg`` will raise.

When presented with mixed dtypes that cannot aggregate, ``.agg`` will only take the valid
aggregations. This is similar to how ``.groupby.agg`` works.

Expand Down
4 changes: 2 additions & 2 deletions doc/source/user_guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -578,7 +578,7 @@ column, which produces an aggregated result with a hierarchical index:

.. ipython:: python

grouped.agg([np.sum, np.mean, np.std])
grouped[["C", "D"]].agg([np.sum, np.mean, np.std])


The resulting aggregations are named for the functions themselves. If you
Expand All @@ -597,7 +597,7 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
.. ipython:: python

(
grouped.agg([np.sum, np.mean, np.std]).rename(
grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename(
columns={"sum": "foo", "mean": "bar", "std": "baz"}
)
)
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,7 @@ aggregations. This is similar to how groupby ``.agg()`` works. (:issue:`15015`)
df.dtypes

.. ipython:: python
:okwarning:

df.agg(['min', 'sum'])

Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -337,6 +337,7 @@ Other Deprecations
- Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the DataFrame afterwards with ``.squeeze("columns")`` instead. (:issue:`43242`)
- Deprecated the ``index`` argument to :class:`SparseArray` construction (:issue:`23089`)
- Deprecated :meth:`.Rolling.validate`, :meth:`.Expanding.validate`, and :meth:`.ExponentialMovingWindow.validate` (:issue:`43665`)
- Deprecated silent dropping of columns that raised a ``TypeError``, ``DataError``, or some cases of ``ValueError`` in :class:`Series.aggregate` and :class:`DataFrame.aggregate` when used with a list (:issue:`43740`)

.. ---------------------------------------------------------------------------

Expand Down
24 changes: 19 additions & 5 deletions pandas/core/apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@
FrameOrSeries,
)
from pandas.util._decorators import cache_readonly
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.cast import is_nested_object
from pandas.core.dtypes.common import (
Expand Down Expand Up @@ -335,6 +336,7 @@ def agg_list_like(self) -> DataFrame | Series:

results = []
keys = []
failed_names = []

# degenerate case
if selected_obj.ndim == 1:
Expand All @@ -344,7 +346,7 @@ def agg_list_like(self) -> DataFrame | Series:
new_res = colg.aggregate(a)

except TypeError:
pass
failed_names.append(com.get_callable_name(a) or a)
else:
results.append(new_res)

Expand All @@ -358,20 +360,23 @@ def agg_list_like(self) -> DataFrame | Series:
for index, col in enumerate(selected_obj):
colg = obj._gotitem(col, ndim=1, subset=selected_obj.iloc[:, index])
try:
new_res = colg.aggregate(arg)
with warnings.catch_warnings(record=True) as w:
new_res = colg.aggregate(arg)
if len(w) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are all warnings that happen here going to represent a failure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah - thanks. No, probably not. Will inspect the warnings themselves.

Copy link
Member Author

@rhshadrach rhshadrach Sep 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got a bit hairy, I wasn't able to find a good way to capture and suppress only our warnings while passing through any other warnings and being able to determine if we raised. Supplying a filter within the catch_warnings context manager means we can't fell if we raised; using record=True there means all user warnings would also be suppressed.

If there are any suggestions for a better implementation, it would be appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if you dont catch warnings at all here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running

def foo(x):
   return x.prod()
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': ['z', 'w'], 'd': ['a', 'b']})
df.agg([foo, lambda x: x.sum()])

gives

FutureWarning: ['foo'] did not aggregate successfully.
FutureWarning: ['foo'] did not aggregate successfully.
FutureWarning: ['foo'] did not aggregate successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think there's something like a filter="once" to prevent exactly that (and i think its the default)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - thanks. Digging into this, the only way I can see why I'm getting multiple warnings is because of https://bugs.python.org/issue29672. I haven't confirmed yet if the code within agg ever hits a separate catch_warnings; if it does that will cause repeat messages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still going to verify that the use of another catch_warnings (in a different part of the code) is the cause for the multiple messages I am seeing without the special handling in this PR.

Copy link
Member Author

@rhshadrach rhshadrach Sep 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've confirmed it is because of the Python bug mentioned above. The code in the sample hits pandas.core.indexes.base.Index_with_infer which uses the catch_warnings context manager. Here is a MWE to demonstrate the bug (mirroring the structure of agg_list_like):

def foo():
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", ".*the Index constructor", FutureWarning)


def bar(ndim):
    if ndim == 1:
        pass
    else:
        for _ in range(3):
            bar(1)

    warnings.warn('test warning', FutureWarning)
    foo()

bar(2)

With the call to foo from within bar, three warning messages are printed. When the call to foo is removed, only one is printed.

With this, are we okay with catching the warnings as was merged @jbrockmendel?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im fine with it. a comment to this effect might be worthwhile

failed_names.append(col)
except (TypeError, DataError):
pass
failed_names.append(col)
except ValueError as err:
# cannot aggregate
if "Must produce aggregated value" in str(err):
# raised directly in _aggregate_named
pass
failed_names.append(col)
elif "no results" in str(err):
# reached in test_frame_apply.test_nuiscance_columns
# where the colg.aggregate(arg) ends up going through
# the selected_obj.ndim == 1 branch above with arg == ["sum"]
# on a datetime64[ns] column
pass
failed_names.append(col)
else:
raise
else:
Expand All @@ -384,6 +389,15 @@ def agg_list_like(self) -> DataFrame | Series:
if not len(results):
raise ValueError("no results")

if len(failed_names) > 0:
warnings.warn(
f"{failed_names} did not aggregate successfully. If any error is "
f"raised this will raise in a future version of pandas. "
f"Drop these columns/ops to avoid this warning.",
FutureWarning,
stacklevel=find_stack_level(),
)

try:
concatenated = concat(results, keys=keys, axis=1, sort=False)
except TypeError as err:
Expand Down
18 changes: 14 additions & 4 deletions pandas/tests/apply/test_frame_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -1087,12 +1087,16 @@ def test_agg_multiple_mixed_no_warning():
index=["min", "sum"],
)
# sorted index
with tm.assert_produces_warning(None):
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = mdf.agg(["min", "sum"])

tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(None):
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = mdf[["D", "C", "B", "A"]].agg(["sum", "min"])

# GH40420: the result of .agg should have an index that is sorted
Expand Down Expand Up @@ -1201,7 +1205,10 @@ def test_nuiscance_columns():
expected = Series([6, 6.0, "foobarbaz"], index=["A", "B", "C"])
tm.assert_series_equal(result, expected)

result = df.agg(["sum"])
with tm.assert_produces_warning(
FutureWarning, match=r"\['D'\] did not aggregate successfully"
):
result = df.agg(["sum"])
expected = DataFrame(
[[6, 6.0, "foobarbaz"]], index=["sum"], columns=["A", "B", "C"]
)
Expand Down Expand Up @@ -1433,7 +1440,10 @@ def foo(s):
return s.sum() / 2

aggs = ["sum", foo, "count", "min"]
result = df.agg(aggs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['item'\] did not aggregate successfully"
):
result = df.agg(aggs)
expected = DataFrame(
{
"item": ["123456", np.nan, 6, "1"],
Expand Down
10 changes: 8 additions & 2 deletions pandas/tests/groupby/aggregate/test_aggregate.py
Original file line number Diff line number Diff line change
Expand Up @@ -339,8 +339,14 @@ def test_multiple_functions_tuples_and_non_tuples(df):
expected = df.groupby("A")["C"].agg(ex_funcs)
tm.assert_frame_equal(result, expected)

result = df.groupby("A").agg(funcs)
expected = df.groupby("A").agg(ex_funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['B'\] did not aggregate successfully"
):
result = df.groupby("A").agg(funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['B'\] did not aggregate successfully"
):
expected = df.groupby("A").agg(ex_funcs)
tm.assert_frame_equal(result, expected)


Expand Down
6 changes: 4 additions & 2 deletions pandas/tests/groupby/aggregate/test_other.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,13 +45,15 @@ def peak_to_peak(arr):
return arr.max() - arr.min()

with tm.assert_produces_warning(
FutureWarning, match="Dropping invalid", check_stacklevel=False
FutureWarning,
match=r"\['key2'\] did not aggregate successfully",
):
expected = grouped.agg([peak_to_peak])
expected.columns = ["data1", "data2"]

with tm.assert_produces_warning(
FutureWarning, match="Dropping invalid", check_stacklevel=False
FutureWarning,
match=r"\['key2'\] did not aggregate successfully",
):
result = grouped.agg(peak_to_peak)
tm.assert_frame_equal(result, expected)
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -583,7 +583,10 @@ def test_frame_multi_key_function_list():

grouped = data.groupby(["A", "B"])
funcs = [np.mean, np.std]
agged = grouped.agg(funcs)
with tm.assert_produces_warning(
FutureWarning, match=r"\['C'\] did not aggregate successfully"
):
agged = grouped.agg(funcs)
expected = pd.concat(
[grouped["D"].agg(funcs), grouped["E"].agg(funcs), grouped["F"].agg(funcs)],
keys=["D", "E", "F"],
Expand Down
10 changes: 6 additions & 4 deletions pandas/tests/resample/test_resample_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -350,10 +350,12 @@ def test_agg():
expected = pd.concat([a_mean, a_std, b_mean, b_std], axis=1)
expected.columns = pd.MultiIndex.from_product([["A", "B"], ["mean", "std"]])
for t in cases:
with tm.assert_produces_warning(None):
# .var on dt64 column raises and is dropped, but the path in core.apply
# that it goes through will still suppress a TypeError even
# once the deprecations in the groupby code are enforced
warn = FutureWarning if t in cases[1:3] else None
with tm.assert_produces_warning(
warn,
match=r"\['date'\] did not aggregate successfully",
):
# .var on dt64 column raises and is dropped
result = t.aggregate([np.mean, np.std])
tm.assert_frame_equal(result, expected)

Expand Down