Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation on empty table should be allowed in some cases #785

Closed
emillynge opened this issue Apr 1, 2020 · 1 comment
Closed

Aggregation on empty table should be allowed in some cases #785

emillynge opened this issue Apr 1, 2020 · 1 comment

Comments

@emillynge
Copy link
Contributor

emillynge commented Apr 1, 2020

Dear Maintainers

I have found, that if I use summarize on a table without any rows, an IllegalArgumentException will be thrown in tech.tablesaw.table.TableSliceGroup.aggregate(ListMultimap<String, AggregateFunction<?, ?>> functions)

I understand that something might break in the function if this check is not performed, but it seems odd to me that the function doesn't just return another empty table. Rather than throwing an error, why not have a check somewhere else that short-circuits the aggregation to always return 0 rows?

If this project is aiming for consistency with pandas for example, then the corresponding behavior would be:

>>> import pandas
>>> df = pandas.DataFrame(columns=["Hello", "World"])
>>> df
Empty DataFrame
Columns: [Hello, World]
Index: []
>>> df.groupby("Hello").sum()
Empty DataFrame
Columns: [World]
Index: []

This may need to be per-aggregation type since pandas will raise en exception in certain cases:

>>> df.groupby("Hello").std()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    df.groupby("Hello").std()
  File "/usr/lib/python3/dist-packages/pandas/core/groupby/groupby.py", line 135
1, in std
    return np.sqrt(self.var(ddof=ddof, **kwargs))
  File "/usr/lib/python3/dist-packages/pandas/core/groupby/groupby.py", line 136
8, in var
    return self._cython_agg_general('var', **kwargs)
  File "/usr/lib/python3/dist-packages/pandas/core/groupby/groupby.py", line 397
4, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only, min_count=min_count)
  File "/usr/lib/python3/dist-packages/pandas/core/groupby/groupby.py", line 404
6, in _cython_agg_blocks
    raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate

Changing behaviour in this way should be backward compatible since users atm. is already checking for, and handling table.rowCount() == 0 before summarizing.

I would not expect anyone to depend on summarize() throwing an exception.

If there is support for such a change, I will write a PR.

@benmccann
Copy link
Collaborator

I'm not as familiar with the summarization and aggregation as @lwhite1 and @ryancerf, but returning an empty table probably makes more sense to me than throwing an exception, so I'd be open to a PR to change it. It's not a goal of ours to match what Pandas does exactly, but I don't mind taking some inspiration from them and doing what they do when it makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants