Faster groupby! #179

Merged 6 commits on May 18, 2014
18 changes: 13 additions & 5 deletions toolz/itertoolz.py
@@ -3,7 +3,8 @@
 import collections
 import operator
 from functools import partial
-from toolz.compatibility import map, filter, filterfalse, zip, zip_longest
+from toolz.compatibility import (map, filter, filterfalse, zip, zip_longest,
+                                 iteritems)


 __all__ = ('remove', 'accumulate', 'groupby', 'merge_sorted', 'interleave',
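Aside: the newly imported iteritems is toolz's Python 2/3 compatibility shim, used below so the final fix-up loop stays lazy on Python 2. A rough sketch of what such a shim looks like (assumed shape, not copied from toolz.compatibility):

    import sys

    if sys.version_info[0] == 2:
        def iteritems(d):
            return d.iteritems()  # lazy iterator on Python 2
    else:
        def iteritems(d):
            return iter(d.items())  # items() is already a lazy view on Python 3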
@@ -66,12 +67,19 @@ def groupby(func, seq):
     {False: [1, 3, 5, 7], True: [2, 4, 6, 8]}

     See Also:
-        ``countby``
+        countby
     """
-    d = collections.defaultdict(list)
+    d = {}
     for item in seq:
-        d[func(item)].append(item)
-    return dict(d)
+        key = func(item)
+        if key not in d:
Member commented on the `if key not in d:` line:

Using try-except here might be a bit faster in cases with more than a few repeats. I get only modest improvements on my tiny benchmark though.
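For reference, the try/except variant being discussed would look roughly like this; a sketch assuming the same bound-append layout as the diff, with groupby_tryexcept as a hypothetical name (the exact code was not posted in the thread):

    def groupby_tryexcept(func, seq):
        d = {}
        for item in seq:
            key = func(item)
            try:
                d[key](item)  # stored bound list.append; fast path on repeated keys
            except KeyError:
                d[key] = [item].append  # first occurrence of this key
        # Recover each list from its bound append method (the PR uses
        # iteritems(d) here for Python 2 compatibility)
        for k, v in d.items():
            d[k] = v.__self__
        return d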

Member Author replied:

The groupby_alt that I posted above always performs better than using a try-except here. You should really test groupby_alt too!

> I get only modest improvements on my tiny benchmark though.

Yeah, the improvements can indeed be modest, but they are most significant when many items are the same. Also, the new implementation is always better than the old implementation in my benchmarks, including the all-the-same, all-different, and tiny-sequence cases.
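A small harness along the lines of those benchmarks might look like this (hypothetical; the author's actual benchmark script is not shown in the thread):

    import collections
    import timeit

    def groupby_old(func, seq):
        # The previous defaultdict-based implementation
        d = collections.defaultdict(list)
        for item in seq:
            d[func(item)].append(item)
        return dict(d)

    def groupby_new(func, seq):
        # The implementation from this diff (d.items() standing in for iteritems)
        d = {}
        for item in seq:
            key = func(item)
            if key not in d:
                d[key] = [item].append
            else:
                d[key](item)
        for k, v in d.items():
            d[k] = v.__self__
        return d

    def identity(x):
        return x

    # The three key distributions mentioned above
    cases = {
        'all same': [0] * 10000,
        'all different': list(range(10000)),
        'tiny seq': [1, 2, 3],
    }
    for name, data in cases.items():
        for fn in (groupby_old, groupby_new):
            t = timeit.timeit(lambda: fn(identity, data), number=200)
            print('%-13s %-12s %.4fs' % (name, fn.__name__, t))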

+            d[key] = [item].append
+        else:
+            d[key](item)
+    # This is okay to do, because we are not adding or removing keys
+    for k, v in iteritems(d):
+        d[k] = v.__self__
+    return d


 def merge_sorted(*seqs, **kwargs):
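Much of the speedup appears to come from the bound-method trick: [item].append creates the one-element list and stores its bound append in a single step, so the hot loop can do d[key](item) without a per-item attribute lookup, and each list is recovered afterwards through the method's __self__ attribute. A minimal illustration:

    >>> appender = [1].append   # bound method; holds a reference to the list
    >>> appender(2)
    >>> appender(3)
    >>> appender.__self__       # the list itself, recovered from the method
    [1, 2, 3]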