Faster groupby! #179
Merged · 6 commits into pytoolz:master · May 18, 2014
Conversation

@eriknw (Member) commented May 10, 2014

Issue #178 impressed upon me just how costly attribute resolution can be. In this case, `groupby` was made faster by avoiding resolving the attribute `list.append`.
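To see that cost in isolation, here is a minimal micro-benchmark of the difference (illustrative, not from the PR; absolute timings vary by machine):

```python
import timeit

setup = "lst = []; append = lst.append"

# Attribute resolved on every call:
print(timeit.timeit("lst.append(0)", setup=setup, number=1000000))
# Bound method resolved once, up front:
print(timeit.timeit("append(0)", setup=setup, number=1000000))
```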

This implementation is also more memory efficient than the current version, which uses a `defaultdict` that gets cast to a `dict`. While casting a defaultdict `d` to a dict with `dict(d)` is fast, it is still a fast *copy*.

Honorable mention goes to the following implementation:

```python
import collections
from toolz.compatibility import iteritems

def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv
```

This alternative implementation can at times be *very* impressive. You should play with it!
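For readers new to the trick: a bound method remembers the object it came from, so the list can be recovered afterward via `__self__`. A quick demonstration (illustrative):

```python
items = []
append = items.append  # resolve the attribute once

append(1)
append(2)

assert append.__self__ is items  # the bound method remembers its list
assert items == [1, 2]
```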

@mrocklin (Member) commented on the diff, at the new membership check:

```python
key = func(item)
if key not in d:
```

Using try-except here might be a bit faster in cases with more than a few repeats. I get only modest improvements on my tiny benchmark though.
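Presumably something along these lines (a hypothetical sketch of the suggestion, not code from this PR):

```python
from toolz.compatibility import iteritems

def groupby_tryexcept(func, seq):
    d = {}
    for item in seq:
        key = func(item)
        try:
            d[key](item)            # fast path: the group already exists
        except KeyError:
            d[key] = [item].append  # first occurrence creates the group
    for k, v in iteritems(d):
        d[k] = v.__self__
    return d
```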

@eriknw (Member Author) replied:

The `groupby_alt` that I posted above always performs better than using a try-except here. You should really test `groupby_alt` too!

> I get only modest improvements on my tiny benchmark though.

Yeah, the improvements can indeed be modest, but they are most significant when many items are the same. Also, the new implementation is always better than the old implementation in my benchmarks, including the all-same, all-different, and tiny-seq cases.

@mrocklin (Member) commented:

This is really cool stuff. I'm a bit concerned that we've completely left understandable Python, but for operations like `groupby` I think it's clearly worth it. Caching method dispatch is also a nice trick up our sleeve.

Is this class of optimizations useful at all in Cython?

@mrocklin (Member) commented:

A few people I run into know that `defaultdict` is fast. I don't think I've ever met someone who has gone this far :)

@eriknw (Member Author) commented May 10, 2014

> A few people I run into know that `defaultdict` is fast. I don't think I've ever met someone who has gone this far :)

Ha! Well, you laid down the gauntlet a couple of times in blog posts and messages, saying something to the effect of "we believe this is the fastest pure Python solution available." Although not nearly as clear or pythonic, a faster solution was found. On guard!

> This is really cool stuff. I'm a bit concerned that we've completely left understandable Python, but for operations like `groupby` I think it's clearly worth it. Caching method dispatch is also a nice trick up our sleeve.

Yeah, agreed. Although this is clearly out of the ordinary--and some might say perverse--I don't think it's so obtuse that it can't be understood with a bit of effort. Perhaps I should add a couple of code comments to make it easier to understand (i.e., this is faster because it avoids the `list.append` attribute lookup, and `v.__self__` refers to the list object associated with this append method).

> Is this class of optimizations useful at all in Cython?

I have no idea. I really want to have a variational benchmarking framework before exploring such optimizations.

@mrocklin (Member) commented:

Maybe we should retain the idiomatic implementation as a comment

@eriknw (Member Author) commented May 10, 2014

> Maybe we should retain the idiomatic implementation as a comment

Good idea. We also do this for `merge_sorted` when a key is defined.
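i.e., a comment block like this sitting next to the optimized code (illustrative):

```python
# Equivalent, idiomatic implementation of ``groupby``, kept here
# as a reference for readers of the optimized version below:
#
#     d = collections.defaultdict(list)
#     for item in seq:
#         d[func(item)].append(item)
#     return dict(d)
```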

eriknw added 2 commits May 10, 2014 12:59
Benchmarks have a tendency to use data that is pathologically perfect.
In such pathological cases, it was unclear whether the current or previous
implementation would be preferred.  However, when the input data gets
shuffled, the implementation in this commit is *clearly* superior.
Therefore, I believe this is the best implementation for *real* data.
@eriknw (Member Author) commented May 10, 2014

By the way, @mrocklin, I made a slight tweak to the algorithm in `groupby`. I extended my test suite a little, including testing on data that wasn't pathologically perfect. I think the newest version is better, but I know it can be tricky to devise good benchmarks that reflect various data found in actual use. I'm curious how the new implementation performs on your data/benchmarks.

@mrocklin (Member) commented:

Hrm, one largeish benchmark would be to group up part of the web graph. Here is a 500MB chunk: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/network/part-r-00253.gz. Or use the following to get the whole thing (warning: a few hundred GB compressed).

```
wget -i http://webdatacommons.org/hyperlinkgraph/data/arc.list.txt
```
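For concreteness, grouping such an edge list by source page might look like the sketch below (illustrative; it assumes whitespace-separated "source target" lines and enough memory to hold the result):

```python
from toolz import groupby

with open('part-r-00253') as f:                      # the decompressed chunk
    edges = (line.split() for line in f)
    outlinks = groupby(lambda edge: edge[0], edges)  # group edges by source node
```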

@eriknw (Member Author) commented May 11, 2014

Wow, the 500MB chunk is more like 4GB!

EDIT: There is a typo in `groupby_cur` below. We should revert back to `groupby_prev`.

Before I play with that, I suppose I should share my current benchmarks. Let's start by defining the functions:

```python
import collections
import random
from toolz import identity
from toolz.compatibility import iteritems

def groupby_orig(func, seq):
    d = collections.defaultdict(list)
    for item in seq:
        d[func(item)].append(item)
    return dict(d)

def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv

def groupby_prev(func, seq):
    d = {}
    for item in seq:
        key = func(item)
        if key not in d:
            d[key] = [item].append
        else:
            d[key](item)
    for k, v in iteritems(d):
        d[k] = v.__self__
    return d

def groupby_cur(func, seq):
    d = {}
    for item in seq:
        key = func(item)
        if key in d:
            d[key](item)
        else:
            d[key] = [].append  # XXX typo!
            # d[key] = [item].append  <-- should be this
    for k, v in iteritems(d):
        d[k] = v.__self__
    return d
```

Now let's use the perfectly behaved data. This covers the cases where all data is duplicated, all data is unique, and many states in between:

```python
In [2]: data = range(10000) * 1

In [3]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 12.3 ms per loop

In [4]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 16 ms per loop

In [5]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 10.5 ms per loop

In [6]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 9.68 ms per loop

In [7]: data = range(10000/3) * 3

In [8]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 7.44 ms per loop

In [9]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 6.82 ms per loop

In [10]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 6.98 ms per loop

In [11]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 6.68 ms per loop

In [12]: data = range(1000) * 10

In [13]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 6.2 ms per loop

In [14]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 4.89 ms per loop

In [15]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 5.62 ms per loop

In [16]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 5.58 ms per loop

In [17]: data = range(100) * 100

In [18]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.49 ms per loop

In [19]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.89 ms per loop

In [20]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.8 ms per loop

In [21]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.9 ms per loop

In [22]: data = range(10) * 1000

In [23]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.31 ms per loop

In [24]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.56 ms per loop

In [25]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.61 ms per loop

In [26]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.67 ms per loop

In [27]: data = range(1) * 10000

In [28]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.12 ms per loop

In [29]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.4 ms per loop

In [30]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.47 ms per loop

In [31]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.52 ms per loop
```

I see some small variation in the results when re-running these in different IPython sessions. The posted results are pretty typical, but I would say that `groupby_alt` performed a bit better than usual.

Now I'm going to intentionally add variance to the results by shuffling the data:

```python
In [32]: data = range(10000) * 1

In [33]: random.shuffle(data)

In [34]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 15.3 ms per loop

In [35]: %timeit groupby_alt(identity, data)
10 loops, best of 3: 19.4 ms per loop

In [36]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 13.1 ms per loop

In [37]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 11 ms per loop

In [38]: data = range(10000/3) * 3

In [39]: random.shuffle(data)

In [40]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 8.11 ms per loop

In [41]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 7.67 ms per loop

In [42]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.92 ms per loop

In [43]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 7.43 ms per loop

In [44]: data = range(1000) * 10

In [45]: random.shuffle(data)

In [46]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 6.56 ms per loop

In [47]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 5.41 ms per loop

In [48]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 6.11 ms per loop

In [49]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 6.07 ms per loop

In [50]: data = range(100) * 100

In [51]: random.shuffle(data)

In [52]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.61 ms per loop

In [53]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.99 ms per loop

In [54]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.87 ms per loop

In [55]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 5 ms per loop

In [56]: data = range(10) * 1000

In [57]: random.shuffle(data)

In [58]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.17 ms per loop

In [59]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.6 ms per loop

In [60]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.67 ms per loop

In [61]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.64 ms per loop

In [62]: data = range(1) * 10000

In [63]: random.shuffle(data)

In [64]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.21 ms per loop

In [65]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 3.5 ms per loop

In [66]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.32 ms per loop

In [67]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.53 ms per loop

In [68]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 4.59 ms per loop
```

It's pretty obvious that `groupby_alt` beats the pants off of all the others when there are many duplicates. However, it behaves poorly when there are only a few duplicates. I think it's important to behave well in both regimes.

Let's look at a completely arbitrary and artificial data set to show that performance in all regimes is important:

```python
In [69]: data = range(2000) + range(2000, 3000) * 5 + range(3000, 3100) * 40

In [70]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 8.3 ms per loop

In [71]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 7.67 ms per loop

In [72]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.24 ms per loop

In [73]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 7.08 ms per loop

In [74]: random.shuffle(data)

In [75]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 9.1 ms per loop

In [76]: %timeit groupby_alt(identity, data)
100 loops, best of 3: 8.57 ms per loop

In [77]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 8.07 ms per loop

In [78]: %timeit groupby_cur(identity, data)
100 loops, best of 3: 7.64 ms per loop
```

This is why I prefer `groupby_cur` over `groupby_alt`.

@eriknw (Member Author) commented May 11, 2014

I'm getting really good results with a new implementation:

```python
def groupby_new(func, seq):
    rv = {}
    d = {}
    for item in seq:
        key = func(item)
        if key in d:
            d[key](item)
        elif key not in rv:
            rv[key] = [item]
        else:
            val = d[key] = rv[key].append
            val(item)
    return rv
```

This performs great for groups that have a lot of items, great for groups that have a single item, and good enough for the worst case scenario when groups have two or three items each.

@mrocklin (Member) commented:

What is the intuition behind this approach? It isn't immediately obvious to me (and I haven't had the time yet to sit down and actually see what's going on).

@eriknw (Member Author) commented May 12, 2014

> What is the intuition behind this approach?

To get a sense of the performance behavior and implementation rationale, it helps to compare against the version that was previously used in toolz. I'll get to that, but first let's go through the new implementation line by line:

```python
def groupby_new(func, seq):
    rv = {}
    d = {}
```

`rv` is the dictionary that gets returned, with lists as the values. Note that we don't need to cast a `defaultdict` to `dict` or perform a post-processing iteration to set the dict values to the lists (as is currently done in this PR).

`d` exists to improve asymptotic behavior when there are many items per group, by avoiding the `list.append` attribute resolution. `d` only contains groups of size two or more.

Even though we use two dicts, the impact on memory usage is minimal. In the worst-case scenario, in which all groups have two items, the memory footprint of the containers--the dicts and lists, but not their contents--increases by about 25%.
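That estimate can be roughly sanity-checked with `sys.getsizeof`, which reports a container's overhead without its contents (a CPython-specific, illustrative check; exact numbers vary by version):

```python
import sys

groups = dict((i, [i, i]) for i in range(1000))               # rv: every group has two items
appenders = dict((i, groups[i].append) for i in range(1000))  # d: the second dict

containers = sys.getsizeof(groups) + sum(sys.getsizeof(v) for v in groups.values())
extra = sys.getsizeof(appenders)
print(float(extra) / containers)  # fractional overhead of the second dict's container
```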

```python
    for item in seq:
        key = func(item)
```

Standard iteration.

```python
        if key in d:
            d[key](item)
```

This is for optimal asymptotic performance as the groups get larger. Note that this avoids an attribute look-up; i.e., it doesn't do `d[key].append(item)`.

```python
        elif key not in rv:
            rv[key] = [item]
```

This allows for fast initialization of groups. The implementation that was previously in toolz was very fast at adding groups by doing a check as done above--nearly twice as fast as the current toolz implementation--but it was asymptotically slower. This shows how much room there is to improve when creating a new group. I worry about long tails and groups of only one item, so I want a version of `groupby` that is generally fast in all regimes.

```python
        else:
            val = d[key] = rv[key].append
            val(item)
```

We are adding a second item to a group and adding `list.append` to `d`.

```python
    return rv
```

No casting or post-filtering is necessary.
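Behavior is unchanged from the existing `groupby`; a quick sanity check (illustrative):

```python
groups = groupby_new(len, ['cat', 'dog', 'mouse', 'ox', 'elk'])
assert groups == {2: ['ox'], 3: ['cat', 'dog', 'elk'], 5: ['mouse']}
```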

Up next I'll share benchmarks.

@eriknw (Member Author) commented May 12, 2014

Benchmarks! Sorry again for the long wall of numbers:

```python
import collections
import random
from toolz.compatibility import iteritems

def groupby_orig(func, seq):
    """ Implementation currently in ``toolz``"""
    d = collections.defaultdict(list)
    for item in seq:
        d[func(item)].append(item)
    return dict(d)

def groupby_old(func, seq):
    """ Modified version of what was previously in ``toolz``"""
    d = {}
    for item in seq:
        key = func(item)
        if key not in d:
            d[key] = [item]
        else:
            d[key].append(item)
    return d

def groupby_prev(func, seq):
    """ First version from this PR"""
    d = {}
    for item in seq:
        key = func(item)
        if key not in d:
            d[key] = [item].append
        else:
            d[key](item)
    for k, v in iteritems(d):
        d[k] = v.__self__
    return d

def groupby_new(func, seq):
    """ Newest version in this PR (not yet pushed)"""
    rv = {}
    d = {}
    for item in seq:
        key = func(item)
        if key in d:
            d[key](item)
        elif key not in rv:
            rv[key] = [item]
        else:
            val = d[key] = rv[key].append
            val(item)
    return rv
```

```python
In [2]: identity = lambda x: x

In [3]: data = range(10000) * 1

In [4]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 11.6 ms per loop

In [5]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.55 ms per loop

In [6]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 10.7 ms per loop

In [7]: %timeit groupby_new(identity, data)
100 loops, best of 3: 7.62 ms per loop

In [8]: data = range(10000/3) * 3

In [9]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 7.23 ms per loop

In [10]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.87 ms per loop

In [11]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.09 ms per loop

In [12]: %timeit groupby_new(identity, data)
100 loops, best of 3: 7.23 ms per loop

In [13]: data = range(1000) * 10

In [14]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 6.17 ms per loop

In [15]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.79 ms per loop

In [16]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 5.91 ms per loop

In [17]: %timeit groupby_new(identity, data)
100 loops, best of 3: 5.88 ms per loop

In [18]: data = range(100) * 100

In [19]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.46 ms per loop

In [20]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.49 ms per loop

In [21]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.98 ms per loop

In [22]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.88 ms per loop

In [23]: data = range(10) * 1000

In [24]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.21 ms per loop

In [25]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.06 ms per loop

In [26]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.85 ms per loop

In [27]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.55 ms per loop

In [28]: data = range(1) * 10000

In [29]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.27 ms per loop

In [30]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.11 ms per loop

In [31]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.5 ms per loop

In [32]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.56 ms per loop
```

And the same benchmarks as above, but with the data shuffled:

```python
In [33]: data = range(10000) * 1

In [34]: random.shuffle(data)

In [35]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 13.8 ms per loop

In [36]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.78 ms per loop

In [37]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 13.4 ms per loop

In [38]: %timeit groupby_new(identity, data)
100 loops, best of 3: 8.87 ms per loop

In [39]: data = range(10000/3) * 3

In [40]: random.shuffle(data)

In [41]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 7.77 ms per loop

In [42]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.48 ms per loop

In [43]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.89 ms per loop

In [44]: %timeit groupby_new(identity, data)
100 loops, best of 3: 8.3 ms per loop

In [45]: data = range(1000) * 10

In [46]: random.shuffle(data)

In [47]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 6.53 ms per loop

In [48]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.13 ms per loop

In [49]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 6.18 ms per loop

In [50]: %timeit groupby_new(identity, data)
100 loops, best of 3: 6.28 ms per loop

In [51]: data = range(100) * 100

In [52]: random.shuffle(data)

In [53]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.49 ms per loop

In [54]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.37 ms per loop

In [55]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.94 ms per loop

In [56]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.95 ms per loop

In [57]: data = range(10) * 1000

In [58]: random.shuffle(data)

In [59]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.25 ms per loop

In [60]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.09 ms per loop

In [61]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.77 ms per loop

In [62]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.6 ms per loop

In [63]: data = range(1) * 10000

In [64]: random.shuffle(data)

In [65]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 5.17 ms per loop

In [66]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.03 ms per loop

In [67]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 4.55 ms per loop

In [68]: %timeit groupby_new(identity, data)
100 loops, best of 3: 4.61 ms per loop
```

Next is the benchmark I used in a previous post to test a few regimes of behavior in the same data set:

```python
In [69]: data = range(2000) + range(2000, 3000) * 5 + range(3000, 3100) * 40

In [70]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 8.07 ms per loop

In [71]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.2 ms per loop

In [72]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.21 ms per loop

In [73]: %timeit groupby_new(identity, data)
100 loops, best of 3: 6.74 ms per loop

In [74]: random.shuffle(data)

In [75]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 8.65 ms per loop

In [76]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.86 ms per loop

In [77]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 7.96 ms per loop

In [78]: %timeit groupby_new(identity, data)
100 loops, best of 3: 7.49 ms per loop
```

We expect `groupby_new` to be slowest when groups have two or three items, because this is when `list.append` gets added to the extra dictionary. The benchmark below stresses this with somewhat more realistic data: 3000 items are contained in groups of length one, 3000 in groups of length two, and 3000 in groups of length three.

```python
In [79]: data = range(3000) + range(3000, 4500) * 2 + range(4500, 5500) * 3

In [80]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 8.77 ms per loop

In [81]: %timeit groupby_old(identity, data)
100 loops, best of 3: 6.46 ms per loop

In [82]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 8.35 ms per loop

In [83]: %timeit groupby_new(identity, data)
100 loops, best of 3: 7.5 ms per loop

In [84]: random.shuffle(data)

In [85]: %timeit groupby_orig(identity, data)
100 loops, best of 3: 9.32 ms per loop

In [86]: %timeit groupby_old(identity, data)
100 loops, best of 3: 7.23 ms per loop

In [87]: %timeit groupby_prev(identity, data)
100 loops, best of 3: 9.35 ms per loop

In [88]: %timeit groupby_new(identity, data)
100 loops, best of 3: 8.46 ms per loop
```

As expected, `groupby_old` performs best in the last test, because it initializes groups the fastest. `groupby_old` actually performs pretty well throughout the tests.

Even though these benchmarks are artificial, subject to systematic bias, and probably don't accurately reflect your data set (or mine, or his, or ...), I think they do tell a consistent story. I know it's not a complete story, but my takeaway is that `groupby_new` is probably the best overall, followed by `groupby_prev` (both from this PR). Additionally, all of the above implementations perform excellently, and data sets exist for which each implementation is optimal. I have also learned that it is difficult to thoroughly benchmark `groupby` (especially with a small number of artificial tests)!

eriknw added 2 commits May 11, 2014 23:10
The previous commit was wrong.  It is slower than the commit before.
There was a mistake in the code used for benchmarks.

I believe the new implementation has optimal performance as groups
get larger.  It is also fast when creating a new group.  It is slowest
when each group has two or three items in it, but it is still fast
enough so as to not impact the general performance of the algorithm.

A note on size: using a second dict in the implementation doesn't
add much memory.  Let us consider the size used by all of the
containers--dicts and lists--but not their contents.  For the worst
case scenario in which both dicts have two items (note that `fastdict`
only has groups of length two or greater), memory usage is only
increased by about 25% by having a second dict.
@eriknw (Member Author) commented May 15, 2014

I'm pretty much groupby-ed out, so let me try to concisely summarize the results of my investigations.

First, when 10% or more of the elements being grouped form groups of length one, `groupby_new` (the current version in this PR) is the best, and it also performs pretty well overall:

```python
def groupby_new(func, seq):
    """ Newest version in this PR"""
    rv = {}
    fastdict = {}
    for item in seq:
        key = func(item)
        if key in fastdict:
            fastdict[key](item)
        elif key not in rv:
            rv[key] = [item]
        else:
            val = fastdict[key] = rv[key].append
            val(item)
    return rv
```

Second, when the average group size is about five or greater--and virtually no groups are of size one--then `groupby_alt` (mentioned in the very first post and commit) is the best:

```python
import collections
from toolz.compatibility import iteritems

def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv
```

Third, in between the aforementioned regimes, the competition between variants is close, and other versions can become optimal for a short segment of "data space."

It is my opinion that `groupby_new` or `groupby_alt` should be used. Both dominate in their respective regimes: `groupby_new` when there are many groups of length one (>10% of elements), and `groupby_alt` when all groups have many elements (average group size greater than five, and fewer than 10% of elements form groups of length one). `groupby_alt` performs about a third faster than `groupby_new` asymptotically as all group sizes become larger. `groupby_new` performs about twice as fast as `groupby_alt` in the unlikely asymptotic regime where all groups are of length one.

To put this into context, the original version from this PR, `groupby_prev`, falls solidly in between the other two versions being considered. It doesn't perform quite as poorly--or as well--in either asymptotic regime, but still performs strongly overall. `groupby_prev` is actually better (barely) when group sizes are between two and five (endpoints included). Even so, I don't consider it a "compromise" solution, and I find it hard to argue for `groupby_prev` over the other two.

```python
from toolz.compatibility import iteritems

def groupby_prev(func, seq):
    """ First version from this PR"""
    d = {}
    for item in seq:
        key = func(item)
        if key not in d:
            d[key] = [item].append
        else:
            d[key](item)
    for k, v in iteritems(d):
        d[k] = v.__self__
    return d
```

@mrocklin, do you have a preference? How long are the tails in your data? Do you want to optimize `groupby` for when 10% or more of elements form groups of one, or for when such groups are unlikely and groups become large? We're talking about a potential difference of 30-35% in performance.

@eriknw (Member Author) commented May 15, 2014

Oh, let me share how I arrived at the 10% rule of thumb. `groupby_new` and `groupby_alt` perform basically the same in the following two cases:

  1. 8% of elements form groups of length one and the rest form groups of about length five.
  2. 12% of elements form groups of length one and the rest form "very large" groups.

Also, benchtoolz is awesome!

@mrocklin (Member) commented:

> I'm pretty much groupby-ed out, so let me try to concisely summarize the results of my investigations.

Sounds then like we should make a decision and merge this.

I don't have a good understanding of the distribution of datasets that will be used with `groupby`. I know my own experience, but that's a pretty sparse sampling. Some thoughts:

  1. Historically I've found myself optimizing for the large groups case. Presumably this was driven by my data needs at the time.
  2. People talk a lot about long tailed distributions. Presumably common natural datasets have both lots of groups of size one and a few groups that contain most of the elements.
  3. There is also a large fraction of datasets that are very regular, think grouping stock transactions by ticker symbol. This is the birth-application of Pandas and is perhaps motivating.

If you want me to make an arbitrary decision I'm happy to do so. I definitely trust your intuition here more than mine. Probably we want something that's somewhat robust.

Another thought is to put a couple of implementations into `toolz.itertoolz` but not include them in `toolz.__all__`. We could then bind the `groupby` name to one or the other as we learn more (a sketch follows below).

Let me know if you want me to make an arbitrary decision.
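A sketch of what that could look like (a hypothetical layout, not actual toolz code):

```python
# toolz/itertoolz.py (hypothetical sketch)

__all__ = ['groupby']  # only the chosen name is exported

def groupby_new(func, seq):
    """(implementation from this PR)"""

def groupby_alt(func, seq):
    """(alternative implementation)"""

groupby = groupby_alt  # flip this binding as the benchmarks evolve
```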

@eriknw (Member Author) commented May 15, 2014

> Let me know if you want me to make an arbitrary decision.

Yes, please do :)

But let me share my current bias first. I am leaning slightly towards `groupby_alt`, which performs the best when groups are large and single-element groups are few. Here is my reasoning:

  1. I find it more likely that fewer--not greater--than 10% of elements form groups of size one (especially in cases where performance matters most).
    • Much fewer than 10% is also more likely than much greater, in my opinion.
    • If the cutoff were 5%, I would lean towards `groupby_prev`.
  2. The tail for which `groupby_prev` is efficient is too narrowly defined as groups of length one.
    • Tails often include groups of length two or three or greater, but both versions being considered perform about the same in these cases (and other variants perform even better).
  3. The implementation of `groupby_alt` is easier to understand.

An argument for `groupby_prev` is that it is more versatile--it doesn't behave poorly on any pathological data--and it is likely significantly faster than `groupby_alt` when performance is dominated by many small groupby operations.

Previous revisions of `groupby` considered the performance trade-offs of group creation and appending to groups. Now we are considering even more: the distribution of group sizes. This is more challenging.

> Another thought is to put a couple implementations into toolz.itertoolz

Not sure about this. I have six implementations of `groupby` to add to benchmarkz (do you prefer a different name?), which will run benchmarks using benchtoolz. This will serve as a record of the variations that have been tried, and it allows contributors to add benchmarks that are important to them. As the benchmark suite evolves, we can select a different version to use in toolz.

@mrocklin, decision time!

@mrocklin (Member) commented:

All things being equal, reason 3 compels me. Let's go with alt.

I like the idea of the groupby implementations living in a separate repo that we can refer to.

This is `groupby_alt` from the original commit comment in this branch.
See discussion at pytoolz#179.  This version performs very well as groups become
larger.  The previous implementation performs well when 10% or more of
the elements form groups of length one.

We plan to have various implementations in `benchmarkz` repository, which
will let contributors add benchmarks that they care about and easily run
them on all variants of `groupby`.
@eriknw (Member Author) commented May 15, 2014

Done. I thought that might be the deciding factor :)

@mrocklin (Member) commented:

I think this is the highest discussion-to-lines-changed ratio of any PR I've ever seen. Merging in a bit if no comments.

@eriknw (Member Author) commented May 18, 2014

> I think this is the highest discussion-to-lines-changed ratio of any PR I've ever seen.

Indeed! There were some interesting discoveries and discussions, though. I'm eager to test the variations of `groupby` using benchmarks from other projects, which will become easy to do once benchtoolz and benchmarkz are ready to be shared.

+1 to merge.

@mrocklin merged commit e6a043f into pytoolz:master on May 18, 2014