Faster unique, isdistinct, merge_sorted, and sliding_window. #178
Conversation
The `key` keyword argument to `unique` was changed from `identity` to `None`. This better matches the API elsewhere, and lets us remove `identity` from being redefined in `itertoolz`, which always seemed a little weird.

Most of the speed improvements come from avoiding attribute resolution in frequently run code. Attribute resolution (i.e., the "dot" operator) is probably more costly than one would expect. Fortunately, there weren't many places to apply this optimization, so the impact on code readability was minimal.

`unique` employs another optimization: branching on `key is None` outside the loop (thus requiring two loops). While this violates the DRY principle (and, hence, I would prefer not to do it in general), this is only a few lines of code that remain side-by-side, and the performance increase is worth it.

`merge_sorted` is now optimized for when only a single iterable remains. This makes it *so* much faster in that condition.
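For illustration, here is a minimal sketch of what such a single-iterable fast path can look like (`merge_sorted_sketch` is a hypothetical name, not the PR's actual code): once every other iterator is exhausted, the heap machinery and per-item comparisons are bypassed and the survivor is drained directly.

```python
import heapq


def merge_sorted_sketch(*seqs):
    """Merge pre-sorted iterables; sketch of a single-iterable fast path."""
    # Seed the heap with the first item of each non-empty iterable.
    # Entries are [value, tie-breaking index, iterator].
    pq = []
    for i, seq in enumerate(seqs):
        it = iter(seq)
        for val in it:
            pq.append([val, i, it])
            break
    heapq.heapify(pq)
    while len(pq) > 1:
        val, i, it = s = pq[0]
        yield val
        for nxt in it:               # advance the iterator we just consumed from
            s[0] = nxt
            heapq.heapreplace(pq, s)
            break
        else:                        # iterator exhausted: drop it from the heap
            heapq.heappop(pq)
    if pq:
        # Fast path: only one iterable remains, so every remaining item can
        # be yielded directly with no heap operations or comparisons.
        val, _, it = pq[0]
        yield val
        for val in it:
            yield val
```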
Issue pytoolz#178 impressed upon me just how costly attribute resolution can be. In this case, `groupby` was made faster by avoiding resolving the attribute `list.append`. This implementation is also more memory efficient than the current version, which uses a `defaultdict` that gets cast to a `dict`. While casting a defaultdict `d` to a dict as `dict(d)` is fast, it is still a *copy*.

Honorable mention goes to the following implementation:

```python
import collections

from toolz.compatibility import iteritems  # assumed import for the py2/3 helper


def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv
```

This alternative implementation can at times be *very* impressive. You should play with it!
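A quick sanity check of the trick, for anyone playing along: each dict value is a bound `append` method, and `v.__self__` recovers the list it appends to.

```python
>>> groupby_alt(len, ['cat', 'mouse', 'dog', 'horse'])
{3: ['cat', 'dog'], 5: ['mouse', 'horse']}
```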
```
@@ -120,7 +117,9 @@ def _merge_sorted_key(seqs, key):
    heapq.heapify(pq)

    # Repeatedly yield and then repopulate from the same iterator
    while True:
        heapreplace = heapq.heapreplace
        heappop = heapq.heappop
```
Oh man, I never would have thought of this.
Do you have micro benchmarks to back up the value of these changes?
You bet. The following are variations of `unique`:

```python
from toolz import unique  # original implementation, not from this PR
from cytoolz import unique as cyunique


def unique1(seq, key=None):
    seen = set()
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen.add(val)
            yield item


def unique2(seq, key=None):
    seen = set()
    seen_add = seen.add
    for item in seq:
        val = item if key is None else key(item)
        if val not in seen:
            seen_add(val)
            yield item


def unique3(seq, key=None):
    seen = set()
    seen_add = seen.add
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen_add(val)
            yield item


def unique4(seq, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for item in seq:
            if item not in seen:
                seen_add(item)
                yield item
    else:
        for item in seq:
            val = key(item)
            if val not in seen:
                seen_add(val)
                yield item
```

These are ordered from slowest to fastest. Now the benchmarks:

```python
In [11]: L = range(1000)

In [12]: %timeit list(unique(L))
1000 loops, best of 3: 664 µs per loop

In [13]: %timeit list(unique1(L))
1000 loops, best of 3: 583 µs per loop

In [14]: %timeit list(unique2(L))
1000 loops, best of 3: 403 µs per loop

In [15]: %timeit list(unique3(L))
1000 loops, best of 3: 378 µs per loop

In [16]: %timeit list(unique4(L))
1000 loops, best of 3: 333 µs per loop

In [17]: %timeit list(cyunique(L))
10000 loops, best of 3: 131 µs per loop

In [18]: L = [1] * 1000

In [19]: %timeit list(unique(L))
1000 loops, best of 3: 308 µs per loop

In [20]: %timeit list(unique1(L))
10000 loops, best of 3: 136 µs per loop

In [21]: %timeit list(unique2(L))
10000 loops, best of 3: 198 µs per loop

In [22]: %timeit list(unique3(L))
10000 loops, best of 3: 136 µs per loop

In [23]: %timeit list(unique4(L))
10000 loops, best of 3: 95 µs per loop

In [24]: %timeit list(cyunique(L))
10000 loops, best of 3: 51.1 µs per loop
```
Wow, that's very impressive.
Indeed, which is why I was compelled to try something as perverse as #179!
On the topic of avoiding attribute resolution, another place to apply this optimization is importing. For example, `from heapq import heappop` performs the attribute lookup once at import time, so later calls skip the `heapq.` lookup entirely.
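A minimal sketch of the difference (the `drain_*` helpers are hypothetical, not from the PR):

```python
import heapq
from heapq import heappop  # attribute looked up once, at import time


def drain_attr(pq):
    out = []
    while pq:
        out.append(heapq.heappop(pq))  # resolves the attribute every iteration
    return out


def drain_bound(pq):
    out = []
    while pq:
        out.append(heappop(pq))  # plain name lookup, no "dot" per iteration
    return out
```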
Want to see something awesome? Running the benchmarks in "benchmarkz/bench_unique.py" produces **Time**, **Relative time**, and **Rank** tables. Here is the full output (note that the first half is from `verbose=True` during benchmarking, and the second half is output controlled by the user):

```
Using benchmark file:
    benchmarkz/bench_unique.py
Using arena file:
    toolz_arena/unique.py

bench_all_different
    590  usec - unique0 - (2^9 = 512 loops)
    466  usec - unique1 - (2^10 = 1024 loops)
    400  usec - unique2 - (2^10 = 1024 loops)
    351  usec - unique3 - (2^10 = 1024 loops)
    306  usec - unique4 - (2^10 = 1024 loops)

bench_all_same
    305  usec - unique0 - (2^10 = 1024 loops)
    135  usec - unique1 - (2^12 = 4096 loops)
    197  usec - unique2 - (2^11 = 2048 loops)
    135  usec - unique3 - (2^12 = 4096 loops)
    93.4 usec - unique4 - (2^12 = 4096 loops)

bench_tiny
    2.8  usec - unique0 - (2^17 = 131072 loops)
    2.77 usec - unique1 - (2^17 = 131072 loops)
    2.72 usec - unique2 - (2^17 = 131072 loops)
    2.83 usec - unique3 - (2^17 = 131072 loops)
    2.64 usec - unique4 - (2^17 = 131072 loops)
```
**Benchmarks:** benchmarkz/bench_unique.py
**Functions:** toolz_arena/unique.py
**Time:**
| **Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| ------------------------:|:-----:|:-----:|:-----:|:-----:|:--------:|
| **all_different** (`us`) | 590 | 466 | 400 | 351 | **306** |
| **all_same** (`us`) | 305 | 135 | 197 | 135 | **93.4** |
| **tiny** (`us`) | 2.8 | 2.77 | 2.72 | 2.83 | **2.64** |
**Relative time:**
|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
| **all_different** | 1.93 | 1.52 | 1.31 | 1.15 | **1** |
| **all_same** | 3.27 | 1.45 | 2.11 | 1.45 | **1** |
| **tiny** | 1.06 | 1.05 | 1.03 | 1.07 | **1** |
**Rank:**
|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
| **all_different** | 5 | 4 | 3 | 2 | **1** |
| **all_same** | 5 | 3 | 4 | 2 | **1** |
| **tiny** | 4 | 3 | 2 | 5 | **1** | The files "benchmarkz/bench_unique.py" and "toolz_arena/unique.py" really are as simple as one would hope. "benchmarkz/bench_unique.py" : from toolz import unique
all_different = list(range(1000))
all_same = [1] * 1000
tiny = [1]
def bench_all_different():
list(unique(all_different))
def bench_all_same():
list(unique(all_same))
def bench_tiny():
list(unique(tiny)) The first few lines of "toolz_arena/unique.py": def identity(x):
return x
def unique0(seq, key=identity):
seen = set()
for item in seq:
val = key(item)
if val not in seen:
seen.add(val)
yield item I'll push this code to github soon. |
This looks amazing. Is it a standalone project?
It sure is! Below shows a basic "runbench.py" file. By convention, we look for "benchmark" and "arena" directories in the same directory as "runbench.py", but other paths may be used instead via keyword arguments. Searching for benchmarks and functions to run in those benchmarks doesn't import (and, hence, run) any external Python code, and the user will have a chance to review these and remove or add any files or functions of their choosing:

```python
from benchtoolz import BenchFinder, BenchRunner, BenchPrinter

if __name__ == '__main__':
    benchfinder = BenchFinder(name, cython=False)
    benchrunner = BenchRunner(benchfinder)
    results = benchrunner.run()
    benchprinter = BenchPrinter(results)
    # perhaps we should provide a less ugly way to do this...
    for (benchfile, arenafile), table in sorted(benchprinter.tables.items()):
        gfm_times = benchprinter.to_gfm(table)
        gfm_reltimes = benchprinter.to_gfm(table, relative=True)
        gfm_rank = benchprinter.to_gfm(table, rank=True)
        # print stuff
        ...
```
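For reference, the directory layout this convention implies, assembled from the paths used earlier in the thread (the specific names are this project's, not a requirement):

```
project/
├── runbench.py
├── benchmarkz/            # "benchmark" directory: bench_* functions to time
│   └── bench_unique.py
└── toolz_arena/           # "arena" directory: the function variants under test
    └── unique.py
```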
can i haz it?
Maybe even just seeing the code up on GitHub would be great.
Is this ready to go in?
Yeah, I think it is.