
PERF: improve iloc list indexing #15504

Closed

Conversation

jorisvandenbossche
Member

I profiled iloc with a list indexer and noticed a few parts that can seemingly be optimized rather easily:

  • convert early to array, and keep as array
  • do not validate the input for out-of-bounds indices (np.take already does this validation, and much faster)
  • do not validate kwargs (for numpy compat) (-> could define a _take method that does not do this for internal usage?)
  • use fastpath for resulting Series creation
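The bounds-check point is easy to verify directly: np.take performs its own (C-level) bounds validation and raises IndexError for out-of-bounds indices, so a separate Python-side check before calling it is redundant work. A minimal sketch:

```python
import numpy as np

arr = np.random.randn(10000)

# np.take validates the indices itself at C level ...
result = np.take(arr, [10, 20, 30])
assert result.shape == (3,)

# ... and raises IndexError for out-of-bounds indices, so a separate
# Python-level bounds check before calling it duplicates this work.
try:
    np.take(arr, [10, 20000])
except IndexError:
    print("out of bounds caught by np.take")
```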

The currently failing tests are just the ones that checked for the kwarg validation (plus one failure due to the fastpath, where object datetimes are no longer coerced to datetime64).

Compared to master, I get a 40-60% improvement on some simple tests:

PR:

In [1]: s = pd.Series(np.random.randn(10000))

In [2]: s.iloc[10]
Out[2]: 0.91268983379322421

In [3]: %timeit s.iloc[[10]]
The slowest run took 14.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 47.3 µs per loop

In [4]: indexer = list(range(10, 30)) + list(range(50,80))

In [5]: %timeit s.iloc[indexer]
The slowest run took 5.92 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 49.8 µs per loop

In [6]: indexer = list(range(10, 30)) + list(range(50,800))

In [7]: %timeit s.iloc[indexer]
10000 loops, best of 3: 104 µs per loop

master:

In [1]: s = pd.Series(np.random.randn(10000))

In [2]: s.iloc[10]
Out[2]: -1.4152272792497846

In [3]: %timeit s.iloc[[10]]
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 103 µs per loop

In [4]: indexer = list(range(10, 30)) + list(range(50,80))

In [5]: %timeit s.iloc[indexer]
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 131 µs per loop

In [6]: indexer = list(range(10, 30)) + list(range(50,800))

In [7]: %timeit s.iloc[indexer]
The slowest run took 17.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 247 µs per loop

There are some details to work out, but already putting it up here for discussion (and need to run the benchmark suite).

Of the remaining time iloc takes, a large part is spent creating the actual result (the new Series): creating a new index (~17%) and creating a new Series from the values and index (~32%). That is almost half of the total time, so I wonder whether this part can be sped up as well.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 27, 2017
@jorisvandenbossche
Member Author

@jreback Can you take a look at this?
@gfyoung do you have an opinion about the kwarg validation in take? In this case it is always unnecessary, so it adds overhead that could be avoided (one option is a _take method that skips the validation for internal use, but that makes the code uglier).

@gfyoung
Member

gfyoung commented Mar 29, 2017

@jorisvandenbossche : It's not ideal, but given that users have had issues with this in the past, I would prefer that you keep that check in the code rather than remove it. If you leave the check in, what is the speedup?

@jorisvandenbossche
Member Author

@gfyoung It's not the most important part of the speed-up:

In [53]: from pandas.compat.numpy import function as nv

In [54]: nv.validate_take(tuple(), dict())

In [55]: %timeit nv.validate_take(tuple(), dict())
The slowest run took 9.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.25 µs per loop

but it is called several times, in this case I think twice, which gives ca 5 µs out of the total ca 50 µs (for one of the examples above). So certainly not huge, but for performance-sensitive code it is a significant share for something that is not necessary here.

Another option (instead of the _take) is to speed up the actual validation function for these cases:

In [61]: def validate_take(args, kwargs):
    ...:     if args or kwargs:
    ...:         nv.validate_take(args, kwargs)
    ...:         

In [62]: validate_take(tuple(), dict())

In [63]: %timeit validate_take(tuple(), dict())
The slowest run took 6.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 512 ns per loop

In [65]: %timeit validate_take(None, dict())
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 322 ns per loop

or we could do this in the take method itself:

def take(self, ..., **kwargs):
    if kwargs:
        nv.validate_take(tuple(), kwargs)
    ...

@gfyoung
Member

gfyoung commented Mar 29, 2017

@jorisvandenbossche : Pre-checking kwargs or args is certainly a possibility.

@jreback
Contributor

jreback commented Mar 30, 2017

do we have micro benchmarks for this?

@@ -2378,7 +2378,8 @@ def take(self, indices, axis=0, convert=True, is_copy=False, **kwargs):
--------
numpy.ndarray.take
"""
nv.validate_take(tuple(), kwargs)
if kwargs:
Contributor


I assume this is going to go back in after #15850?

Member Author


We can leave it in even then (for performance-sensitive methods like take here). It is still 10x faster even after #15850 (although 10x matters much less at those low absolute numbers).

Contributor


ahh ok sure.

@jorisvandenbossche
Member Author

do we have micro benchmarks for this?

We actually already have some, but I should run them (the timings at the top are only some basic %timeit runs).

@jorisvandenbossche
Member Author

So the remaining test failure is related to the following:

In [1]: import pandas.util.testing as tm

In [3]: s = tm.makeObjectSeries()

In [4]: s
Out[4]: 
MgrcA7IMuH    2000-01-03 00:00:00
QbbI4vogXZ    2000-01-04 00:00:00
f37URcolZJ    2000-01-05 00:00:00
...
aQWgxosz9u    2000-02-09 00:00:00
HnXFRuUGov    2000-02-10 00:00:00
2SA0hNdHwt    2000-02-11 00:00:00
dtype: object

In [5]: s.values
Out[5]: 
array([datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
...
       datetime.datetime(2000, 2, 9, 0, 0),
       datetime.datetime(2000, 2, 10, 0, 0),
       datetime.datetime(2000, 2, 11, 0, 0)], dtype=object)

In [6]: s.take([1,3,8])
Out[6]: 
QbbI4vogXZ    2000-01-04 00:00:00
zUVaKnBXUH    2000-01-06 00:00:00
TZT2OzuB7y    2000-01-13 00:00:00
dtype: object

In [7]: s.take([1,3,8]).values
Out[7]: 
array([datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 13, 0, 0)], dtype=object)

So in the above (this is the behaviour of this PR), take preserves the object dtype with datetime.datetime values (I suppose due to using the fastpath in Series creation), while the test expects a datetime64 result.
Personally, I find this behaviour preferable, as simple positional indexing should not change the dtype. But of course we should agree that changing the test is OK here (and pandas currently does/assumes such inference in many places, so this possibly has larger implications).
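The dtype question can be illustrated without the test machinery: the regular Series constructor infers a datetime64 dtype from object datetimes, while constructing with an explicit object dtype (comparable to what the fastpath preserves) keeps the original values. An illustrative sketch:

```python
import datetime

import pandas as pd

vals = [datetime.datetime(2000, 1, 4), datetime.datetime(2000, 1, 6)]

# The default constructor infers a datetime64 dtype from datetime objects ...
inferred = pd.Series(vals)
print(inferred.dtype)  # a datetime64 dtype

# ... while an explicit object dtype keeps the plain datetime.datetime
# values, which is what the fastpath effectively preserves after take.
preserved = pd.Series(vals, dtype=object)
print(preserved.dtype)  # object
assert type(preserved.iloc[0]) is datetime.datetime
```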

I ran the indexing benchmarks:

    before     after       ratio
  [1f890607] [6d2705cd]
+    8.90μs    16.21μs      1.82  period.period_standard_indexing.time_shallow_copy
+  240.70μs   364.02μs      1.51  indexing.DataFrameIndexing.time_iloc_dups
-   62.26ms    51.08ms      0.82  indexing.Int64Indexing.time_getitem_list_like
-    9.44μs     5.02μs      0.53  indexing.DataFrameIndexing.time_get_value_ix
-   76.00μs    38.18μs      0.50  indexing.Int64Indexing.time_iloc_list_like
-    1.76ms   136.69μs      0.08  indexing.Int64Indexing.time_iloc_array
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

So the iloc list_like and array cases improve as expected (the array case even more than I expected). Why the period shallow copy would be much slower is not really clear to me (I don't think I touched code related to shallow_copy). But when I reran the benchmarks, the two tests showing a slowdown were not consistently slower, so it is probably noise.

@jreback
Contributor

jreback commented Mar 31, 2017

@jorisvandenbossche yeah, fastpath=True does not re-infer things, so that seems to be more correct. This just highlights that preserving numpy semantics is hard while also having easy-to-use pandas ones.

If you can separate out this specific case into a separate test (and add a reference to this issue), that would be great.

After you merge this, I think I will create some pandas2 markers in pytest and start adding them here.

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Mar 31, 2017
@jreback
Contributor

jreback commented Apr 1, 2017

@jorisvandenbossche I think Travis was flaky; can you push again?

@jreback
Contributor

jreback commented Apr 1, 2017

lgtm. merge away.

@jreback jreback closed this in a57e681 Apr 1, 2017
@jreback
Contributor

jreback commented Apr 1, 2017

thanks!

@jorisvandenbossche jorisvandenbossche deleted the perf-iloc-list branch October 16, 2017 09:45