
PERF: improve iloc list indexing #15504

Closed

Conversation

jorisvandenbossche
Member

I profiled iloc with a list indexer and noticed a few parts that can seemingly be optimized rather easily:

  • convert early to array, and keep as array
  • do not validate the input for out-of-bounds indices (np.take already does this validation, and much faster)
  • do not validate kwargs (for numpy compat) (-> could define a _take method that does not do this for internal usage?)
  • use fastpath for resulting Series creation
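The bounds-check point is easy to verify directly: np.take performs its own (C-level) bounds validation and raises IndexError for out-of-bounds indices, so a separate Python-side check before calling it is redundant work. A minimal sketch:

```python
import numpy as np

arr = np.random.randn(10000)

# np.take validates the indices itself at C level ...
result = np.take(arr, [10, 20, 30])
assert result.shape == (3,)

# ... and raises IndexError for out-of-bounds indices, so a separate
# Python-level bounds check before calling it duplicates this work.
try:
    np.take(arr, [10, 20000])
except IndexError:
    print("out of bounds caught by np.take")
```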

The currently failing tests are just the ones that checked for the kwarg validation (plus one failure due to the fastpath, where object datetimes are no longer coerced to datetime64).

Compared to master, I get a 40-60% improvement on some simple tests:

PR:

In [1]: s = pd.Series(np.random.randn(10000))

In [2]: s.iloc[10]
Out[2]: 0.91268983379322421

In [3]: %timeit s.iloc[[10]]
The slowest run took 14.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 47.3 µs per loop

In [4]: indexer = list(range(10, 30)) + list(range(50,80))

In [5]: %timeit s.iloc[indexer]
The slowest run took 5.92 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 49.8 µs per loop

In [6]: indexer = list(range(10, 30)) + list(range(50,800))

In [7]: %timeit s.iloc[indexer]
10000 loops, best of 3: 104 µs per loop

master:

In [1]: s = pd.Series(np.random.randn(10000))

In [2]: s.iloc[10]
Out[2]: -1.4152272792497846

In [3]: %timeit s.iloc[[10]]
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 103 µs per loop

In [4]: indexer = list(range(10, 30)) + list(range(50,80))

In [5]: %timeit s.iloc[indexer]
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 131 µs per loop

In [6]: indexer = list(range(10, 30)) + list(range(50,800))

In [7]: %timeit s.iloc[indexer]
The slowest run took 17.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 247 µs per loop

There are some details to work out, but already putting it up here for discussion (and need to run the benchmark suite).

Of the remaining time iloc takes, a large part is spent creating the actual result (the new Series): creating a new index (~17%) and creating a new Series from the values and index (~32%). That is almost half of the total time, so I wonder whether this part can be sped up as well.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 27, 2017
@jorisvandenbossche
Member Author

@jreback Can you take a look at this?
@gfyoung do you have an opinion about the kwarg validation in take? In this case it is always unnecessary, so it adds overhead that could be avoided (one option is a _take method that skips the validation for internal use, but that makes the code uglier).

@gfyoung
Member

gfyoung commented Mar 29, 2017

@jorisvandenbossche : It's not ideal, but given that users have had issues with this in the past, I would prefer that you keep that check in the code rather than remove it. If you leave the check in, what is the speedup?

@jorisvandenbossche
Member Author

@gfyoung It's not the most important part of the speed-up:

In [53]: from pandas.compat.numpy import function as nv

In [54]: nv.validate_take(tuple(), dict())

In [55]: %timeit nv.validate_take(tuple(), dict())
The slowest run took 9.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.25 µs per loop

but it is called several times, in this case I think twice, which gives ca 5 µs out of the total ca 50 µs (for one of the examples above). So certainly not huge, but for performance-sensitive code it is a significant share for something that is not necessary here.

Another option (instead of the _take) is to speed up the actual validation function for these cases:

In [61]: def validate_take(args, kwargs):
    ...:     if args or kwargs:
    ...:         nv.validate_take(args, kwargs)
    ...:         

In [62]: validate_take(tuple(), dict())

In [63]: %timeit validate_take(tuple(), dict())
The slowest run took 6.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 512 ns per loop

In [65]: %timeit validate_take(None, dict())
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 322 ns per loop

or we could do this in the take method itself:

def take(self, ..., **kwargs):
    if kwargs:
        nv.validate_take(tuple(), kwargs)
    ...

@gfyoung
Member

gfyoung commented Mar 29, 2017

@jorisvandenbossche : Pre-checking kwargs or args is certainly a possibility.

@jreback
Contributor

jreback commented Mar 30, 2017

do we have micro benchmarks for this?

@@ -2378,7 +2378,8 @@ def take(self, indices, axis=0, convert=True, is_copy=False, **kwargs):
--------
numpy.ndarray.take
"""
nv.validate_take(tuple(), kwargs)
if kwargs:
Contributor


I assume this is going to go back in after #15850?

Member Author


We can leave it in even then (for performance-sensitive methods like take here). It is still 10x faster even after #15850 (although 10x matters much less at those low absolute numbers).

Contributor


ahh ok sure.

@jorisvandenbossche
Member Author

do we have micro benchmarks for this?

We actually already have some, but I should run them (the timings at the top are only some basic %timeit runs).

@jorisvandenbossche
Member Author

So the remaining test failure is related to the following:

In [1]: import pandas.util.testing as tm

In [3]: s = tm.makeObjectSeries()

In [4]: s
Out[4]: 
MgrcA7IMuH    2000-01-03 00:00:00
QbbI4vogXZ    2000-01-04 00:00:00
f37URcolZJ    2000-01-05 00:00:00
...
aQWgxosz9u    2000-02-09 00:00:00
HnXFRuUGov    2000-02-10 00:00:00
2SA0hNdHwt    2000-02-11 00:00:00
dtype: object

In [5]: s.values
Out[5]: 
array([datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
...
       datetime.datetime(2000, 2, 9, 0, 0),
       datetime.datetime(2000, 2, 10, 0, 0),
       datetime.datetime(2000, 2, 11, 0, 0)], dtype=object)

In [6]: s.take([1,3,8])
Out[6]: 
QbbI4vogXZ    2000-01-04 00:00:00
zUVaKnBXUH    2000-01-06 00:00:00
TZT2OzuB7y    2000-01-13 00:00:00
dtype: object

In [7]: s.take([1,3,8]).values
Out[7]: 
array([datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 13, 0, 0)], dtype=object)

So in the above (this is the behaviour of this PR), take preserves the object dtype with datetime.datetime values (I suppose due to using the fastpath in Series creation), while the test expects a datetime64 result.
Personally, I find this behaviour preferable, as simple positional indexing should not change the dtype. But of course we should agree that changing the test is OK here (and pandas currently does/assumes such inference in many places, so this possibly has larger implications).
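The dtype question can be illustrated without the test machinery: the regular Series constructor infers a datetime64 dtype from object datetimes, while constructing with an explicit object dtype (comparable to what the fastpath preserves) keeps the original values. An illustrative sketch:

```python
import datetime

import pandas as pd

vals = [datetime.datetime(2000, 1, 4), datetime.datetime(2000, 1, 6)]

# The default constructor infers a datetime64 dtype from datetime objects ...
inferred = pd.Series(vals)
print(inferred.dtype)  # a datetime64 dtype

# ... while an explicit object dtype keeps the plain datetime.datetime
# values, which is what the fastpath effectively preserves after take.
preserved = pd.Series(vals, dtype=object)
print(preserved.dtype)  # object
assert type(preserved.iloc[0]) is datetime.datetime
```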

I ran the indexing benchmarks:

    before     after       ratio
  [1f890607] [6d2705cd]
+    8.90μs    16.21μs      1.82  period.period_standard_indexing.time_shallow_copy
+  240.70μs   364.02μs      1.51  indexing.DataFrameIndexing.time_iloc_dups
-   62.26ms    51.08ms      0.82  indexing.Int64Indexing.time_getitem_list_like
-    9.44μs     5.02μs      0.53  indexing.DataFrameIndexing.time_get_value_ix
-   76.00μs    38.18μs      0.50  indexing.Int64Indexing.time_iloc_list_like
-    1.76ms   136.69μs      0.08  indexing.Int64Indexing.time_iloc_array
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

So the iloc list_like and array cases improve as expected (the array case even more than I expected). Why the period shallow copy would be much slower is not really clear to me (I don't think I touched code related to shallow_copy). But when I reran the benchmarks, the two tests showing a slowdown were not consistently slower, so it is probably noise.

@jreback
Contributor

jreback commented Mar 31, 2017

@jorisvandenbossche yeah, fastpath=True does not re-infer things, so that seems to be more correct. This just highlights that preserving numpy semantics is hard while also having easy-to-use pandas ones.

If you can separate out this specific case into a separate test (and add a reference to this issue), that would be great.

After you merge this, I think I will create some pandas2 markers in pytest and start adding them here.

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Mar 31, 2017
@jreback
Contributor

jreback commented Apr 1, 2017

@jorisvandenbossche I think Travis was flaky; can you push again?

@jreback
Contributor

jreback commented Apr 1, 2017

lgtm. merge away.

@jreback jreback closed this in a57e681 Apr 1, 2017
@jreback
Contributor

jreback commented Apr 1, 2017

thanks!

@jorisvandenbossche jorisvandenbossche deleted the perf-iloc-list branch October 16, 2017 09:45