
PERF: Some asv tests take a long time to setup #16803

Closed
TomAugspurger opened this issue Jun 30, 2017 · 11 comments
Labels
Benchmark (Performance (ASV) benchmarks), Performance (Memory or execution speed performance)

Comments

@TomAugspurger
Contributor

TomAugspurger commented Jun 30, 2017

I hacked asv to log the total execution time, including setup, for each benchmark. Some of these are parametrized over several cases, so they may not actually be slow per case. Times are formatted as hours:minutes:seconds; the first column is the row index into the full CSV linked below.

|   # | time            | file       | klass                              | method                  |
|----:|-----------------|------------|------------------------------------|-------------------------|
|  68 | 00:03:11.379267 | frame_ctor | FrameConstructorDTIndexFromOffsets | time_frame_ctor         |
| 394 | 00:01:01.671905 | inference  | to_numeric_downcast                | time_downcast           |
| 199 | 00:00:46.330012 | groupby    | GroupBySuite                       | time_describe           |
| 559 | 00:00:24.698904 | replace    | replace_large_dict                 | time_replace_large_dict |
| 204 | 00:00:24.212386 | groupby    | GroupBySuite                       | time_mad                |
| 210 | 00:00:22.481497 | groupby    | GroupBySuite                       | time_pct_change         |
| 215 | 00:00:18.909368 | groupby    | GroupBySuite                       | time_skew               |
| 200 | 00:00:18.732072 | groupby    | GroupBySuite                       | time_diff               |
| 212 | 00:00:18.317290 | groupby    | GroupBySuite                       | time_rank               |
| 219 | 00:00:16.845357 | groupby    | GroupBySuite                       | time_unique             |

Ideally we could optimize the setup time on these as well. We could maybe modify the benchmarks to do less work / run faster, but I'd like to avoid that if possible.

Link to the full CSV: https://gist.github.com/9d80aa45750224d7453863f2f754160d

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Jun 30, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 30, 2017
@jorisvandenbossche
Member

One of the things I have thought about before is that we can probably do less in the setup.
I mean, is it necessary to create the DataFrames in setup (which gets run many times), or would it be sufficient to create them once in the file, as long as the benchmarks don't modify them?

E.g. in https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/groupby.py#L399 we could do the actual creation of the frame outside setup, and in setup only pick the right one based on the parameters.
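
A minimal sketch of that idea (class, column, and size names here are hypothetical, not the actual asv_bench code). asv imports the benchmark module once per process but calls setup() before every repeat, so moving construction to module level amortizes the cost:

```python
import numpy as np
import pandas as pd

# Frames are built once, at module import time, rather than in setup().
_ngroups = [100, 10000]
_size = 100_000

_frames = {
    n: pd.DataFrame({
        "key": np.random.randint(0, n, size=_size),
        "value": np.random.randn(_size),
    })
    for n in _ngroups
}


class GroupByDescribe:
    params = _ngroups
    param_names = ["ngroups"]

    def setup(self, ngroups):
        # Cheap: just pick the pre-built frame for this parameter.
        self.df = _frames[ngroups]

    def time_describe(self, ngroups):
        self.df.groupby("key").describe()
```

This is only safe for benchmarks that treat the shared frames as read-only; anything that mutates its data still needs a fresh copy in setup.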

@TomAugspurger TomAugspurger changed the title PERF: Some vbench tests take a long time to setup PERF: Some asv tests take a long time to setup Jun 30, 2017
@TomAugspurger
Contributor Author

TomAugspurger commented Jun 30, 2017 via email

@jreback
Contributor

jreback commented Jul 1, 2017

Is this actually a big deal?

@jbrockmendel
Member

Is there a measure of the relative importance of these benchmarks? That is, if I find improvements in subset A and slowdowns in subset B, when can this count as a win?

@jorisvandenbossche
Member

@jbrockmendel there is no 'measure' for that, apart from common sense.

But note that this issue is not about that; it is about the fact that some tests take a long time to run (not because pandas is too slow, but because of the setup of the test).

@jbrockmendel
Member

For some of these, e.g. reindex.Reindexing, there is a lot of unnecessary setup being done. df2 is constructed for every call of all three benchmarks, but is only used in one. Same for s1 and s2. It isn't obvious how much mileage can be gained by trimming these down though.
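
A hypothetical sketch of that trimming: instead of one class whose setup builds df, df2, s1, and s2 for every benchmark, each benchmark class builds only what it uses. The class names, sizes, and operations below are illustrative, not the actual Reindexing code:

```python
import numpy as np
import pandas as pd


class ReindexFrame:
    def setup(self):
        # Only the frame and target index this benchmark actually uses.
        self.df = pd.DataFrame(np.random.randn(100_000, 10))
        self.idx = np.arange(40_000, 70_000)

    def time_reindex(self):
        self.df.reindex(self.idx)


class ReindexSeriesPad:
    def setup(self):
        # Built here, not in a shared setup that every benchmark pays for.
        index = pd.date_range("2000-01-01", periods=50_000, freq="min")
        self.s = pd.Series(np.random.randn(len(index)), index=index)
        self.coarse = index[::25]

    def time_reindex_pad(self):
        self.s.reindex(self.coarse, method="pad")
```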

@jbrockmendel
Member

@TomAugspurger can you post the hack you used to measure setup time? I'm poking at this instead of doing real work.

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 14, 2017

Happy to encourage productive procrastination :)

This is the diff in my local copy (which is based on a pretty old commit now):

```diff
diff --git a/asv/benchmarks.py b/asv/benchmarks.py
index 60442b9..b3b96ba 100644
--- a/asv/benchmarks.py
+++ b/asv/benchmarks.py
@@ -89,6 +89,8 @@ def run_benchmark(benchmark, root, env, show_stderr=False,
 
         - `errcode`: Error return code.
     """
+    import time
+    t0 = time.time()
     name = benchmark['name']
     result = {'stderr': '', 'profile': None, 'errcode': 0}
     bench_results = []
@@ -230,6 +232,10 @@ def run_benchmark(benchmark, root, env, show_stderr=False,
             with log.indent():
                 log.error(result['stderr'])
 
+        t1 = time.time()
+        line = '{},{}\n'.format(name, t1 - t0)
+        with open("log.csv", "a") as f:
+            f.write(line)
         return result
 
 
@@ -535,7 +541,6 @@ class Benchmarks(dict):
               and be a byte string containing the cProfile data.
         """
         log.info("Benchmarking {0}".format(env.name))
-
         with log.indent():
            benchmarks = sorted(list(six.iteritems(self)))
```

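For anyone reusing this: the patch appends one `name,elapsed-seconds` row per benchmark to log.csv, so ranking the slowest ones afterwards is a couple of lines (a sketch, assuming that two-column, headerless format):

```python
import pandas as pd

# Each row of log.csv is "benchmark_name,elapsed_seconds", no header.
log = pd.read_csv("log.csv", names=["name", "seconds"])
print(log.sort_values("seconds", ascending=False).head(10))
```
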
@jbrockmendel
Member

Thanks, worked like a charm. What didn't work was my attempt to short-cut setup calls by defining attributes at the class level.

@jorisvandenbossche
Member

> Thanks, worked like a charm. What didn't work was my attempt to short-cut setup calls by defining attributes at the class level.

I think using 'global' data does work, i.e. data defined at the module level (as long as it is only used by benchmarks that don't modify it). In some cases (if the module is not too long, and something generic can be reused) that approach can be used, I think.
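
A small sketch of that module-level approach, with hypothetical names: one generic frame built once at import time and shared by several benchmark classes that only read it:

```python
import numpy as np
import pandas as pd

# One generic frame, built once at import and shared by every class in
# the module. Safe only because these benchmarks never mutate it.
_shared = pd.DataFrame({
    "key": np.random.randint(0, 100, size=100_000),
    "value": np.random.randn(100_000),
})


class FrameSum:
    def time_sum(self):
        _shared["value"].sum()


class GroupBySum:
    def time_groupby_sum(self):
        _shared.groupby("key")["value"].sum()
```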

@mroeschke mroeschke added the Benchmark Performance (ASV) benchmarks label Feb 4, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Member

I think this has been superseded by #44450, so closing.
