
Wall-clock time plots #16

Open
jotaf98 opened this issue Apr 22, 2020 · 4 comments

Comments

@jotaf98

jotaf98 commented Apr 22, 2020

Optimizers can have very different runtimes per iteration, especially 2nd-order ones.

This means that sometimes, despite promises of "faster" convergence, the wall-clock time taken to converge is disappointingly larger.

Is there any chance DeepOBS could implement wall-clock time plots, in addition to per-epoch ones? (E.g. X axis in minutes or hours.)

@fsschneider

Hi,

we thought about this, but decided against it.

First of all, wall-clock measurements can be very noisy and sensitive to the hardware, to other processes running in parallel, etc. Wall-clock timings are very hard to reproduce.

Secondly, it would defeat parts of the purpose of DeepOBS since everyone would need to re-run the baselines. In order to create wall-clock time plots, I would need to run all experiments on the same hardware. So if you come up with a new second-order optimizer, you would need to re-run the baselines of SGD, Adam & co. on your hardware.

However, we fully agree that wall-clock performance is highly relevant. In the end, it is what matters. So DeepOBS provides the option to estimate the per-epoch overhead that one optimizer incurs compared to SGD.

You can check out the estimate_runtime script in the main repo (https://github.com/fsschneider/DeepOBS/blob/master/deepobs/scripts/deepobs_estimate_runtime.py). It basically allows you to estimate how much longer each epoch takes for your optimizer compared to SGD. This should help to mentally transform the "per-epoch" plots into a wall-clock time.
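The idea behind such an overhead estimate can be sketched in a few lines. This is a simplified illustration, not the actual code of `deepobs_estimate_runtime.py`; `train_epoch_sgd` and `train_epoch_mine` are placeholder workloads standing in for real per-epoch training loops:

```python
import time

def epoch_time(train_epoch_fn, n_warmup=1, n_reps=5):
    """Median wall-clock time of one call to train_epoch_fn."""
    for _ in range(n_warmup):              # warm up caches / allocators
        train_epoch_fn()
    times = []
    for _ in range(n_reps):
        t0 = time.perf_counter()
        train_epoch_fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]  # median is robust to noise spikes

# Placeholder workloads standing in for real training epochs:
train_epoch_sgd = lambda: sum(i * i for i in range(50_000))
train_epoch_mine = lambda: sum(i * i for i in range(100_000))

# Overhead factor: how much longer one epoch of "my" optimizer takes vs SGD.
overhead = epoch_time(train_epoch_mine) / epoch_time(train_epoch_sgd)
print(f"per-epoch overhead vs SGD: {overhead:.2f}x")
```

Taking the median over several repetitions is a small concession to the noise problem mentioned above, but it does not remove the hardware dependence.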

You can also check out Section 3.5 of our paper (https://openreview.net/pdf?id=rJg6ssC5Y7).

@jotaf98

jotaf98 commented May 11, 2020

Thanks, the "equivalent wall-clock time" seems like a reasonable compromise!

It seems the only missing step to allow comparing 1st- and 2nd-order solvers is to use this equivalent time as the X-axis instead of iterations or epochs -- or, put another way, to apply each optimizer's multiplier over SGD to its X-axis.

This multiplier will depend on the dataset and architecture, so it would be useful to have your computations of it for more than just an MLP on MNIST.
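Concretely, the rescaling would just multiply each optimizer's epoch axis by its measured overhead factor. A toy sketch, where the 2.5x factor for the second-order method is a made-up number:

```python
def equivalent_wallclock_axis(epochs, overhead_factor):
    """Rescale an epoch axis into SGD-equivalent wall-clock units."""
    return [e * overhead_factor for e in epochs]

epochs = [1, 2, 3, 4]
x_sgd = equivalent_wallclock_axis(epochs, 1.0)            # SGD defines the unit
x_second_order = equivalent_wallclock_axis(epochs, 2.5)   # hypothetical 2.5x/epoch

# Plotting each optimizer's loss curve against its rescaled axis puts all
# optimizers on one comparable (estimated) wall-clock X-axis.
```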

Otherwise, as long as a user of DeepOBS is willing to rerun all the benchmarks on a dedicated machine (as I am!), the timings should be fair and comparable.

@fsschneider

fsschneider commented May 11, 2020

I agree that it would be a nice optional feature. However, it is currently not very high on our to-do list.

I also want to note one thing: there are two different "timings" that are both relevant for deep learning optimizers. The first is the actual wall-clock time right now. This would behave exactly as you described, and the logical step would indeed be to plot all optimizers on a joint wall-clock X-axis.

The second timing could be called "optimal wall-clock time". The deep learning frameworks we currently use are heavily optimized for first-order methods. It is super easy to extract the gradient, but getting the individual gradients is already a huge pain. Well, theoretically this is easy and super cheap, but practically it is quite hard (our group tackled this issue with the package https://backpack.pt/).
In this second case, it would be rather unfair to compare wall-clock time, as it is severely limited by the current frameworks (I would, however, agree that wall-clock time is still interesting, since it describes what a user can expect "right now" rather than in some optimal limit).
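The gap between the summed gradient and individual gradients shows up even in a toy linear model. A NumPy illustration (unrelated to BackPACK's actual implementation): the mean gradient is one matrix-vector product, while the individual gradients are one elementwise product away -- cheap in theory, yet most frameworks only ever expose the sum.

```python
import numpy as np

# Toy linear model: loss_i = 0.5 * (w @ x_i - y_i)**2 per sample.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

residual = X @ w - y                  # shape (n,)

# Mean gradient: one matrix-vector product -- what frameworks are built for.
mean_grad = X.T @ residual / n        # shape (d,)

# Individual per-sample gradients: a single elementwise product here,
# but typically not exposed by autodiff frameworks.
indiv_grads = residual[:, None] * X   # shape (n, d)

assert np.allclose(indiv_grads.mean(axis=0), mean_grad)
```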

This is why we decided on this "hybrid approach": plot using iterations, but always report the "equivalent wall-clock time". Ideally, this would be coupled with a discussion in the paper of the theoretically optimal runtime.

By the way, you can do this "runtime overhead computation" for any test problem you want; MLP on MNIST is just the default. You can specify the test problem, the batch size, and the number of epochs to get a fair evaluation (e.g. some optimizers might have a high initial cost that amortizes over time).

Btw, thanks for this ongoing discussion. I am convinced that DeepOBS is not the optimal solution for benchmarking (but hopefully a big step in the right direction). We are always interested in ways to improve it and discussions like these are exactly the way to achieve this.

@jotaf98

jotaf98 commented May 11, 2020

The second timing could be called "optimal wall-clock time". The deep learning frameworks we currently use are heavily optimized for first-order methods. It is super easy to extract the gradient, but getting the individual gradients is already a huge pain. Well, theoretically this is easy and super cheap, but practically it is quite hard (our group tackled this issue with the package https://backpack.pt/).

This is true -- it affects CurveBall, for example (as noted in the PyTorch port's readme): different frameworks have very different runtimes (PyTorch currently being the worst, unfortunately).

I guess using the "number of passes through the network" as a proxy for runtime could be another option, close to the theoretically-optimal time.
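As a rough sketch of that proxy (the pass counts and the 2x backward-pass weighting below are illustrative assumptions, not measured numbers):

```python
def passes_cost(n_forward, n_backward, backward_weight=2.0):
    """Cost proxy in forward-pass units; a backward pass is commonly
    estimated at roughly 2x the cost of a forward pass."""
    return n_forward + backward_weight * n_backward

sgd_cost = passes_cost(1, 1)   # one forward + one backward per iteration
# Hypothetical 2nd-order step: one extra forward/backward pair, e.g. for
# a Hessian-vector product.
hvp_cost = passes_cost(2, 2)

relative = hvp_cost / sgd_cost  # cost per iteration relative to SGD
```

Being a pure count, this proxy is hardware-independent and close to the "optimal wall-clock" limit, at the price of ignoring framework-specific inefficiencies.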

But for someone who just wants to use optimizers, this distinction can be deeply disappointing when they actually run things. Talking to other practitioners, this is a common complaint, and that's why I'm trying to run DeepOBS with a bit more focus on the timings. I'll see about implementing some of these ideas.

Also, thanks for your group's hard work on backpack, it's really good!

We are always interested in ways to improve it and discussions like these are exactly the way to achieve this.

My pleasure!
