Wall-clock time plots #16
Hi, we thought about this, but decided against it. First of all, measurements of wall-clock time can be very noisy and sensitive to the hardware, other processes running in parallel, etc. It is very hard to reproduce wall-clock timings. Secondly, it would defeat part of the purpose of DeepOBS, since everyone would need to re-run the baselines: in order to create wall-clock time plots, all experiments would have to run on the same hardware. So if you come up with a new second-order optimizer, you would need to re-run the baselines of SGD, Adam & co. on your hardware. However, we totally agree that wall-clock performance is highly relevant; in the end, this is what matters. So with DeepOBS we provide the option to estimate the per-epoch runtime overhead that an optimizer requires compared to SGD. You can also check out section 3.5 of our paper (https://openreview.net/pdf?id=rJg6ssC5Y7).
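To make the idea concrete, here is a minimal, framework-free sketch of estimating a per-epoch overhead factor relative to SGD by timing both on the same workload. This is not the DeepOBS API; all function names and the toy "steps" below are illustrative assumptions.

```python
# Hypothetical sketch (not the DeepOBS API): estimate the per-epoch
# overhead factor of an optimizer relative to SGD by timing both on
# the same workload and taking the ratio of median epoch times.
import time
import statistics

def time_epochs(step_fn, steps_per_epoch=100, epochs=5):
    """Return the median wall-clock time of one epoch of `step_fn`."""
    epoch_times = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _ in range(steps_per_epoch):
            step_fn()
        epoch_times.append(time.perf_counter() - start)
    return statistics.median(epoch_times)

def overhead_factor(optimizer_step, sgd_step, **kwargs):
    """Per-epoch runtime of `optimizer_step` as a multiple of SGD's."""
    return time_epochs(optimizer_step, **kwargs) / time_epochs(sgd_step, **kwargs)

# Toy stand-ins for real training steps: the "second-order" step does
# roughly 3x the work, so its overhead factor should come out above 1.
def sgd_step():
    sum(i * i for i in range(1000))

def second_order_step():
    sum(i * i for i in range(3000))

factor = overhead_factor(second_order_step, sgd_step)
```

Using the median over several epochs makes the estimate somewhat robust to the timing noise mentioned above, but the resulting factor is of course still specific to the machine it was measured on.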
Thanks, the "equivalent wall-clock time" seems like a reasonable compromise! It seems to be missing just the next logical step for comparing first- and second-order solvers: using this equivalent time as the X-axis instead of iterations or epochs, or, put another way, applying the multiplier over SGD to the X-axis of each optimizer. This multiplier will depend on the dataset and architecture, so it would be useful to have your computations of it for more than just an MLP on MNIST. Otherwise, as long as a user of DeepOBS is willing to re-run all the benchmarks on a dedicated machine (as I am!), the timings should be fair and comparable.
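The rescaling being proposed can be sketched in a few lines: each optimizer's iteration axis is stretched by its overhead multiplier over SGD, so all curves share an "equivalent wall-clock" X-axis. The function name and the example numbers are illustrative assumptions, not DeepOBS code.

```python
# Sketch of the proposed rescaling: map each optimizer's iteration axis
# to an "equivalent wall-clock" axis by multiplying with its overhead
# factor relative to SGD. All names and numbers here are illustrative.
def equivalent_time_axis(iterations, overhead_factor, sgd_sec_per_iter):
    """Convert iteration counts into equivalent seconds of wall-clock time."""
    return [it * overhead_factor * sgd_sec_per_iter for it in iterations]

# Example: assume SGD takes 0.1 s/iter and a second-order method costs
# 2.5x as much per iteration. The same iteration counts then map to
# different positions on the shared time axis.
sgd_axis = equivalent_time_axis(range(0, 300, 100), 1.0, 0.1)  # [0.0, 10.0, 20.0]
so_axis = equivalent_time_axis(range(0, 300, 100), 2.5, 0.1)   # [0.0, 25.0, 50.0]
```

Plotting loss curves against these rescaled axes (instead of raw iterations) would make a "faster per iteration but slower per step" optimizer visibly comparable to SGD.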
I agree that it would be a nice optional feature. However, it is currently not very high on our to-do list.

I also want to note one thing: there are two different "timings" that are both relevant for deep learning optimizers. The first is the actual wall-clock time right now. This would behave exactly as you mentioned, and the logical step would indeed be to plot all optimizers on a joint "wall-clock X-axis". The second timing could be called "optimal wall-clock time". The deep learning frameworks we currently use are heavily optimized for first-order methods: it is super easy to extract the gradient, but getting the individual gradients is already a huge pain. Theoretically this is easy and super cheap, but practically it is quite hard (our group tackled this issue with the package https://backpack.pt/). This is why we decided to go for a "hybrid approach": plot using iterations, but always report the "equivalent wall-clock time". Ideally, this would be coupled with a discussion in the paper of the theoretically optimal run-time.

By the way, you can do this "runtime overhead computation" for any test problem you want; MLP on MNIST is just the default. You can specify the test problem, the batch size and the number of epochs in order to get a fair evaluation (e.g. some optimizer might have a high initial cost that amortizes over time).

Btw, thanks for this ongoing discussion. I am convinced that DeepOBS is not the optimal solution for benchmarking (but hopefully a big step in the right direction). We are always interested in ways to improve it, and discussions like these are exactly the way to achieve that.
This is true; this affects, for example, CurveBall (as can be seen in the note in the PyTorch port's readme) -- different frameworks have very different runtimes (PyTorch currently being the worst, unfortunately). I guess using the "number of passes through the network" as a proxy for runtime could be another option, close to the theoretically optimal time. But for someone who just wants to use optimizers, this distinction can be deeply disappointing when they actually run it. Talking to other practitioners, this is a common complaint, and that's why I'm trying to run DeepOBS with a bit more focus on the timings; I'll see about implementing some of these ideas. Also, thanks for your group's hard work on backpack, it's really good!
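The "number of passes" proxy could look something like the following: charge each optimizer the forward/backward passes its update rule needs per step, independent of hardware or framework. The per-optimizer pass counts here are made-up examples, not measurements of any real optimizer.

```python
# Sketch of "number of passes through the network" as a
# hardware-independent runtime proxy. The pass counts per step are
# illustrative assumptions only.
PASSES_PER_STEP = {
    "sgd": {"forward": 1, "backward": 1},
    # e.g. a method using an extra Hessian-vector product or a line
    # search might require additional passes per step:
    "second_order": {"forward": 2, "backward": 2},
}

def total_passes(optimizer, steps):
    """Total network passes charged to `optimizer` over `steps` updates."""
    counts = PASSES_PER_STEP[optimizer]
    return steps * (counts["forward"] + counts["backward"])
```

This stays close to the "theoretically optimal" timing discussed above, since it ignores how well (or badly) a particular framework implements each pass.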
My pleasure!
Optimizers can have very different runtimes per iteration, especially 2nd-order ones.
This means that sometimes, despite promises of "faster" convergence, the wall-clock time taken to converge is disappointingly larger.
Is there any chance DeepOBS could implement wall-clock time plots, in addition to per-epoch ones? (E.g. X axis in minutes or hours.)