-
-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't broadcast out-of-place #235
Comments
@rveltz this should give similar speedups on ArrayFire arrays. Do you have any test equations you'd like us to run? |
BTW @Datseris, this makes static array paths finally faster. f2(t,u,du) = (du.=α*u)
g2(t,u,du) = (du.=β*u)
u0 = [2.0,3.0]
prob = ODEProblem(f2,u0,(0.0,1.0))
@btime sol1 = solve(prob,BS3(),dt=dt,adaptive=false) gives 24.005 μs (281 allocations: 20.55 KiB) |
Let me know when this is operational in a stable version and then I will benchmark with "real" multidimensional systems. The difference of 5 μs is not big enough at the moment to make it compelling, plus the in-place version is always scalable: the performance will be "nice" regardless of dimensionallity. Can the same be said about static arrays? Normally you would use static arrays and expect to get orders of magnitude better performance for small dimensionality. Here it is not even close to a single order of magnitude and the dimension is a mere 2, the lowest possible. P.S.: I was always thinking that there is so much crazy amount of work put into supporting both in-place and out of place versions, but never really understood why,... Why even bother to support the out-of-place? |
No, static arrays will not scale. That's their limitation.
orders? No. Maybe an order of magnitude, but if that's true then either the mutable code could be faster or it's using a static array fast path. Static array fast paths, like the fast QR, is so nice when you get it though!
Well, it's really not that much work since it's already written. Probably <5% of my time is spent working on them. Also, they are easier to get working, so when I am developing a new algorithm I always write the out of place version first, so why not keep it? They are more efficient when applicable, and they can be easier for users to do some things efficiently. For example, you don't have to use |
Fair enough. I do have a comment about the speedup. You can indeed get more than one order in the case of 2 and 3 dimensions, this is what I realized when working with discrete systems. Of course at my use case that was because of the absurdly fast QR method native for 2 and 3D static matrices, which made my function crazy fast. But its probably fair to say that expecting more than one order in a differential equation setting is asking too much. EDIT: Let me know though when this is in a stable release, and I will test it just so that I can help the ecosystem. |
Yes, for a non-stiff diffeq solver it's literally the difference of @. tmp = a1*k1 + k2*k2 with vectors vs tmp = a1*k1 + a2*k2 with static arrays, and both are quite optimized in these cases. It's the "harder parts", which would only show up in stiff solvers (because of linear solvers, that difference in the QR is huge!) or in the user's |
Done here and StochasticDiffEq.jl. Wasn't bad at all 👍 |
It seems that broadcast is harmful to most uses of the out-of-place format. It drastically slows down things like static arrays and ArrayFire arrays that are made to be used out-of-place. Since we should prioritize these for this path, it makes more sense for this path to be non-broadcasting. The reason we put broadcasting there in the first place was for fusion, but fusion is only required if one thinks that temporary allocations are bad for their type, which is opposite of the assumption that the out-of-place branch should be making. If one needs broadcasting, they should be using the route for mutables anyways.
This reverts some hard work by @devmotion, we probably should've checked more into this first. I'm sorry about that but still appreciate your hard work. But the net result is from
to
so it's huge for
and puts us on similar grounds with BridgeDiffEq.jl which was made specifically for fixed timestep static arrays, so this is the "true optimization" which we were missing. (
16.979 μs (698 allocations: 20.98 KiB)
for Bridge. Most of the difference in the profiling is now just building the integrator type and other small setup bits which are a constant factor). It also evades #106 for out-of-place which will cleanup our code quite a bit. I'll get to work on getting this package-wide changed.The text was updated successfully, but these errors were encountered: