Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternative fix for mul! #34394

Closed
wants to merge 3 commits into from
Closed

alternative fix for mul! #34394

wants to merge 3 commits into from

Conversation

daviehh
Copy link
Contributor

@daviehh daviehh commented Jan 15, 2020

Currently, the MulAddMul is implemented as a callable structure for calculating a short-circuit version of x * a + y * b depending on whether a is 1 or b is 0. The short-circuit is implemented with "value types" on a pair of Booleans (ais1, bis0). This may lead to a performance regression on master, possibly due to constant propagation, details at JuliaLang/LinearAlgebra.jl#684

This is an alternative way to fix JuliaLang/LinearAlgebra.jl#684 by implementing the short-circuit callable structure MulAddMul differently, where the short-circuiting is done in the function body rather than depending on the dispatch system for the value type (ais1, bis0). The 3-argument mul! here might be a bit slower compared with 1.3.1 (~7-10 ns on my testings), not sure how to improve, but this might be less brittle than constant propagation?

test script:

using LinearAlgebra
using BenchmarkTools

ndim = 3
al = .5;
bt = .7im;

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);
@btime mul!($ou, $m1, $m2, $al, $bt);
@btime mul!($ou, $m1, $m2, .5, .7im);

on 1.3.1:
Julia Version 1.3.1
Commit 2d57411* (2019-12-30 21:36 UTC)

35.637 ns (0 allocations: 0 bytes)
350.521 ns (3 allocations: 112 bytes)
54.422 ns (0 allocations: 0 bytes)

on master:
Julia Version 1.5.0-DEV.71
Commit 15d693b (2020-01-15 18:13 UTC)

397.612 ns (1 allocation: 16 bytes)
369.288 ns (3 allocations: 80 bytes)
371.888 ns (3 allocations: 80 bytes)

this PR:
Julia Version 1.5.0-DEV.72
Commit 20f2720 (2020-01-15 20:16 UTC)

45.862 ns (0 allocations: 0 bytes)
59.554 ns (0 allocations: 0 bytes)
61.749 ns (0 allocations: 0 bytes)

@dkarrasch
Copy link
Member

I have wanted to move short-circuiting to the runtime code, but couldn't figure out how to do that with minimal invasion. This is a very elegant way to do that.

I wonder one thing, though: we take care of whether a is one, and avoid multiplication if it is. We don't do that with b, and haven't done that earlier. Is that because this is a special, yet not default case? Would it make sense/be worth it to add this logic here? It would add a second layer of branching, and I guess the one layer you added is responsible for the little slowdown.

Anyway, I think the fact that this handles 5-arg mul! nicely is a great achievement.

@tkf
Copy link
Member

tkf commented Jan 16, 2020

I commented elsewhere #34384 (comment) but these are in type-domain because I wanted to hoist the branchings out of the hot loops. See: #29634 (comment)

@daviehh
Copy link
Contributor Author

daviehh commented Jan 29, 2020

closed in favor of #34384

@daviehh daviehh closed this Jan 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mul! performance regression on master
3 participants