Use precomputed lookup table in Color32::from_rgba_unmultiplied #5088
Before the lookup table:

After the lookup table:

Edit: updated benchmarks after fixing

Note that this will increase the binary size, of course, which is bad for WASM in particular. The alternative would be to build the lookup table at runtime, as suggested in the issue. The downsides are that it needs a dependency (…). Yet another possibility would be to build a lookup table only for
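For context, the full-size table amounts to one precomputed output byte per (channel value, alpha) pair, so premultiplication becomes a single indexed load. A rough, self-contained sketch of that idea — the function names and the exact conversion curve here are illustrative, not egui's actual code, and alpha is treated as linear for simplicity:

```rust
// Illustrative sRGB-style transfer functions (not necessarily egui's exact curve).
fn linear_from_gamma(g: f32) -> f32 {
    if g <= 0.04045 { g / 12.92 } else { ((g + 0.055) / 1.055).powf(2.4) }
}
fn gamma_from_linear(l: f32) -> f32 {
    if l <= 0.0031308 { 12.92 * l } else { 1.055 * l.powf(1.0 / 2.4) - 0.055 }
}

// 64 KB table: premultiplied gamma-space byte for every (channel, alpha) pair.
fn build_table() -> Vec<u8> {
    let mut t = vec![0u8; 256 * 256];
    for c in 0..256 {
        for a in 0..256 {
            // Multiply in linear space, then convert back to gamma space.
            let lin = linear_from_gamma(c as f32 / 255.0) * (a as f32 / 255.0);
            t[c * 256 + a] = (gamma_from_linear(lin) * 255.0 + 0.5) as u8;
        }
    }
    t
}

// With the table built, premultiplying one channel is a single lookup.
fn premultiply(table: &[u8], c: u8, a: u8) -> u8 {
    table[usize::from(c) * 256 + usize::from(a)]
}
```

This trades the per-call float math for memory: 256 × 256 = 65 536 bytes, which is exactly the binary-size concern raised above.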
Isn't building the lookup table on the fly pretty much the same as (currently) displaying a 256x256 image with transparency? I would think it's not really that slow. Alternatively, the 1KB table could be used for WASM and the 64KB table for other platforms? (If there is enough of a difference between the two.)
So I tried reverting the previous change and adding this:

```rust
#[inline(always)]
pub fn linear_f32_from_gamma_u8(s: u8) -> f32 {
    lookup::LINEAR_F32_FROM_GAMMA_U8[usize::from(s)]
}

#[inline(always)]
pub fn gamma_u8_from_linear_f32(l: f32) -> u8 {
    let threshold = (l - 1e-4).clamp(0.0, 1.0);
    let lower_bound = (threshold * 255.0) as u8;
    (lower_bound..u8::MAX)
        .find(|&s| linear_f32_from_gamma_u8(s) >= threshold)
        .unwrap_or(u8::MAX)
}
```

With a
Which is only about 4 times faster than the current implementation on master, and a lot slower than the version with the big lookup table that is currently in this PR. I tried to use

Also, I realized that my benchmarks aren't right, because I'm not using
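Since the gamma table is monotonically increasing, the linear `find` in the snippet above could in principle be replaced by a binary search via `slice::partition_point`. A sketch of that variant — the table here is a local stand-in for `LINEAR_F32_FROM_GAMMA_U8`, filled with an illustrative sRGB-style curve rather than egui's actual one:

```rust
// Stand-in curve used only to populate the illustrative table below.
fn linear_from_gamma(g: f32) -> f32 {
    if g <= 0.04045 { g / 12.92 } else { ((g + 0.055) / 1.055).powf(2.4) }
}

fn gamma_u8_from_linear_f32(table: &[f32; 256], l: f32) -> u8 {
    let threshold = (l - 1e-4).clamp(0.0, 1.0);
    // First index whose entry is >= threshold. Because the table is sorted,
    // this matches the linear scan's result in O(log n) comparisons.
    table.partition_point(|&v| v < threshold).min(255) as u8
}
```

Whether this actually beats the bounded linear scan is an empirical question: the `lower_bound` estimate already starts the scan very close to the answer, so the binary search may not win in practice.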
(force-pushed from 7a4916f to 2a3c1e1)
I added the
```rust
let b = gamma_u8_from_linear_f32(b_lin * a_lin);
Self::from_rgba_premultiplied(r, g, b, a)
match a {
```
Might be worth checking if skipping the match for the 0/255 cases is actually faster. At least on my machine, it is.
With match:

```
from_rgba_unmultiplied_100
                        time:   [2.0122 ns 2.0443 ns 2.0813 ns]
```

Just using the table:

```
from_rgba_unmultiplied_100
                        time:   [1.6584 ns 1.6675 ns 1.6772 ns]
                        change: [-18.696% -17.776% -16.940%] (p = 0.00 < 0.05)
                        Performance has improved.
```
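The two shapes being compared can be sketched as follows. This is a std-only stand-in, not egui's code: the general case multiplies in gamma space with plain floats instead of using the PR's precomputed table, and the names are made up for illustration.

```rust
// Stand-in for the table lookup: multiply one channel by alpha in gamma space.
fn mul_alpha(c: u8, a: u8) -> u8 {
    (c as f32 * a as f32 / 255.0 + 0.5) as u8
}

// Variant with the `match`: fast paths for fully transparent / fully opaque.
fn with_match(r: u8, g: u8, b: u8, a: u8) -> [u8; 4] {
    match a {
        0 => [0, 0, 0, 0],
        255 => [r, g, b, 255],
        a => [mul_alpha(r, a), mul_alpha(g, a), mul_alpha(b, a), a],
    }
}

// Variant that always takes the general path. With uniformly random alpha
// (as in the benchmark above), the branch is unpredictable, which is one
// plausible reason the branch-free version measures faster here.
fn without_match(r: u8, g: u8, b: u8, a: u8) -> [u8; 4] {
    [mul_alpha(r, a), mul_alpha(g, a), mul_alpha(b, a), a]
}
```

Both variants agree on the 0 and 255 edge cases; the `match` only changes how fast those cases are reached, which is exactly the trade-off discussed below.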
Yes, but the common cases become slower, so I decided to just leave it as it was before.
Yeah, that is true.
Thanks for working on this! I think it is worth avoiding adding 64kB to the .wasm size.

I think you should be able to replace `once_cell` with `std::cell::OnceCell`.
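One wrinkle with that suggestion: a `static` can't hold a `std::cell::OnceCell` directly, because `OnceCell` is not `Sync`; the std type that works for a lazily built global table is `std::sync::OnceLock` (stable since Rust 1.70). A sketch of the runtime-built 1 KB table along those lines — names and the conversion curve are illustrative, not egui's actual code:

```rust
use std::sync::OnceLock;

// Hypothetical 256-entry (1 KB) gamma -> linear table, built on first use
// instead of being baked into the binary.
static GAMMA_TO_LINEAR: OnceLock<[f32; 256]> = OnceLock::new();

fn linear_f32_from_gamma_u8(s: u8) -> f32 {
    let table = GAMMA_TO_LINEAR.get_or_init(|| {
        std::array::from_fn(|i| {
            let g = i as f32 / 255.0;
            // Illustrative sRGB-style curve, evaluated only while filling the table.
            if g <= 0.04045 { g / 12.92 } else { ((g + 0.055) / 1.055).powf(2.4) }
        })
    });
    table[usize::from(s)]
}
```

After the first call pays the initialization cost, every subsequent call is a single bounds-check-free-able array index, with no extra dependency in Cargo.toml.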
It was noted that it is fairly slow, so this benchmark is added to evaluate performance gains.
Using 1000 values so we can more easily infer the time for a single call.
This is about 50 times faster according to the benchmarks, at the expense of storing a 65KB lookup table in the program memory.
So we don't bloat the binary size, which would be a problem especially for WASM.
It is not useful anywhere else, so there is no need for it to be outside.
The table is built at runtime and used locally, so the endianness will always be the same on both sides.
(force-pushed from cdefc55 to 984a8ce)
Updated benchmarks (note that this now refers to the time to call the function 1000 times with different values):

Without LUT:

With LUT:

The 0 case is slightly slower now, not sure why. But it is still way faster than the others.
…k#5088)

Improves performance significantly (about 40 times) according to the benchmarks.

* Closes <emilk#5086>
* [x] I have followed the instructions in the PR template