Use precomputed lookup table in Color32::from_rgba_unmultiplied #5088

Merged: 7 commits into emilk:master on Sep 10, 2024

Conversation

@YgorSouza (Contributor) commented on Sep 7, 2024:

Improves performance significantly (about 40 times) according to the benchmarks.

@YgorSouza (Contributor, Author) commented on Sep 7, 2024:

Before the lookup table:

from_rgba_unmultiplied_0
                        time:   [252.79 ps 254.87 ps 257.04 ps]

from_rgba_unmultiplied_100
                        time:   [64.423 ns 64.762 ns 65.145 ns]

from_rgba_unmultiplied_255
                        time:   [788.93 ps 792.09 ps 795.81 ps]

After the lookup table:

from_rgba_unmultiplied_0
                        time:   [278.30 ps 283.50 ps 289.02 ps]
                        change: [+12.801% +15.804% +18.386%] (p = 0.00 < 0.05)
                        Performance has regressed.

from_rgba_unmultiplied_100
                        time:   [1.5295 ns 1.5319 ns 1.5343 ns]
                        change: [-97.622% -97.613% -97.604%] (p = 0.00 < 0.05)
                        Performance has improved.

from_rgba_unmultiplied_255
                        time:   [724.31 ps 725.98 ps 727.53 ps]
                        change: [-8.9849% -8.6971% -8.3678%] (p = 0.00 < 0.05)
                        Performance has improved.

Edit: updated benchmarks after fixing black_box mistakes.

@YgorSouza (Contributor, Author) commented:

Note that this will increase the binary size, of course, which is bad for WASM in particular. The alternative would be to build the lookup table at runtime, as suggested in the issue. The downsides are that it needs a dependency (once_cell, which is already used in a lot of places and can be removed once the MSRV moves to 1.80) and that the first call would be very slow.

Yet another possibility would be to build a lookup table only for linear_f32_from_gamma_u8 instead, which would be only 1KB, and could be used both ways (with partition_point). powf seems to be the bottleneck according to the profile posted in the issue, so maybe that would be good enough. I'm going to give it a try.
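
A rough sketch of that idea (illustrative only, not the code in this PR; the table parameter stands in for the hypothetical precomputed [f32; 256] gamma-to-linear array, and exact rounding to the nearest entry would need more care than this):

fn gamma_u8_from_linear_f32_via_table(table: &[f32; 256], l: f32) -> u8 {
    let l = l.clamp(0.0, 1.0);
    // The table is monotonically increasing, so partition_point can binary-search
    // for the first entry that reaches `l`.
    let idx = table.partition_point(|&v| v < l);
    idx.min(255) as u8
}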

@YgorSouza marked this pull request as draft on September 8, 2024, 05:38.
@veyh commented on Sep 8, 2024:

Isn't building the lookup table on the fly pretty much the same as (currently) displaying a 256x256 image with transparency? I would think it's not really that slow.

Alternatively, the 1KB table could be used for WASM and the 64KB table for other platforms? (If there is enough of a difference between the two.)

@YgorSouza (Contributor, Author) commented:

> Yet another possibility would be to build a lookup table only for linear_f32_from_gamma_u8 instead, which would be only 1KB, and could be used both ways (with partition_point). powf seems to be the bottleneck according to the profile posted in the issue, so maybe that would be good enough. I'm going to give it a try.

So I tried reverting the previous change and adding this:

#[inline(always)]
pub fn linear_f32_from_gamma_u8(s: u8) -> f32 {
    lookup::LINEAR_F32_FROM_GAMMA_U8[usize::from(s)]
}

#[inline(always)]
pub fn gamma_u8_from_linear_f32(l: f32) -> u8 {
    // The -1e-4 fudge absorbs rounding errors in the table entries.
    let threshold = (l - 1e-4).clamp(0.0, 1.0);
    // The gamma -> linear curve lies below the identity line, so the answer
    // is at least threshold * 255; start the search there.
    let lower_bound = (threshold * 255.0) as u8;
    // Linear search for the first table entry that reaches the threshold.
    (lower_bound..u8::MAX)
        .find(|&s| linear_f32_from_gamma_u8(s) >= threshold)
        .unwrap_or(u8::MAX)
}

Here lookup::LINEAR_F32_FROM_GAMMA_U8 is a [f32; 256] array that I generated and pasted directly in the code. The -1e-4 offset is needed for the unit tests to pass because of rounding errors. I got this result on the bench:

from_rgba_unmultiplied_0
                        time:   [127.57 ps 127.86 ps 128.14 ps]

from_rgba_unmultiplied_100
                        time:   [15.520 ns 15.545 ns 15.570 ns]

from_rgba_unmultiplied_255
                        time:   [134.04 ps 134.38 ps 134.68 ps]

That is only about 4 times faster than the current implementation on master, and a lot slower than the version with the big lookup table that is currently in this PR. I tried using partition_point instead of a linear search, but it was even slower. I'm going to see if I can come up with something better; otherwise I'll leave it as it is currently, or try the once_cell approach.
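
For reference, the pasted [f32; 256] table can be generated with a small one-off along these lines (a sketch assuming the standard sRGB decode curve; ecolor's actual linear_f32_from_gamma_u8 constants may differ slightly):

fn print_linear_from_gamma_table() {
    print!("pub const LINEAR_F32_FROM_GAMMA_U8: [f32; 256] = [");
    for s in 0u32..256 {
        let x = s as f32 / 255.0;
        // Standard sRGB gamma -> linear decode.
        let lin = if x <= 0.04045 {
            x / 12.92
        } else {
            ((x + 0.055) / 1.055).powf(2.4)
        };
        print!("{lin:?}, ");
    }
    println!("];");
}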

Also, I realized that my benchmarks aren't right: I'm not using black_box on the inputs, which allows the compiler to do more work at compile time than it's supposed to. I'm going to fix that and edit all the benchmarks I've posted so far with the corrected values.
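
With black_box on the inputs, the benchmark looks roughly like this (a sketch only; the actual benchmark in crates/epaint/benches/benchmark.rs may be structured differently, and the input generation here is just illustrative):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ecolor::Color32;

fn bench_from_rgba_unmultiplied(c: &mut Criterion) {
    c.bench_function("from_rgba_unmultiplied_other", |b| {
        b.iter(|| {
            for i in 0u32..1000 {
                // black_box hides the inputs from the optimizer and keeps the
                // result from being discarded.
                let [r, g, bl, a] = black_box(i.wrapping_mul(0x9E37_79B9).to_ne_bytes());
                black_box(Color32::from_rgba_unmultiplied(r, g, bl, a));
            }
        });
    });
}

criterion_group!(benches, bench_from_rgba_unmultiplied);
criterion_main!(benches);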

@YgorSouza (Contributor, Author) commented:

I added the once_cell version. The benchmark went up to 2.9 ns (from 1.5 ns with the array in program memory); I guess dereferencing the Lazy isn't completely free. It's still 20 times faster than the current implementation. I left it as a separate commit so it can easily be reverted depending on which tradeoff is preferred. And I don't think I'm going to bother with the f32 lookup table for the gamma value I mentioned above; it's too finicky, and I couldn't get it to perform well enough to be worth the trouble.

Inline review comment on this part of the diff:

let b = gamma_u8_from_linear_f32(b_lin * a_lin);

Self::from_rgba_premultiplied(r, g, b, a)
match a {

Might be worth checking if skipping the match for the 0/255 cases is actually faster. At least on my machine, it is.

With match

from_rgba_unmultiplied_100
                        time:   [2.0122 ns 2.0443 ns 2.0813 ns]

Just using the table

from_rgba_unmultiplied_100
                        time:   [1.6584 ns 1.6675 ns 1.6772 ns]
                        change: [-18.696% -17.776% -16.940%] (p = 0.00 < 0.05)
                        Performance has improved.
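
For reference, "just using the table" means something like this branch-free version (a sketch only; the alpha-major [u8; 256 * 256] table layout is an assumption, not necessarily what the PR uses):

fn from_rgba_unmultiplied_table_only(
    r: u8, g: u8, b: u8, a: u8,
    lut: &[u8; 256 * 256],
) -> [u8; 4] {
    // No 0/255 fast paths: the table entries already handle those alphas correctly.
    let premul = |c: u8| lut[usize::from(a) * 256 + usize::from(c)];
    [premul(r), premul(g), premul(b), a]
}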

@YgorSouza (Contributor, Author) replied:

Yes, but the common cases become slower, so I decided to just leave it as it was before.

Reply:

Yeah, that is true.

@YgorSouza marked this pull request as ready for review on September 8, 2024, 18:54.
@emilk (Owner) left a review comment:

Thanks for working on this! I think it is worth avoiding adding 64kB to the .wasm size.

I think you should be able to replace once_cell with std::cell::OnceCell

Review comments on crates/ecolor/src/color32.rs and crates/epaint/benches/benchmark.rs (now outdated and resolved).
Commit message notes from this PR:
- It was noted that it is fairly slow, so this benchmark is added to evaluate performance gains.
- Using 1000 values so we can more easily infer the time for a single call.
- This is about 50 times faster according to the benchmarks, at the expense of storing a 65 KB lookup table in the program memory.
- So we don't bloat the binary size, which would be a problem especially for WASM.
- It is not useful anywhere else, so there is no need for it to be outside.
- The table is built at runtime and used locally, so the endianness will always be the same on both sides.
@YgorSouza (Contributor, Author) commented:

Updated benchmarks:

(Note that this now refers to the time to call the function 1000 times with different values)

Without LUT

from_rgba_unmultiplied_0
                        time:   [416.57 ns 419.80 ns 422.97 ns]

from_rgba_unmultiplied_other
                        time:   [75.650 µs 75.769 µs 75.856 µs]

from_rgba_unmultiplied_255
                        time:   [1.0988 µs 1.1004 µs 1.1020 µs]

With LUT

from_rgba_unmultiplied_0
                        time:   [501.18 ns 501.74 ns 502.29 ns]
                        change: [+20.148% +21.106% +21.946%] (p = 0.00 < 0.05)
                        Performance has regressed.

from_rgba_unmultiplied_other
                        time:   [1.6264 µs 1.6273 µs 1.6281 µs]
                        change: [-97.868% -97.856% -97.846%] (p = 0.00 < 0.05)
                        Performance has improved.

from_rgba_unmultiplied_255
                        time:   [1.1058 µs 1.1067 µs 1.1075 µs]
                        change: [+0.2530% +0.4257% +0.5968%] (p = 0.00 < 0.05)
                        Change within noise threshold.

The 0 case is slightly slower now, not sure why. But it is still way faster than the others.

@YgorSouza (Contributor, Author) commented:

> Thanks for working on this! I think it is worth avoiding adding 64kB to the .wasm size.
>
> I think you should be able to replace once_cell with std::cell::OnceCell

OnceCell can't be used in a static (it isn't thread-safe), but I used OnceLock instead. It's actually twice as fast as the once_cell crate, too.
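
For reference, a minimal sketch of that shape (the identifiers and the alpha-major 256 * 256 table layout are assumptions, not the exact code in this PR; the gamma conversion functions are the existing ecolor ones):

use ecolor::{gamma_u8_from_linear_f32, linear_f32_from_gamma_u8};
use std::sync::OnceLock;

fn premultiply_lut() -> &'static [u8; 256 * 256] {
    static LUT: OnceLock<[u8; 256 * 256]> = OnceLock::new();
    LUT.get_or_init(|| {
        // Built once, on first use, at runtime: roughly 64 KB of RAM instead
        // of 64 KB of binary size.
        let mut table = [0_u8; 256 * 256];
        for a in 0..=255_u8 {
            let a_lin = a as f32 / 255.0; // alpha is assumed to be linear here
            for c in 0..=255_u8 {
                // Premultiply in linear space, then encode back to gamma space.
                let c_lin = linear_f32_from_gamma_u8(c);
                table[usize::from(a) * 256 + usize::from(c)] =
                    gamma_u8_from_linear_f32(c_lin * a_lin);
            }
        }
        table
    })
}

// Usage inside from_rgba_unmultiplied would then be one indexed load per channel, e.g.:
// let r = premultiply_lut()[usize::from(a) * 256 + usize::from(r)];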

@emilk added the labels performance (Lower CPU/GPU usage (optimize)) and epaint on Sep 10, 2024.
@emilk merged commit f897405 into emilk:master on Sep 10, 2024 (21 of 22 checks passed).
hacknus pushed a commit to hacknus/egui that referenced this pull request on Oct 30, 2024:

Use precomputed lookup table in Color32::from_rgba_unmultiplied (emilk#5088)

Improves performance significantly (about 40 times) according to the benchmarks.

* Closes <emilk#5086>
* [x] I have followed the instructions in the PR template