Use precomputed lookup table in Color32::from_rgba_unmultiplied #5088

Merged: 7 commits into emilk:master on Sep 10, 2024

Conversation

@YgorSouza (Contributor) commented on Sep 7, 2024:

Improves performance significantly (about 40 times) according to the benchmarks.

@YgorSouza (Contributor, Author) commented on Sep 7, 2024:

Before the lookup table:

from_rgba_unmultiplied_0
                        time:   [252.79 ps 254.87 ps 257.04 ps]

from_rgba_unmultiplied_100
                        time:   [64.423 ns 64.762 ns 65.145 ns]

from_rgba_unmultiplied_255
                        time:   [788.93 ps 792.09 ps 795.81 ps]

After the lookup table:

from_rgba_unmultiplied_0
                        time:   [278.30 ps 283.50 ps 289.02 ps]
                        change: [+12.801% +15.804% +18.386%] (p = 0.00 < 0.05)
                        Performance has regressed.

from_rgba_unmultiplied_100
                        time:   [1.5295 ns 1.5319 ns 1.5343 ns]
                        change: [-97.622% -97.613% -97.604%] (p = 0.00 < 0.05)
                        Performance has improved.

from_rgba_unmultiplied_255
                        time:   [724.31 ps 725.98 ps 727.53 ps]
                        change: [-8.9849% -8.6971% -8.3678%] (p = 0.00 < 0.05)
                        Performance has improved.

Edit: updated benchmarks after fixing black_box mistakes.

@YgorSouza (Contributor, Author) commented:

Note that this will increase the binary size, of course, which is bad for WASM in particular. The alternative would be to build the lookup table at runtime, as suggested in the issue. The downsides are that it needs a dependency (once_cell, which is already used in a lot of places and can be removed once the MSRV moves to 1.80) and that the first call would be very slow.

Yet another possibility would be to build a lookup table only for linear_f32_from_gamma_u8 instead, which would be only 1KB, and could be used both ways (with partition_point). powf seems to be the bottleneck according to the profile posted in the issue, so maybe that would be good enough. I'm going to give it a try.
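
A rough sketch of that idea (illustrative only, not the code in this PR; the table parameter stands in for the hypothetical precomputed [f32; 256] gamma-to-linear array, and exact rounding to the nearest entry would need more care than this):

fn gamma_u8_from_linear_f32_via_table(table: &[f32; 256], l: f32) -> u8 {
    let l = l.clamp(0.0, 1.0);
    // The table is monotonically increasing, so partition_point can binary-search
    // for the first entry that reaches `l`.
    let idx = table.partition_point(|&v| v < l);
    idx.min(255) as u8
}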

@YgorSouza marked this pull request as draft on September 8, 2024, 05:38.
@veyh commented on Sep 8, 2024:

Isn't building the lookup table on the fly pretty much the same as (currently) displaying a 256x256 image with transparency? I would think it's not really that slow.

Alternatively, the 1KB table could be used for WASM and the 64KB table for other platforms? (If there is enough of a difference between the two.)

@YgorSouza (Contributor, Author) commented:

> Yet another possibility would be to build a lookup table only for linear_f32_from_gamma_u8 instead, which would be only 1KB, and could be used both ways (with partition_point). powf seems to be the bottleneck according to the profile posted in the issue, so maybe that would be good enough. I'm going to give it a try.

So I tried reverting the previous change and adding this:

#[inline(always)]
pub fn linear_f32_from_gamma_u8(s: u8) -> f32 {
    lookup::LINEAR_F32_FROM_GAMMA_U8[usize::from(s)]
}

#[inline(always)]
pub fn gamma_u8_from_linear_f32(l: f32) -> u8 {
    // The -1e-4 fudge absorbs rounding errors in the table entries.
    let threshold = (l - 1e-4).clamp(0.0, 1.0);
    // The gamma -> linear curve lies below the identity line, so the answer
    // is at least threshold * 255; start the search there.
    let lower_bound = (threshold * 255.0) as u8;
    // Linear search for the first table entry that reaches the threshold.
    (lower_bound..u8::MAX)
        .find(|&s| linear_f32_from_gamma_u8(s) >= threshold)
        .unwrap_or(u8::MAX)
}

Here lookup::LINEAR_F32_FROM_GAMMA_U8 is a [f32; 256] array that I generated and pasted directly in the code. The -1e-4 offset is needed for the unit tests to pass because of rounding errors. I got this result on the bench:

from_rgba_unmultiplied_0
                        time:   [127.57 ps 127.86 ps 128.14 ps]

from_rgba_unmultiplied_100
                        time:   [15.520 ns 15.545 ns 15.570 ns]

from_rgba_unmultiplied_255
                        time:   [134.04 ps 134.38 ps 134.68 ps]

That is only about 4 times faster than the current implementation on master, and a lot slower than the version with the big lookup table that is currently in this PR. I tried using partition_point instead of a linear search, but it was even slower. I'm going to see if I can come up with something better; otherwise I'll leave it as it is currently, or try the once_cell approach.
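
For reference, the pasted [f32; 256] table can be generated with a small one-off along these lines (a sketch assuming the standard sRGB decode curve; ecolor's actual linear_f32_from_gamma_u8 constants may differ slightly):

fn print_linear_from_gamma_table() {
    print!("pub const LINEAR_F32_FROM_GAMMA_U8: [f32; 256] = [");
    for s in 0u32..256 {
        let x = s as f32 / 255.0;
        // Standard sRGB gamma -> linear decode.
        let lin = if x <= 0.04045 {
            x / 12.92
        } else {
            ((x + 0.055) / 1.055).powf(2.4)
        };
        print!("{lin:?}, ");
    }
    println!("];");
}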

Also, I realized that my benchmarks aren't right: I'm not using black_box on the inputs, which allows the compiler to do more work at compile time than it's supposed to. I'm going to fix that and edit all the benchmarks I've posted so far with the corrected values.
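
With black_box on the inputs, the benchmark looks roughly like this (a sketch only; the actual benchmark in crates/epaint/benches/benchmark.rs may be structured differently, and the input generation here is just illustrative):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ecolor::Color32;

fn bench_from_rgba_unmultiplied(c: &mut Criterion) {
    c.bench_function("from_rgba_unmultiplied_other", |b| {
        b.iter(|| {
            for i in 0u32..1000 {
                // black_box hides the inputs from the optimizer and keeps the
                // result from being discarded.
                let [r, g, bl, a] = black_box(i.wrapping_mul(0x9E37_79B9).to_ne_bytes());
                black_box(Color32::from_rgba_unmultiplied(r, g, bl, a));
            }
        });
    });
}

criterion_group!(benches, bench_from_rgba_unmultiplied);
criterion_main!(benches);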

@YgorSouza (Contributor, Author) commented:

I added the once_cell version. The benchmark went up to 2.9 ns (from 1.5 ns with the array in program memory); I guess dereferencing the Lazy isn't completely free. It's still 20 times faster than the current implementation. I left it as a separate commit so it can easily be reverted depending on which tradeoff is preferred. And I don't think I'm going to bother with the f32 lookup table for the gamma value I mentioned above; it's too finicky, and I couldn't get it to perform well enough to be worth the trouble.

Inline review comment on this part of the diff:

let b = gamma_u8_from_linear_f32(b_lin * a_lin);

Self::from_rgba_premultiplied(r, g, b, a)
match a {

Might be worth checking if skipping the match for the 0/255 cases is actually faster. At least on my machine, it is.

With match

from_rgba_unmultiplied_100
                        time:   [2.0122 ns 2.0443 ns 2.0813 ns]

Just using the table

from_rgba_unmultiplied_100
                        time:   [1.6584 ns 1.6675 ns 1.6772 ns]
                        change: [-18.696% -17.776% -16.940%] (p = 0.00 < 0.05)
                        Performance has improved.
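
For reference, "just using the table" means something like this branch-free version (a sketch only; the alpha-major [u8; 256 * 256] table layout is an assumption, not necessarily what the PR uses):

fn from_rgba_unmultiplied_table_only(
    r: u8, g: u8, b: u8, a: u8,
    lut: &[u8; 256 * 256],
) -> [u8; 4] {
    // No 0/255 fast paths: the table entries already handle those alphas correctly.
    let premul = |c: u8| lut[usize::from(a) * 256 + usize::from(c)];
    [premul(r), premul(g), premul(b), a]
}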

@YgorSouza (Contributor, Author) replied:

Yes, but the common cases become slower, so I decided to just leave it as it was before.

Reply:

Yeah, that is true.

@YgorSouza marked this pull request as ready for review on September 8, 2024, 18:54.
@emilk (Owner) left a review comment:

Thanks for working on this! I think it is worth avoiding adding 64kB to the .wasm size.

I think you should be able to replace once_cell with std::cell::OnceCell

Review comments on crates/ecolor/src/color32.rs and crates/epaint/benches/benchmark.rs (now outdated and resolved).
Commit message notes from this PR:
- It was noted that it is fairly slow, so this benchmark is added to evaluate performance gains.
- Using 1000 values so we can more easily infer the time for a single call.
- This is about 50 times faster according to the benchmarks, at the expense of storing a 65 KB lookup table in the program memory.
- So we don't bloat the binary size, which would be a problem especially for WASM.
- It is not useful anywhere else, so there is no need for it to be outside.
- The table is built at runtime and used locally, so the endianness will always be the same on both sides.
@YgorSouza (Contributor, Author) commented:

Updated benchmarks:

(Note that this now refers to the time to call the function 1000 times with different values)

Without LUT

from_rgba_unmultiplied_0
                        time:   [416.57 ns 419.80 ns 422.97 ns]

from_rgba_unmultiplied_other
                        time:   [75.650 µs 75.769 µs 75.856 µs]

from_rgba_unmultiplied_255
                        time:   [1.0988 µs 1.1004 µs 1.1020 µs]

With LUT

from_rgba_unmultiplied_0
                        time:   [501.18 ns 501.74 ns 502.29 ns]
                        change: [+20.148% +21.106% +21.946%] (p = 0.00 < 0.05)
                        Performance has regressed.

from_rgba_unmultiplied_other
                        time:   [1.6264 µs 1.6273 µs 1.6281 µs]
                        change: [-97.868% -97.856% -97.846%] (p = 0.00 < 0.05)
                        Performance has improved.

from_rgba_unmultiplied_255
                        time:   [1.1058 µs 1.1067 µs 1.1075 µs]
                        change: [+0.2530% +0.4257% +0.5968%] (p = 0.00 < 0.05)
                        Change within noise threshold.

The 0 case is slightly slower now, not sure why. But it is still way faster than the others.

@YgorSouza (Contributor, Author) commented:

> Thanks for working on this! I think it is worth avoiding adding 64kB to the .wasm size.
>
> I think you should be able to replace once_cell with std::cell::OnceCell

OnceCell can't be used in a static (it isn't thread-safe), but I used OnceLock instead. It's actually twice as fast as the once_cell crate, too.
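
For reference, a minimal sketch of that shape (the identifiers and the alpha-major 256 * 256 table layout are assumptions, not the exact code in this PR; the gamma conversion functions are the existing ecolor ones):

use ecolor::{gamma_u8_from_linear_f32, linear_f32_from_gamma_u8};
use std::sync::OnceLock;

fn premultiply_lut() -> &'static [u8; 256 * 256] {
    static LUT: OnceLock<[u8; 256 * 256]> = OnceLock::new();
    LUT.get_or_init(|| {
        // Built once, on first use, at runtime: roughly 64 KB of RAM instead
        // of 64 KB of binary size.
        let mut table = [0_u8; 256 * 256];
        for a in 0..=255_u8 {
            let a_lin = a as f32 / 255.0; // alpha is assumed to be linear here
            for c in 0..=255_u8 {
                // Premultiply in linear space, then encode back to gamma space.
                let c_lin = linear_f32_from_gamma_u8(c);
                table[usize::from(a) * 256 + usize::from(c)] =
                    gamma_u8_from_linear_f32(c_lin * a_lin);
            }
        }
        table
    })
}

// Usage inside from_rgba_unmultiplied would then be one indexed load per channel, e.g.:
// let r = premultiply_lut()[usize::from(a) * 256 + usize::from(r)];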

@emilk added the labels performance (Lower CPU/GPU usage (optimize)) and epaint on Sep 10, 2024.
@emilk merged commit f897405 into emilk:master on Sep 10, 2024 (21 of 22 checks passed).
hacknus pushed a commit to hacknus/egui that referenced this pull request on Oct 30, 2024:

Use precomputed lookup table in Color32::from_rgba_unmultiplied (emilk#5088)

Improves performance significantly (about 40 times) according to the benchmarks.

* Closes <emilk#5086>
* [x] I have followed the instructions in the PR template