Fastclear Bug With Intel Mesa Adapters on the GL Backend #1627

Closed
zicklag opened this issue Jul 10, 2021 · 24 comments · Fixed by #1717
Labels
api: gles Issues with GLES or WebGL external: driver-bug A driver is causing the bug, though we may still want to work around it type: bug Something isn't working

Comments

@zicklag
Contributor

zicklag commented Jul 10, 2021

Description
When using the OpenGL backend on Linux, the clear color seems to be behaving strangely. For one, the color itself is lighter than it should be. For another, the borders around any objects drawn on top of the clear color have an even lighter, pixelated version of the clear color.

I'm currently troubleshooting but I opened the issue to start the discussion. With my experimentation so far it seems completely related to the act of clearing the draw framebuffer. All other rendering and examples seem to work fine, and if you have an example where you can't see the clear color, such as the skybox example, everything looks great.

Also I've noticed some weird related behavior in my Renderdoc captures:

When I launch the example with Renderdoc it has the same problem as running it without Renderdoc ( which makes sense ):

image

And when I view the renderbuffer contents after the initial clear in renderdoc, it shows the clear color like it shows in the render, lighter than it should be:

image

The clear color shows the same in the draw step, all the way until the final framebuffer blit, where it is dark enough ( but still with pixelated edges ):

image

Yet, when I hover over the pixels in the image, the little thumbnail at the bottom shows the wrong lighter color:

image

Very strange.

I found that I could get rid of the pixelated edges by forcing the renderbuffer pixel format to be RGBA8, but the color was still off. I think that's the closest lead I have and I'm going to look into how different pixel formats affect it, and maybe try binding the framebuffer storage to a texture instead of a renderbuffer and see if that makes any difference.


PS: Very excited that the new GL backend on wgpu-hal is working for all the examples! This is the first time I've tried it that the shadow, boids, and skybox examples have worked. I might try to tackle #1617, but I figured I'd try to get this one out of the way first. :)

Repro steps
Run the cube or shadow example with the OpenGL backend.

Expected vs observed behavior
There should be no pixelated edges around objects and the clear color should be darker.

Expected:
image

Actual ( I think it's fine if the lines for the triangles aren't there, the issue is the background color ) :
image

Extra materials

Shadow example:
image

Platform
Running WGPU cube or shadow examples on Linux Pop!_OS ( Ubuntu ) 20.04. Adapter info:

[2021-07-10T16:54:11Z INFO  wgpu_hal::gles::adapter] Vendor: Intel
[2021-07-10T16:54:11Z INFO  wgpu_hal::gles::adapter] Renderer: Mesa Intel(R) UHD Graphics (CML GT2)
[2021-07-10T16:54:11Z INFO  wgpu_hal::gles::adapter] Version: OpenGL ES 3.2 Mesa 20.0.8
[2021-07-10T16:54:11Z INFO  wgpu_hal::gles::adapter] SL version: OpenGL ES GLSL ES 3.20
@zicklag
Contributor Author

zicklag commented Jul 10, 2021

I'm just realizing that the clear color looks lighter because the entire render is lighter than it should be, not just the clear color.

Forcing the pixel format of the renderbuffer to RGB8 fixes the pixelated edges ( that can't be a final fix, but it gives us a clue ). Then when I do a Renderdoc capture and go to the last step of the frame, I can actually save the texture for the final image and it looks exactly how we want it, but when we run the actual example in the window, it looks wrong. I wonder if the color problem is a compositing thing, and the pixelated edges are a pixel format thing.

Export From Renderdoc:

test

Screenshot of app ( for the same capture as above renderdoc export ):

image


Edit: Additional note, the pixelated edges seem to be related to the alpha channel of the framebuffer. Changing the pixel format fixed the issue because it took out the alpha channel. Another fix that works is to call gl.color_mask(true, true, true, false) to disable alpha channel writes on the draw framebuffer.

@zicklag
Contributor Author

zicklag commented Jul 10, 2021

Hey, I think I might have gotten it! #1628. Not 100% sure it's a final solution, but we can continue discussion in the PR ( I put extra comments in there ).

image

@cwfitzgerald
Member

So this is actually a long-standing bug on Intel cards on Linux with sRGB. I'm not sure how we should reasonably work around this, but it likely needs to be internal with shader rewriting. I'll take a look at your PR later tonight.

@kvark
Member

kvark commented Jul 11, 2021

@cwfitzgerald do you have a reference to the bug somewhere? Is this related to COLORSPACE_SRGB, or to the alpha channel writes that are blocked by #1628 ?

@cwfitzgerald cwfitzgerald added external: driver-bug A driver is causing the bug, though we may still want to work around it type: bug Something isn't working labels Jul 11, 2021
@cwfitzgerald
Member

cwfitzgerald commented Jul 11, 2021

@kvark I've never actually filed this as I first hit it back when I was a baby graphics programmer and didn't know how to do such things :) I need to collect the information around it (as I suspect #725 is related) and file it up to the appropriate places. Maybe I'll hit up mesa on IRC now that that's a thing I know I can do.

@zicklag Thank you for filing this! It is a very detailed issue which is always much appreciated!

So a couple housekeeping things first. Do the reftests work correctly on gl on your machine? You can test this by running WGPU_BACKEND=gl cargo test --example <example> -- --test-threads=1. If they do, this indicates it is something related to swapchain shenanigans.

So looking at the images, it looks like what you put as the "expected" image is actually completely missing gamma correction and the erroneous image properly has gamma correction (though with artifacts). I can reproduce this lack of sRGB conversion on my intel/linux machine on both vulkan and GL. I also confirmed on a separate machine that the darker image shows that there is no proper sRGB transformation going on when there is supposed to be.

So this is 100% a driver bug at this point. As a user, you can do sRGB conversion in your fragment shader or tonemapping pass with a regular framebuffer and it will work as expected.
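For reference, the per-channel math such a manual conversion would perform can be sketched on the CPU side. This is only an illustration of the standard sRGB transfer function, not code from wgpu; the names linear_to_srgb and encode_color are made up for this example, and a real workaround would do the same arithmetic in the fragment shader.

```rust
/// Encode one linear-light channel value with the standard sRGB transfer
/// function. Input is clamped to [0, 1] first.
/// (Hypothetical helper for illustration; not part of wgpu.)
fn linear_to_srgb(linear: f32) -> f32 {
    let c = linear.clamp(0.0, 1.0);
    if c <= 0.003_130_8 {
        // Linear segment near black.
        12.92 * c
    } else {
        // Power-law segment for the rest of the range.
        1.055 * c.powf(1.0 / 2.4) - 0.055
    }
}

/// Encode an RGBA color. Alpha is not a color channel, so it is left as-is.
fn encode_color(rgba: [f32; 4]) -> [f32; 4] {
    [
        linear_to_srgb(rgba[0]),
        linear_to_srgb(rgba[1]),
        linear_to_srgb(rgba[2]),
        rgba[3],
    ]
}
```

Encoded values are larger than the linear inputs for mid-range colors, which matches the observation that the image with the conversion applied looks lighter than the one without it.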

There won't be a simple fix for this unfortunately. Basically we're going to need to:

  • Detect we're on intel/mesa.
  • Lie about creating an srgb swapchain.
  • Inject srgb conversion into all shaders that are writing to the swapchain.

This has some major hurdles that need to be cleared first:

  • We need to figure out standard ways for us to implement driver bug workarounds.
  • Naga needs to be able to inject srgb conversion.
  • We need to have the ability to have more than one backend shader program per wgpu pipeline. This is because, at pipeline creation, we don't know if the pipeline is going to be used for rendering to a swapchain or not. We need to decide at pipeline bind time.
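To make the last hurdle concrete, here is a minimal sketch of a pipeline that lazily keeps one backend program per kind of render target and picks one at bind time. Everything in it (TargetKind, Program, PipelineVariants, and the string-based "compile" step) is hypothetical and stands in for real shader compilation; it is not how wgpu-hal is structured.

```rust
use std::collections::HashMap;

/// Hypothetical: the kind of target a pipeline is being bound against.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum TargetKind {
    Offscreen,
    Swapchain,
}

/// Hypothetical stand-in for a compiled backend shader program.
#[derive(Debug, PartialEq, Eq)]
struct Program {
    source: String,
}

/// One logical pipeline that owns up to one program per target kind.
struct PipelineVariants {
    source: String,
    programs: HashMap<TargetKind, Program>,
    compile_count: usize,
}

impl PipelineVariants {
    fn new(source: &str) -> Self {
        Self {
            source: source.to_owned(),
            programs: HashMap::new(),
            compile_count: 0,
        }
    }

    /// Called at bind time: returns the variant for `target`, compiling it
    /// on first use and reusing it afterwards.
    fn program_for(&mut self, target: TargetKind) -> &Program {
        // Split the borrow so the closure can touch the other fields.
        let Self { source, programs, compile_count } = self;
        programs.entry(target).or_insert_with(|| {
            *compile_count += 1;
            let text = match target {
                TargetKind::Offscreen => source.clone(),
                // Placeholder for "inject sRGB conversion into the shader".
                TargetKind::Swapchain => format!("{}\n// + injected sRGB encode", source),
            };
            Program { source: text }
        })
    }
}
```

Binding the same pipeline to a swapchain twice compiles the swapchain variant only once, while an offscreen bind gets the unmodified program.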

@kvark
Member

kvark commented Jul 11, 2021

Injecting shader code isn't very practical for this, since we wouldn't be able to blend in proper space, anyway.
So let's reach out to Intel/Mesa and see if there is a better workaround. In the worst case, we can disable presentation entirely (i.e. say that no adapter is compatible with a surface) for this platform.

@cwfitzgerald
Member

So the blocking around the triangle is caused by https://gitlab.freedesktop.org/mesa/mesa/-/issues/2565. Can you run with the env flag INTEL_DEBUG=nofc and see if the bug goes away? This isn't a long term solution, but shows that it is the problem.

The issue about vk not properly doing srgb seems unrelated.

@cwfitzgerald
Member

@zicklag are you using wayland?

@zicklag
Contributor Author

zicklag commented Jul 12, 2021

No, I'm running X11.

PS: I'm going to look through your comments and test out what you suggested probably within the next few hours.

@cwfitzgerald
Member

cwfitzgerald commented Jul 12, 2021

Interesting, the original issue was different behavior on wayland.

Thanks for testing!

@zicklag
Contributor Author

zicklag commented Jul 12, 2021

Thanks for testing!

No problem! :D

Do the reftests work correctly on gl on your machine?

No. Neither the shadow nor the cube example will pass the reftest for OpenGL.

So looking at the images, it looks like what you put as the "expected" image is actually completely missing gamma correction and the erroneous image properly has gamma correction (though with artifacts).

Oh, that's interesting. Makes sense now that I look at the screenshots in the example dirs. :) The reftests for Vulkan will interestingly still pass despite the sRGB difference. Not sure if that's expected.

Can you run with the env flag INTEL_DEBUG=nofc and see if the bug goes away?

That fixed it! Nice to know that something so simple was actually a driver bug and not something wrong with the code. I didn't write the code, but I couldn't figure out for the life of me why a gl.clear_buffer() would be so weird. Also, the reftests will pass with that environment variable set as well.

image


So, in summary, there are 3 separate issues here, if I understand correctly:

  • Vulkan not doing sRGB conversion on Linux: This is happening for me on my Linux machine
  • GL not doing sRGB conversions on Linux: @cwfitzgerald did you say you were getting the darker image even on GL on your machine? I'm not getting that on my machine for GL. Just for Vulkan.
  • Blocky clear borders on GL: This is because of a mesa bug, and can short term be worked around by setting an environment variable.

That leaves the actionable items to be:

  • Figure out how to ensure sRGB conversion is done for Vulkan and GL on Linux
  • Figure out how we want to workaround the mesa bug on GL

Does that sound about right?

@zicklag
Contributor Author

zicklag commented Jul 12, 2021

Another data point I just realized is that in my experiment to port Bevy's latest rendering branch from WGPU 0.9 to WGPU master I found out that everything looked darker, which is probably because of the gamma correction/sRGB conversion problem ( I don't know the difference 😁 )

I could possibly try to do a git bisect to see what commit caused it. Though it might be some massive commit like "switch to wgpu-hal" or something...

Also, this is with the Vulkan backend, GL isn't working yet.

WGPU 0.9:

image

WGPU master:

image

@zicklag
Contributor Author

zicklag commented Jul 12, 2021

Yeah, the last time that I can find vulkan gamma working correctly was this commit: 1920606. And the first time I can find it broken ( where it also compiles on Linux ) is this commit: 5578222.

It was the move to using wgpu-hal for vulkan support, so no small, easy update to find. I'll have to try to compare gfx's vulkan backend and how wgpu-hal differs in the instance creation or something.

@cwfitzgerald
Member

No. Neither the shadow nor the cube example will pass the reftest for OpenGL.

Could you upload what the files wgpu/examples/<name>/screenshot-actual.png and screenshot-difference.png are for the failing tests? This will show why they are failing. Both should pass on the current GL impl.

The reftests for Vulkan will interestingly still pass despite the sRGB difference. Not sure if that's expected.

Yeah I think this is a swapchain thing, the reftests are headless so don't deal with the swapchain.

GL not doing sRGB conversions on Linux

Yeah my haswell rig has srgb issues on both vk and gl.

Does that sound about right?

Yup! Thank you for this amazingly clear issue!

It was the move to using wgpu-hal for vulkan support

Yeah I kinda feared that. Thanks for taking a look at the difference. I'm surprised this is only manifesting on intel though.

@zicklag
Contributor Author

zicklag commented Jul 12, 2021

Could you upload what the files wgpu/examples/<name>/screenshot-actual.png and screenshot-difference.png are for the failing tests? This will show why they are failing. Both should pass on the current GL impl.

Yep. It's just failing because of the intel fastclear bug. When passing INTEL_DEBUG=nofc it works fine.

( screenshot-actual, screenshot-difference )

Yeah my haswell rig has srgb issues on both vk and gl.

Interesting. Could it have to do with the EGL version, maybe? It seems that sRGB is only enabled for EGL >= 1.5:

    if inner.version >= (1, 5) {
        // Always enable sRGB in EGL 1.5
        attributes.push(egl::GL_COLORSPACE as usize);
        attributes.push(egl::GL_COLORSPACE_SRGB as usize);
    }
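As a standalone illustration of that version gate: Rust compares tuples lexicographically, so (1, 4) fails the check and the colorspace attributes are never requested. The function name surface_attributes is made up for this sketch, and the constants are written out by hand from the EGL 1.5 headers as I remember them (an assumption worth checking against egl.h); the real backend uses the values from its EGL bindings.

```rust
// Token values as recalled from the EGL 1.5 headers (assumed, verify
// against egl.h before relying on them).
const EGL_GL_COLORSPACE: usize = 0x309D;
const EGL_GL_COLORSPACE_SRGB: usize = 0x3089;
const EGL_NONE: usize = 0x3038;

/// Hypothetical helper: build a surface attribute list for a given EGL
/// version, requesting an sRGB colorspace only on EGL 1.5 or newer.
fn surface_attributes(egl_version: (i32, i32)) -> Vec<usize> {
    let mut attributes = Vec::new();
    // Tuple comparison is lexicographic: major first, then minor.
    if egl_version >= (1, 5) {
        attributes.push(EGL_GL_COLORSPACE);
        attributes.push(EGL_GL_COLORSPACE_SRGB);
    }
    // Attribute lists are terminated with EGL_NONE.
    attributes.push(EGL_NONE);
    attributes
}
```

On an EGL 1.4 implementation this returns only the terminator, which would explain a surface that silently skips sRGB conversion.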

@cwfitzgerald
Member

oh. yeah that would do it XD

This is gonna be fun to try to work around, though GL already relies on a blit, so it should be possible.

@cwfitzgerald
Member

So I can reproduce the srgb issue on intel/vulkan/windows, so issue 1 is definitely a vulkan backend issue.

@zicklag
Contributor Author

zicklag commented Jul 13, 2021

So I can reproduce the srgb issue on intel/vulkan/windows, so issue 1 is definitely a vulkan backend issue.

Yeah, I'm still looking around for what might cause it, but I haven't found anything yet. The swapchain creation looks nearly exactly the same as the gfx backend, so I'm looking around other places but nothing has stood out to me.


Let me know if you want to open separate issues for these, I don't care either way, but on the topic of the Intel fastclear bug, from my tests we can avoid it by setting the environment variable inside the GL backend at instance creation, but unfortunately the environment variable needs to be set before we load the X11 ( and I'm assuming wayland, too ) display. This means we can't do it while enumerating adapters, which is where it would be nicest to do it because we could wait to add the environment variable until we found out that they were actually using an Intel Mesa adapter.

Do you think it's safe enough just to set the INTEL_DEBUG=nofc variable at instance creation for now and do that regardless of what device you are using?

@cwfitzgerald
Member

Yeah, we probably should split out the bugs into separate issues; this one could be for the fast clear issue. If you can do it, that'd be great, otherwise I'll do it a bit later today.

Do you think it's safe enough just to set the INTEL_DEBUG=nofc variable at instance creation for now and do that regardless of what device you are using?

Yeah this should be fine, as long as we always set it on linux. It shouldn't affect any other adapters as it's an env only for the intel cards. This is a fine (if slow) short term solution, but we should work to see if we can find a way to prevent the bug from occurring at all because this is pessimising intel cards w/o the bug.
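A minimal sketch of what setting the variable at instance creation could look like. The helper names with_nofc and disable_intel_fastclear are made up for this example and this is not the code from the eventual PR. It assumes INTEL_DEBUG takes a comma-separated list of flags, so it appends nofc instead of overwriting whatever the user already set.

```rust
use std::env;

const VAR: &str = "INTEL_DEBUG";
const FLAG: &str = "nofc";

/// Hypothetical helper: compute the new INTEL_DEBUG value, keeping any
/// flags the user already set (assumes a comma-separated flag list).
fn with_nofc(existing: Option<&str>) -> String {
    match existing {
        None => FLAG.to_owned(),
        Some(value) if value.trim().is_empty() => FLAG.to_owned(),
        // Already present: leave the value untouched.
        Some(value) if value.split(',').any(|f| f.trim() == FLAG) => value.to_owned(),
        Some(value) => format!("{},{}", value, FLAG),
    }
}

/// Must run before the X11/Wayland display is opened, i.e. before any
/// adapter can be inspected, which is why it cannot be Intel-only.
#[allow(unused_unsafe)]
fn disable_intel_fastclear() {
    let merged = with_nofc(env::var(VAR).ok().as_deref());
    // `set_var` is an unsafe fn in newer Rust editions because it races
    // with other threads reading the environment; call it early, while
    // the process is still single-threaded.
    unsafe { env::set_var(VAR, merged) };
}
```

Calling disable_intel_fastclear() as the first step of instance creation keeps the workaround in one place, at the cost of disabling fast clears on Intel hardware that does not have the bug.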

zicklag added a commit to katharostech/wgpu that referenced this issue Jul 13, 2021
@zicklag zicklag changed the title Clear Color Rendering Strangely on OpenGL Fastclear Bug With Intel Mesa Adapters Jul 13, 2021
@zicklag
Contributor Author

zicklag commented Jul 13, 2021

Great. I opened #1645 with the short-term solution, and I renamed this issue to be specific to the fastclear bug. I'll open a new issue for the Vulkan sRGB bug.

@zicklag zicklag changed the title Fastclear Bug With Intel Mesa Adapters Fastclear Bug With Intel Mesa Adapters on the GL Backend Jul 13, 2021
bors bot added a commit that referenced this issue Jul 13, 2021
1645: Disable Intel Fastclear in GL Backend r=kvark a=zicklag

This works around a Mesa bug on Intel cards:

- https://gitlab.freedesktop.org/mesa/mesa/-/issues/2565
- #1627 (comment)

**Connections**
Related to pixelated edges in GL backend brought up in #1627.

**Description**
This just adds the `INTEL_DEBUG=nofc` environment variable setting when creating an `Instance` using the GL backend in `wgpu_hal`. This is just a workaround until the mesa bug is fixed.

I wanted to wait until adapters were enumerated to determine that the user wanted to use an Intel Mesa adapter, but the environment variable has to be set before the X11 display is opened, so that wasn't an option.

This may not be the strategy we want to take with this one, but it seems relatively harmless. Because the environment variable is prefixed with `INTEL` anyway, it should not affect devices other than the ones we want it to, which would be good.

**Testing**
I tested this on Ubuntu 20.04 with Mesa Intel(R) UHD Graphics (CML GT2) using the GL backend.


Co-authored-by: Zicklag <zicklag@katharostech.com>
zarik5 pushed a commit to zarik5/wgpu that referenced this issue Jul 17, 2021
@cwfitzgerald
Member

cwfitzgerald commented Jul 22, 2021

So I think we should do the following for dealing with the fastclear bug:

  1. We should counter this by doing an explicit shader clear when you do a clear on affected hardware (run a fullscreen triangle that just outputs the clear color). This is what a slow clear is actually doing (and why fastclears are so much faster), so it should be no worse.
  2. We can tell if WebGL is affected by using the WEBGL_debug_renderer_info extension to get UNMASKED_RENDERER_WEBGL and UNMASKED_VENDOR_WEBGL. We should be using this to get the device/vendor information anyway. You can check this on your own hardware through https://webglreport.com/?v=2. If this information is unavailable, we should fall back on the normal information, though it would be completely unhelpful. If it's not there, we're SOL for fixing this bug.
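Both points can be sketched in a few lines. The string heuristic below is only a guess modeled on the adapter log earlier in this thread (Vendor "Intel", Renderer "Mesa Intel(R) UHD Graphics (CML GT2)"); the strings reported through WEBGL_debug_renderer_info may look different, and the function names are made up for this example. The second function gives clip-space positions for the usual three-vertex triangle that covers the whole viewport.

```rust
/// Hypothetical heuristic: treat the adapter as affected when the
/// vendor/renderer strings mention both Intel and Mesa.
fn is_intel_mesa(vendor: &str, renderer: &str) -> bool {
    let vendor = vendor.to_lowercase();
    let renderer = renderer.to_lowercase();
    (vendor.contains("intel") || renderer.contains("intel")) && renderer.contains("mesa")
}

/// Clip-space position of vertex `index` (0, 1 or 2) of a single triangle
/// that covers the whole [-1, 1] x [-1, 1] viewport.
fn fullscreen_triangle_vertex(index: u32) -> [f32; 2] {
    // (-1, -1), (3, -1), (-1, 3): the parts outside the viewport are
    // clipped, so no second triangle or vertex buffer is needed.
    let x = if index == 1 { 3.0 } else { -1.0 };
    let y = if index == 2 { 3.0 } else { -1.0 };
    [x, y]
}
```

A fragment shader paired with these vertices would simply output the clear color as a uniform, which is the "shader clear" described in point 1.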

@zicklag
Contributor Author

zicklag commented Jul 22, 2021

Just finished taking both of those measures and it now successfully works around the bug on both desktop and WebGL. 🎉 Currently it's in my WebGL branch. I'm not sure if it's helpful or not, but let me know if you want me to split it out to a separate PR for just the desktop fastclear fix instead of leaving it merged with WebGL.

@cwfitzgerald
Member

Amazing! Yes, splitting it out would be preferred; it'll make it easier to review.

@zicklag
Contributor Author

zicklag commented Jul 23, 2021

There you go! Created #1717. It turned out a little messier than I had hoped because, in order to draw the triangle for the shader clear, I had to add a bunch of boolean state values to keep track of whether gl::DEPTH_TEST and friends were currently enabled or disabled so that I could re-set those values back to whatever they were after disabling them all to draw the triangle.

I'm not sure if there's a simpler way to do that, but it's all I could come up with. Let me know if you have any ideas!
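The bookkeeping described above can be sketched roughly like this. It is not the code from #1717; Capability, the Gl trait, StateTracker and RecordingGl are all hypothetical names, with a fake GL that just records calls so the save/disable/draw/restore order can be checked without a real context.

```rust
/// Hypothetical subset of GL capabilities that would interfere with a
/// fullscreen clear triangle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Capability {
    DepthTest,
    StencilTest,
    ScissorTest,
    Blend,
    CullFace,
}

const ALL: [Capability; 5] = [
    Capability::DepthTest,
    Capability::StencilTest,
    Capability::ScissorTest,
    Capability::Blend,
    Capability::CullFace,
];

/// Minimal stand-in for the GL calls the tracker needs.
trait Gl {
    fn enable(&mut self, cap: Capability);
    fn disable(&mut self, cap: Capability);
}

/// Remembers which capabilities are currently enabled.
#[derive(Default)]
struct StateTracker {
    enabled: [bool; 5],
}

impl StateTracker {
    /// Route every enable/disable through here so the record stays in sync.
    fn set<G: Gl>(&mut self, gl: &mut G, cap: Capability, on: bool) {
        self.enabled[cap as usize] = on;
        if on { gl.enable(cap) } else { gl.disable(cap) }
    }

    /// Disable everything that was on, run `draw`, then turn it back on.
    fn with_all_disabled<G: Gl>(&self, gl: &mut G, draw: impl FnOnce(&mut G)) {
        for cap in ALL.iter().copied().filter(|c| self.enabled[*c as usize]) {
            gl.disable(cap);
        }
        draw(gl);
        for cap in ALL.iter().copied().filter(|c| self.enabled[*c as usize]) {
            gl.enable(cap);
        }
    }
}

/// Fake GL that records calls instead of talking to a driver.
#[derive(Default)]
struct RecordingGl {
    calls: Vec<String>,
}

impl Gl for RecordingGl {
    fn enable(&mut self, cap: Capability) {
        self.calls.push(format!("enable {:?}", cap));
    }
    fn disable(&mut self, cap: Capability) {
        self.calls.push(format!("disable {:?}", cap));
    }
}
```

Wrapping the clear in with_all_disabled keeps the restore logic in one place, so the caller only has to supply the draw itself.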

4 participants