-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CMAA2 integration #655
CMAA2 integration #655
Conversation
initial implemetation with following compromises: -redundant copies, since the algorithm is working with UAV in-place -edge detection happens in LDR; instead HDR luminance diff could be created during tonemapping
tonemapping changed to compute when CMAA2 is on which allows to save perf by not copying tonemap RT into CMAA2 uav (since CMAA2 has effect only on some pixels, where complex or simple shapes are)
There's a problem with too dark HDR linear, so that edges are not detected with current threshold values
This reverts commit 5f977fc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have had only quick look yet. Some overall notes:
- Do we really need so many quality presets? Can we simplify code a bit by selection only one or two that do fit well?
- Definitely need to remove MSAA cases, MSAA is anti-aliasing by itself and also not supported in engine.
- Need to update submodules..
Also good finding in shader/copy_img.comp
!
shader/lighting/tonemapping.frag
Outdated
@@ -10,6 +10,32 @@ | |||
|
|||
#include "upscale/lanczos.glsl" | |||
|
|||
#if defined(COMPUTE) | |||
const uint THREADGROUP_SIZE = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to have a constant for THREADGROUP_SIZE
? gl_WorkGroupSize
should be just fine.
#include "cmaa2_common.glsl" | ||
|
||
layout(binding = 8) buffer UboWorkingExecuteIndirectBuffer { | ||
uint g_workingExecuteIndirectBuffer[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add explicit memory layout: std430
.
Also why an array? Would it be nicer to have:
uvec3 shapeCandidateIndirectArg;
?
|
||
#if CMAA2_EXTRA_SHARPNESS | ||
const float c_dampeningEffect = 0.11; | ||
#define g_CMAA2_LocalContrastAdaptationAmount 0.15f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why sometimes use const
and sometimes #define
?
Note: generally const
is preferred in opengothic, when possible.
void main() { | ||
uvec3 groupID = gl_WorkGroupID; | ||
|
||
if(groupID.x == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here gl_WorkGroupID
is uses as essentially a flag. Better to just use push-constant.
shader/CMakeLists.txt
Outdated
@@ -196,6 +196,9 @@ add_shader(tonemapping.vert copy.vert -DHAS_UV) | |||
add_shader(tonemapping.frag lighting/tonemapping.frag) | |||
add_shader(tonemapping_up.frag lighting/tonemapping.frag -DUPSCALE) | |||
|
|||
add_shader(tonemapping.comp lighting/tonemapping.frag -S comp -DCOMPUTE) | |||
add_shader(tonemapping_up.comp lighting/tonemapping.frag -S comp -DCOMPUTE -DUPSCALE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have need anti-aliasing + upscale together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In FSR1 documentation there's a note that it's better to use antialiased input to FSR1 and we have the same Lanczos filter for upscaling. But since any MLAA algorithm doesn't provide additional samples unlike TAA or MSAA, it seems that edge reconstruction is just a reverse process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove tonemapping_up
then
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About compy_img.comp: would you mind if I'll commit fix for local_size_y
separately?
Asking as review of PR might take a bit of time.
game/graphics/renderer.cpp
Outdated
static bool isFirstRun = true; | ||
// initialization that is needed only on the first run. Make it run only in the first run | ||
if (isFirstRun) { | ||
cmd.setUniforms(*cmaa2.prepareDispatchIndirectArguments, cmaa2.prepareDispatchIndirectArgumentsUbo, &processCandidatesSetupFlag, sizeof(uint32_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line feel a bit wordy, maybe prepareDispatchIndirectArguments
-> indirectArgs
?
} | ||
|
||
uint PackFloat32AsFloat16AndConvertToUint(float arg) { | ||
return packHalf2x16(vec2(arg, 0.)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why introduce proxy function, instead of calling packHalf2x16
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To not repeat packHalf2x16(vec2(arg, 0.)) for each channel and make it like f32tof16 in HLSL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably name should be something ala packHalf1x16
, to be consistent with existing GLSL functions?
shader/lighting/tonemapping.frag
Outdated
#if defined(COMPUTE) | ||
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; | ||
layout(binding = 2) uniform writeonly image2D tonemappedOutput; | ||
layout(r32f, binding = 3) uniform writeonly image2D hdrLumaOutput; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writeonly
doesn't require explicit pixel-format specifier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also compute variant can be removed, if you segregate sceneTonemapped
(pre CMAA) and anti-aliased sceneTonemapped
, that is written by deferred_color_apply_2x2.comp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's the same sceneTonemapped
as used in cmaa2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cmaa2_common.glsl;
layout(binding = 0) uniform sampler2D inputColorReadonly;
- not a storage-image, but a simple texture. Also good to rename it to sceneTonemapped
, in order to math with c++ side.
In deferred_color_apply_2x2.comp outputColor
is indeed a writable image, but you dont have to write into a same one.
Maybe deferred_color_apply_2x2 can later be converted into fragment-shader? It's not heavy one compute features, other then a unordered-write.
About copy_img.comp, yeah, it's better to commit separately |
layout(r32ui, binding = 6) uniform uimage2D workingDeferredBlendItemListHeads; | ||
|
||
layout(binding = 7) buffer UboWorkingControlBuffer { | ||
uint workingControlBuffer[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seem to be part of porting HLSL code. Based on use-cases:
workingControlBuffer[3]; // numCandidates, workingControlBuffer
workingControlBuffer[4]; // shapeCandidateCount
workingControlBuffer[8]; // blendLocationCount, edgeListCounter
workingControlBuffer[12]; // counterIndex
Haven't found other array elements to be in use. Maybe due to removal of extra variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, extra variants didnt require additional counters. I suppose counters are aligned for faster access. 3 and 4 are close probably because they are used together in the indirect setup pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a publication about atomics performance profile: https://www.youtube.com/watch?v=VaE_uKPfjv0
I don't remember details now, yet my takeaway was that only single variable case is fast-path. Anything else - unpredictable across multiple gpu vendors.
In any case no point of using array, if you really need padding - use struct:
buffer {
uint numCandidates;
uint shapeCandidateCount;
uint padding0[3];
uint blendLocationCount;
...
}
game/graphics/renderer.cpp
Outdated
@@ -187,7 +214,9 @@ void Renderer::initSettings() { | |||
|
|||
auto prevVidResIndex = settings.vidResIndex; | |||
settings.vidResIndex = Gothic::inst().settingsGetF("INTERNAL","vidResIndex"); | |||
settings.fxaaEnabled = (Gothic::options().fxaaPreset > 0) && (settings.vidResIndex==0); | |||
settings.cmaa2Enabled = (Gothic::options().cmaa2Preset>0) && (settings.vidResIndex==0); | |||
settings.fxaaEnabled = (Gothic::options().fxaaPreset>0) && (settings.vidResIndex==0) && !settings.cmaa2Enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we can just remove FXAA and also remove explicit name of the technique from command-line. Just -aa 1
, instead of -cmaa2 1
.
#define CMAA2_UAV_STORE_TYPED | ||
#define CMAA2_UAV_STORE_TYPED_UNORM_FLOAT | ||
|
||
#ifdef CMAA2_UAV_STORE_TYPED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably can keep only code relevant to CMAA2_UAV_STORE_TYPED_UNORM_FLOAT
:
rgba32f
is overkill, probably wont use it for RT anytime soon;
r32ui
(untyped) implies VK_KHR_image_format_list
(or similar extension), that we do not support
shader/CMakeLists.txt
Outdated
@@ -196,6 +196,9 @@ add_shader(tonemapping.vert copy.vert -DHAS_UV) | |||
add_shader(tonemapping.frag lighting/tonemapping.frag) | |||
add_shader(tonemapping_up.frag lighting/tonemapping.frag -DUPSCALE) | |||
|
|||
add_shader(tonemapping.comp lighting/tonemapping.frag -S comp -DCOMPUTE) | |||
add_shader(tonemapping_up.comp lighting/tonemapping.frag -S comp -DCOMPUTE -DUPSCALE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove tonemapping_up
then
} | ||
|
||
uint PackFloat32AsFloat16AndConvertToUint(float arg) { | ||
return packHalf2x16(vec2(arg, 0.)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably name should be something ala packHalf1x16
, to be consistent with existing GLSL functions?
shader/lighting/tonemapping.frag
Outdated
#if defined(COMPUTE) | ||
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; | ||
layout(binding = 2) uniform writeonly image2D tonemappedOutput; | ||
layout(r32f, binding = 3) uniform writeonly image2D hdrLumaOutput; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cmaa2_common.glsl;
layout(binding = 0) uniform sampler2D inputColorReadonly;
- not a storage-image, but a simple texture. Also good to rename it to sceneTonemapped
, in order to math with c++ side.
In deferred_color_apply_2x2.comp outputColor
is indeed a writable image, but you dont have to write into a same one.
Maybe deferred_color_apply_2x2 can later be converted into fragment-shader? It's not heavy one compute features, other then a unordered-write.
-indirect buffer size -naming -removed extra output formats
@KirillAlekseeenko Pr generally looks good. Last few changes can do by myself. Let me know, when you happy with current state of PR, so I can start testing/finalizing |
Seems that it's ready for integration |
Experimenting with edge detection in HDR space:
In LDR path, I've changed encoding from
Computing luminance in flight via At this point I'm thing to make HDR as default and only path, since it fits better into existing pipeline (0.06ms performance saved on RTX, compare to LDR!) and some VRAM bandwidth savings. @KirillAlekseeenko let me know what you think about it. |
It works fine but there's two-three times less candidate pixels than with LDR pass, maybe it's worth trying adaptive edge detection threshold based on exposure value |
Probably no, at least I don't see how it can be better. Right now input is pre-exposed, close to [0..1] range. Maybe raise |
Seems that there's no need to raise threshold, I just took a look at nsight trace and it's even faster than fxaa |
Merged, thanks! |
-Integration implemented with different quality modes (higher mode means lower edge detection threshold)
-Tonemaping switches to compute when CMAA2 is on in order to avoid additional copy from RT resource to UAV.
-Created additional HDR luminance resource, but currently it's not used, since there is a problem, that in some locations HDR range is so small, that almost no edges are detected even on max quality preset.