-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure predicate cache is reset when control flow leaves block #4274
Conversation
To be fair, this is not my favourite fix because it risks missing places where the cache should be reset but isn't and it will be a pain to debug cases that fail if you don't keep this issue in the back of your mind. But the predicate cache is only valid during IR generation so we cannot use any of the backend stuff like Spill functions. |
38e7ebb
to
6045a0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting way to fix it, but I'm fine with it.
@@ -164,6 +168,9 @@ void OpDispatchBuilder::FADD(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispa | |||
if (Op->Src[0].IsNone()) { // Implicit argument case | |||
auto Offset = Op->OP & 7; | |||
auto St0 = 0; | |||
if (!ReducedPrecisionMode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand why this check is here but I don't quite get it, won't this code only run for full precision anyway? I would have thought reduced would take
void OpDispatchBuilder::FADDF64(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispatchBuilder::OpResult ResInST0) { |
Also just thinking through perhaps it would be cleaner to do this in FlushRegisterCache (!SRAOnly case) and to save the predicate when spilling/filling regs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - that's a great point. My intention was to merge (FADD
and FADDF64
) those and I didn't but thought I did.
I did attempt to do this through FlushRegisterCache
but this one is called at _Break()
. This seems to be called by INTOp
and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache()
alongside FlushRegisterCache
mostly everywhere but not on Break()
.
I am happy to hear suggestions on how to improve this if you have any ideas. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in FlushRegisterCache
then there was no need to save the predicate in spilling/filling regs. The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single ptrue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in
FlushRegisterCache
then there was no need to save the predicate in spilling/filling regs.
Hehe, yeah this is true :)
The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single
ptrue
.
Right yeah, I missed that we'd have to save all the predicate registers since there's no way to know which have changed - though looking through only one predicate type is ever cached, so the cache could be replaced with statically allocating a single extra predicate register. Then that is always spilled, even if a few such registers are needed in the future I think the cost would be negligible.
I did have some ideas of having the predicate register set to spill be a required argument for Spill/FillSRA and to pass that in as an argument to IROps that end up calling them, but this is definitely beyond what is necessary here
I did attempt to do this through FlushRegisterCache but this one is called at _Break(). This seems to be called by INTOp and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache() alongside FlushRegisterCache mostly everywhere but not on Break().
I'm quite surprised break/INT is being called that often in x87 code, what ends up causing this? If that can be figured then clearing the predicate cache for all !JITDispatch ops in the IR json (I think this is what you are trying to achieve by manually annotating x87 ops? Correct me if I'm wrong) could work, there's also cases like cpuid, ludiv etc that I think would need to clear the cache but don't currently. I personally prefer the former solution I mentioned since it's nice to have the guarantee that spilling and filling registers in a handler won't cause issues but will leave the choice up to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was looking at what's causing INTOp to be called and I am a little bit mistified myself about this. Take the test:
%ifdef CONFIG
{
"RegData": {
"MM7": ["0x8000000000000000", "0x4000"]
}
}
%endif
lea rdx, [rel data]
fld tword [rdx + 8 * 0]
lea rdx, [rel data2]
lea rax, [rdx + 8 * 0]
fstp tword [rax]
fld tword [rdx + 8 * 0]
hlt
align 8
data:
dt 2.0
dq 0
data2:
dt 0.0
dq 0
This calls fld twice (with fstp in-between), you should only need to reset the predicate cache at the beginning of the block. However, here if you call the reset function from FlushRegisterCache, you end up resetting it a couple of times due to INTOp calling it. I have not yet figured out why it's being called but it's related to threading. It comes from Core.cpp:668
:
std::invoke(Fn, Thread->OpDispatcher, DecodedInfo);
The backtrace from ResetInitPredicateCache shows:
Breakpoint 1, FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
77 InitPredicateCache.clear();
(gdb) bt
#0 FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
#1 0x0000005555723324 in FEXCore::IR::OpDispatchBuilder::FlushRegisterCache (this=0x7ff6ef7000, SRAOnly=false)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.h:1179
#2 0x0000005555794a34 in FEXCore::IR::OpDispatchBuilder::INTOp (this=0x7ff6ef7000, Op=0x7fe2e00300)
at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.cpp:4693
#3 0x000000555573eb6c in std::__invoke_impl<void, void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__f=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>, __t=...,
__args=@0x7fffffdc20: 0x7fe2e00300) at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:74
#4 0x000000555573eacc in std::__invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>,
__args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:96
#5 0x0000005555723eb8 in std::invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
__fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>,
__args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/functional:120
#6 0x000000555572247c in FEXCore::Context::ContextImpl::GenerateIR (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536,
ExtendedDebugInfo=false, MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:668
#7 0x00000055557248f4 in FEXCore::Context::ContextImpl::CompileCode (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536,
MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:801
#8 0x0000005555724e84 in FEXCore::Context::ContextImpl::CompileBlock (this=0x7ff6e57000, Frame=0x7ff6ee3090, GuestRIP=65536,
MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:846
Do you understand what's happening? I will keep trying to understand this in any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly marked VPCMPESTRX for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ufff, but we probably shouldn't do it for all !JITDispatch instructions right? Some of them could potentially just lower to other instructions that don't need to reset the cache
Yeah right this is just an issue in general, I think in some sense if an instruction is always lowered into something else it should be marked differently to one that uses a fallback. (No need to fix this though, see below)
Whole this work is to enable a relatively small optimization (see #4166 (comment)) and I am wondering if at some point we might have to say that it's better to go without it.
Yeah - this is why I suggested just doing it statically, it's so awkward with all the edge cases when statically allocating the pred reg would be a 10 or so line change and avoid having to deal with any of this and solve the original issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - this is why I suggested just doing it statically, it's so awkward with all the edge cases when statically allocating the pred reg would be a 10 or so line change and avoid having to deal with any of this and solve the original issue
Yep! OK, lets take a step back here and let me refactor this for a statically allocated predicate register solution.
Thanks for your patience and help on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference do see loadpf, you basically want to copy that exactly (Inc ra changes) but taking in predicate init params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Will do that.
I will have to travel today but I will work on this through the weekend.
Marking this as draft to avoid merging by accident- need to think about what @bylaws said. |
6d46c75
to
5e0886e
Compare
fa9ff97
to
c8594ba
Compare
Whenever the control float leaves the block, it might clobber the predicate register so we reset the cache whenever that happens. The difficulty here is that the cache is valid only during IR generation so we need to make sure we catch all the cases during this pass where the execution might leave the block. Fixes FEX-Emu#4264
c8594ba
to
9aea249
Compare
Closing in favour of #4292 |
Whenever the control float leaves the block, it might clobber the
predicate register so we reset the cache whenever that happens.
The difficulty here is that the cache is valid only during IR generation
so we need to make sure we catch all the cases during this pass where
the execution might leave the block.
Fixes #4264