Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in Monarch_::GetPID #12774

Closed
zadjii-msft opened this issue Mar 28, 2022 · 6 comments · Fixed by #12825 or #12856
Closed

Crash in Monarch_::GetPID #12774

zadjii-msft opened this issue Mar 28, 2022 · 6 comments · Fixed by #12825 or #12856
Assignees
Labels
Area-Remoting Communication layer between windows. Often for windowing behavior, quake mode, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-1 A description (P1) Product-Terminal The new Windows Terminal. Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. zAskModeBug zInbox-Bug Ignore me!

Comments

@zadjii-msft
Copy link
Member

This one's wild, because it don't make NO sense.

MSFT:38540483

Some verbatim Team notes, because searching teams is impossible and I won't be able to find this again

okay so second dump from bottom

0:000> lmDvmWindowsTerminal
Browse full module list
start             end                 module name
00007ff6`75e00000 00007ff6`75e75000   WindowsTerminal   (no symbols)           
    Loaded symbol image file: WindowsTerminal.exe
    Mapped memory image file: C:\ProgramData\Dbg\sym\WindowsTerminal.exe\622F96A975000\WindowsTerminal.exe
    Image path: C:\Program Files\WindowsApps\Microsoft.WindowsTerminal_1.12.10733.0_x64__8wekyb3d8bbwe\WindowsTerminal.exe
    Image name: WindowsTerminal.exe
    Browse all global symbols  functions  data
    Timestamp:        Mon Mar 14 14:25:29 2022 (622F96A9)
    CheckSum:         00071A4E
    ImageSize:        00075000
    File version:     1.12.2203.14003
    Product version:  1.12.2203.14003
    File flags:       0 (Mask 3F)
    File OS:          4 Unknown Win32
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0000.04b0
    Information from resource tables:
0:000> lmDvmMicrosoft_Terminal_Remoting
Browse full module list
start             end                 module name
00007ffe`ece80000 00007ffe`ecec3000   Microsoft_Terminal_Remoting   (private pdb symbols)  C:\ProgramData\Dbg\sym\Microsoft.Terminal.Remoting.pdb\A3DE2F1865ED46C490034B3D23F05EAE1\Microsoft.Terminal.Remoting.pdb
    Loaded symbol image file: Microsoft.Terminal.Remoting.dll
    Mapped memory image file: C:\ProgramData\Dbg\sym\Microsoft.Terminal.Remoting.dll\622F940D43000\Microsoft.Terminal.Remoting.dll
    Image path: C:\Program Files\WindowsApps\Microsoft.WindowsTerminal_1.12.10733.0_x64__8wekyb3d8bbwe\Microsoft.Terminal.Remoting.dll
    Image name: Microsoft.Terminal.Remoting.dll
    Browse all global symbols  functions  data
    Timestamp:        Mon Mar 14 14:14:21 2022 (622F940D)
    CheckSum:         00047923
    ImageSize:        00043000
    File version:     1.12.2203.14003
    Product version:  1.12.2203.14003
    File flags:       0 (Mask 3F)
    File OS:          4 Unknown Win32
    File type:        2.0 Dll
    File date:        00000000.00000000
    Translations:     0000.04b0
    Information from resource tables:

in WindowManager::_createMonarchAndCallbacks, _monarch isn't null in WindowManager::WindowManager, there's a try/catch(...) around _createMonarchAndCallbacks(). So even if that did A/V in GetPID, it should be caught (then try again)

modules look like the same version

hmmmm

but you can't catch a null pointer dereference

i was wondering if it was a torn version thing insofar as the monarch is running one version and the new executable is a different one

and it tried to create a monarch by clsid, got the old one, failed to talk to it

but that wouldn't be a A/V, would it?

would it

thats where i get scared. here's why

In one dump, i went up a few stack frames?

_monarch->m_ptr was NOT null

yea i don't see any null monarchs

and then i got to wondering

if cppwinrt-Monarch internally QIs for IMonarch and gets null, will it blow up if it hits that line

the autogenerated code doesn't show up in the debugger

#define WINRT_IMPL_SHIM(...) (*(abi_t<__VA_ARGS__>**)&static_cast<__VA_ARGS__ const&>(static_cast<D const&>(*this)))
check_hresult(WINRT_IMPL_SHIM(winrt::Microsoft::Terminal::Remoting::IMonarch)->GetPID(&result));(*(abi_t<winrt::Microsoft::Terminal::Remoting::IMonarch>**)&static_cast<winrt::Microsoft::Terminal::Remoting::IMonarch const&>(static_cast<D const&>(*this)))D is winrt::Microsoft::Terminal::Remoting::Monarchso
(*(abi_t<winrt::Microsoft::Terminal::Remoting::IMonarch>**)&static_cast<winrt::Microsoft::Terminal::Remoting::IMonarch const&>(static_cast<winrt::Microsoft::Terminal::Remoting::Monarch const&>(*this)))
(
    *(abi_t<winrt::Microsoft::Terminal::Remoting::IMonarch>**)
    &static_cast<winrt::Microsoft::Terminal::Remoting::IMonarch const&>(
        static_cast<winrt::Microsoft::Terminal::Remoting::Monarch const&>(*this)
    )
)

so maybe the old one wasn't a IMonarch

but still this would go away once folks are settled

from a cold start this shouldn't be hittable, only in the middle of an update, if that's really the truth of the matter

could we... test this?

that's my understanding

i don't know how to get it to partially stage an update

that's my big issue

that would help us test the FONT thing, the DLL thing, and now this

OH WAIT

OH OH WAIT

launch a dev build of the Terminal that doesn't have the IMonarch change, then swap out Remoting.dll/winmd and do it again?

or just

does unpackaged glom with packaged?

does unpackaged glom with unpackaged?

.....................................

if so, yeah, just unpack 10732 and 10532

unpackaged gloms to, well

hm

gloms to packaged?

so an unpackaged 10532 talking with a packaged 10732

other way perhaps

since 10732 talks IMonarchese

right

okay I can test that (in the morning)

I had 5 minutes spare

so I tested it

In the torn version state, it's much worse than I expected! It logs E_NOINTERFACE thousands and thousands of time in a tight infinite loop

@zadjii-msft zadjii-msft added Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Terminal The new Windows Terminal. Priority-1 A description (P1) Area-Remoting Communication layer between windows. Often for windowing behavior, quake mode, etc. zInbox-Bug Ignore me! labels Mar 28, 2022
@zadjii-msft zadjii-msft added this to the Terminal v1.14 milestone Mar 28, 2022
@ghost ghost added the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Mar 28, 2022
@ghost ghost added the In-PR This issue has a related PR label Apr 4, 2022
@ghost ghost closed this as completed in #12825 Apr 4, 2022
ghost pushed a commit that referenced this issue Apr 4, 2022
This is a crazy idea Dustin and I had.

> we can't repro this at will. But we kinda have an idea of where the deref is. We don't know if the small patch (throw, and try again) will fix it. We're sure that the "just fall back to an isolated monarch" will work. I'd almost rather take a build testing the small patch first, to see if that works

> This might seem crazy
> in 1.12, isolated monarch. In 1.13, "small patch". In 1.14, we can wait and see

I can write more details in the morning. It's 5pm here so if we want this today, here it is.

@DHowett double check my velocity flag logic here. Should be always true for Release, and off for Dev, Preview. 

* [x] closes #12774
@ghost ghost added Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. and removed In-PR This issue has a related PR labels Apr 4, 2022
DHowett pushed a commit that referenced this issue Apr 4, 2022
This is a crazy idea Dustin and I had.

> we can't repro this at will. But we kinda have an idea of where the deref is. We don't know if the small patch (throw, and try again) will fix it. We're sure that the "just fall back to an isolated monarch" will work. I'd almost rather take a build testing the small patch first, to see if that works

> This might seem crazy
> in 1.12, isolated monarch. In 1.13, "small patch". In 1.14, we can wait and see

I can write more details in the morning. It's 5pm here so if we want this today, here it is.

@DHowett double check my velocity flag logic here. Should be always true for Release, and off for Dev, Preview.

* [x] closes #12774

(cherry picked from commit 446f280)
Service-Card-Id: 80154585
Service-Version: 1.13
DHowett pushed a commit that referenced this issue Apr 4, 2022
This is a crazy idea Dustin and I had.

> we can't repro this at will. But we kinda have an idea of where the deref is. We don't know if the small patch (throw, and try again) will fix it. We're sure that the "just fall back to an isolated monarch" will work. I'd almost rather take a build testing the small patch first, to see if that works

> This might seem crazy
> in 1.12, isolated monarch. In 1.13, "small patch". In 1.14, we can wait and see

I can write more details in the morning. It's 5pm here so if we want this today, here it is.

@DHowett double check my velocity flag logic here. Should be always true for Release, and off for Dev, Preview.

* [x] closes #12774

(cherry picked from commit 446f280)
Service-Card-Id: 80154584
Service-Version: 1.12
@DHowett
Copy link
Member

DHowett commented Apr 8, 2022

Okay, so this isn't your run-of-the-mill failure. Here's what I've learned.

  1. We are successfully getting a COM proxy object for _monarch -- therefore, it is not null
  2. The null pointer AV is coming from (effectively) nullptr->GetPID().
  3. _monarch is holding an IUnknown that points at the proxy.
  4. When we call _monarch.GetPID(), C++/WinRT does two things (effectively¹):
    1. QueryInterface(IID_PPV_ARGS(&pMonarchAsIMonarch))
    2. pMonarch->GetPID()
  5. pMonarch is the nullptr from step 2

The proxy object is failing to QI for IMonarch. This really does suggest that this is an issue where the version of Terminal that is trying to launch (1) doesn't agree on what the IID for IMonarch is with the running version, or (2) has gone terribly wrong somewhere in COM land.

Checking whether _monarch itself is populated will never tell us if it's a live object. We could try to try_as IMonarch ourselves and check that instead. However, it looks like this crash usually happens in defterm and causes a fallback to conhost. 🤷🏻

What I can't figure out is what the heck the server is doing when it gets this call. We clearly have a handle to a remote object, but I have no idea who that object is or why it doesn't QI. We can trivially reproduce this crash by having a custom COM server simply camp out with a class factory object on the Monarch's CLSID.


¹ This comes in through an automatic try_as²
² Actually, this makes every Monarch call twice as expensive -- it QIs (over RPC) for IMonarch and then calls the function (over RPC again.)³
³ This is partially because the default interface for the Monarch runtimeclass is actually IMonarch2, which is an empty interface that does nothing⁴ but fill space/have a GUID.
⁴ See that [default_interface] on runtimeclass Monarch? If we omitted that, IMonarch would be the default interface. Because we specified it, MIDL3 generates an empty interface to be the default interface.

We can ameliorate footnote 4 by removing default_interface. We can address footnote 2 by using the IMonarch projection internally. Idly, I am wondering if doing so would move the failure up into _createMonarch where we could understand it, or work around it.

@DHowett DHowett reopened this Apr 8, 2022
@zadjii-msft
Copy link
Member Author

We can ameliorate footnote 4 by removing default_interface.

We should probably do that regardless. We don't really need that since this isn't an empty class.

We can address footnote 2 by using the IMonarch projection internally

That's easy enough


Of course, if this gets another week deeper into ask mode, we could always just yank IMonarch and the test that depends on it entirely. One fewer test is not the end of the world.

@zadjii-msft zadjii-msft self-assigned this Apr 8, 2022
@zadjii-msft zadjii-msft removed Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Apr 8, 2022
zadjii-msft added a commit that referenced this issue Apr 8, 2022
This is all of course, conjecture. This crash is totally wild and makes no sense at all. But, we're hoping that this fixes it. This should also make calls to the Monarch a little easier.

You may be asking yourself - why aren't I doing this for the Peasant too? Well, because the Peasant simply doesn't crash like the monarch does. I'm not gonna touch something that's not broken _during ask mode_.

* [x] Closes #12774, hopefully.
* [x] tests pass.
@ghost ghost added the In-PR This issue has a related PR label Apr 8, 2022
@ghost ghost added Needs-Tag-Fix Doesn't match tag requirements Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. and removed In-PR This issue has a related PR labels Apr 8, 2022
DHowett pushed a commit that referenced this issue Apr 8, 2022
This is all of course, conjecture. This crash is totally wild and makes no sense at all. But, we're hoping that this fixes it. This should also make calls to the Monarch a little easier.

You may be asking yourself - why aren't I doing this for the Peasant too? Well, because the Peasant simply doesn't crash like the monarch does. I'm not gonna touch something that's not broken _during ask mode_.

References #12774. We can close the bug if it is verified fixed.
DHowett pushed a commit that referenced this issue Apr 8, 2022
This is all of course, conjecture. This crash is totally wild and makes no sense at all. But, we're hoping that this fixes it. This should also make calls to the Monarch a little easier.

You may be asking yourself - why aren't I doing this for the Peasant too? Well, because the Peasant simply doesn't crash like the monarch does. I'm not gonna touch something that's not broken _during ask mode_.

References #12774. We can close the bug if it is verified fixed.

(cherry picked from commit b64fd77)
Service-Card-Id: 80383091
Service-Version: 1.13
DHowett pushed a commit that referenced this issue Apr 8, 2022
This is all of course, conjecture. This crash is totally wild and makes no sense at all. But, we're hoping that this fixes it. This should also make calls to the Monarch a little easier.

You may be asking yourself - why aren't I doing this for the Peasant too? Well, because the Peasant simply doesn't crash like the monarch does. I'm not gonna touch something that's not broken _during ask mode_.

References #12774. We can close the bug if it is verified fixed.

(cherry picked from commit b64fd77)
Service-Card-Id: 80383092
Service-Version: 1.12
@ghost
Copy link

ghost commented Apr 19, 2022

🎉This issue was addressed in #12825, which has now been successfully released as Windows Terminal v1.12.1098.:tada:

Handy links:

@ghost
Copy link

ghost commented Apr 19, 2022

🎉This issue was addressed in #12856, which has now been successfully released as Windows Terminal v1.12.1098.:tada:

Handy links:

@ghost
Copy link

ghost commented Apr 19, 2022

🎉This issue was addressed in #12825, which has now been successfully released as Windows Terminal Preview v1.13.1098.:tada:

Handy links:

@ghost
Copy link

ghost commented Apr 19, 2022

🎉This issue was addressed in #12856, which has now been successfully released as Windows Terminal Preview v1.13.1098.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Remoting Communication layer between windows. Often for windowing behavior, quake mode, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-1 A description (P1) Product-Terminal The new Windows Terminal. Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. zAskModeBug zInbox-Bug Ignore me!
Projects
Status: No status
2 participants