Add viewer watchdog process to notice crashes and report. #3498

nat-goodspeed · 2025-02-03T20:14:01Z

replaces #3496 : we need the branch to be named "project/watchdog" so the PR will build.

`LLProcess`, being based on APR process management, depends on polling. Until now, `LLProcess` has listened for per-frame events on the "mainloop" `LLEventPump` to poll child processes. The problem with that is that `LLEventPumps` is cleaned up long before ultimate viewer shutdown. To remain in touch with the watchdog child process until quite close to viewer termination, we need a lower-level callback facility: `LLCallbackList`. Defend `LLCallbackList` against re-entrant `addFunction()` or `deleteFunction()` method calls while `callFunctions()` is traversing the list. Also make `LLProcess` more robust wrt reading final output from child-process pipes when the child terminates. llprocess_test.cpp and llleap_test.cpp must now pump `LLCallbackList` instead of the "mainloop" `LLEventPump`.

even when gIdleCallbacks isn't being called regularly. Make LLProcessListener::tick() forward control to every BasePipe::tick() method, instead of registering each individually on gIdleCallbacks. This allows LLProcess::tick() to service all related machinery without reference to gIdleCallbacks. That could be important if gIdleCallbacks still contains callbacks at shutdown time that are no longer viable.

That causes trouble if gIdleCallbacks is destroyed before LLProcessListener.

This version only notices viewer termination and distinguishes normal termination from a crash. It also opens a watchdog.log in the viewer's logs directory so we can tell. So far it logs successful termination as well as detected crashes, but we'll eventually remove success-case logging. Add an `LLProcessPtr mWatchdog` to `LLAppViewer`. Early in `LLAppViewer()`, launch the watchdog process, passing it pipes for its stdin, stdout, stderr. Of the three, the one that matters most is stdin: it's essential that the watchdog receive a pipe from the viewer. That pipe will be closed by the OS on viewer termination, one way or another. Then in `~LLAppViewer()`, if the watchdog has been started and is still running, write "OK" to it to reassure it that the viewer is terminating normally. Make viewer_manifest.py copy watchdog.py into the built viewer image. Introduce Python's `pathlib.Path` into viewer_manifest.py, and defend llmanifest.py against `Path` instances where it expects strings. Clean up some redundancy in viewer_manifest.py.

This deals uniformly with string and Path values.

Also, if we see stderr from watchdog.py, trim trailing newline.

Include new indra/watchdog/CMakeLists.txt, that builds `watchdog` executable, as part of CMake configuration. Make secondlife-bin depend on `watchdog_exe`. Instead of copying `watchdog.py`, make viewer_manifest.py copy the `watchdog` executable. Similarly, instead of running `python watchdog.py`, make `LLAppViewer` directly run the `watchdog` executable. Install PyInstaller as part of the Python dependencies on the GH runner.

This should be useful for any file copy command that differs between platforms only in whether or not the executable ends with ".exe". Also introduce lib_sfx for similar reasons.

Invoke the experimental project/watchdog branch of viewer-build-util, each of whose build-pkg-{windows,mac}/action.yaml scripts now accepts parameters specifying the list of files to sign. Add the watchdog executable for each platform.

The copied logic replaced: ``` cd _artifacts tar -xJf ... ``` with the one-liner: ``` tar -xJf ... -C _artifacts ``` The trouble is that `tar`'s `-C` switch is order-sensitive: it affects the options that *follow* it on the command line.

It needs the "hardened runtime" option.

Somehow the viewer_channel string 'Second Life Project Watchdog' is getting munged between the build job and the post-windows-symbols job to 'Second Life Project watchdog'. Trying to use that viewer_channel to untar the downloaded 'Second Life Project Watchdog.sym.tar.xz' fails.

This is to disambiguate reports sent multiple times for the same viewer run, for crashes without a session ID. Also add `ClientInfo/Channel` key to static_debug_info.log. This is identical to `ClientInfo/Name`, but it might not be obvious to consumers of this data packet that `Name` here means viewer channel. Nonetheless we leave `Name` in place because it's likely that some existing analysis _does_ consult `Name`. Also add `ClientInfo/Version` key. `ClientInfo` already contains `MajorVersion`, `MinorVersion`, `PatchVersion` and `BuildVersion`, but why make the consumer concatenate?

Add to the viewer's logs directory a crashes.json file containing a JSON array of previous crash reports. If that file exists when the watchdog starts up, try to send its contents right away. Stop logging successful viewer termination. On detecting viewer crash, try to read static_debug_info.log as the basis of the new crash report blob. If we can find SecondLife.log, read its creation time, which we set as "Date", the start of the viewer run. Subtract from current time to get "Duration". Adapt logic from viewer-manager aka SLVersionChecker to get a hash of some sort for "MachineID", to group crash reports from the same computer. If we had trouble with any of the above, also record those "Errors". The new crash blob consists of an array of previous crashes plus the current report. Update crashes.json before attempting to send. If the send succeeded, delete crashes.json again. So far the "send" operation merely logs the JSON array of crashes.

As previously expressed, powershell perplexingly echoed its -Command argument to stdout, instead of executing it.

For a non-BugSplat build, the static_debug_info.log was being written to a dump-UUID subdirectory of the main logs directory. Keep that particular file in the main logs directory so the watchdog process can find it.

Turns out that on Windows, using st_birthtime or st_ctime gets you a timestamp going back to the first viewer run on that filesystem. Use st_mtime, which is current. But use st_mtime of SecondLife.start_marker instead of SecondLife.log: the latter's st_mtime will be at the end of the run, not the beginning. SecondLife.start_marker exists to record the beginning of a run.

We don't want watchdog.log to grow without bound.

nat-goodspeed added 14 commits January 29, 2025 17:10

Add a comment with lldb magic to print callstack for crashing test.

4666872

Don't unregister LLProcessListener in its destructor.

000fba3

That causes trouble if gIdleCallbacks is destroyed before LLProcessListener.

Fix irksome Python syntax warning on yet another branch.

4ffd81f

Avoid blinkin MS global macros stdin, stderr.

6d09281

Use len(Path.parents) instead of str.count('/') for path depth.

0e04871

This deals uniformly with string and Path values.

Find watchdog.py even running in a Windows dev area.

dde6b25

Also, if we see stderr from watchdog.py, trim trailing newline.

Ditch watchdog.py DOS box on Windows.

b5e9304

Introduce ViewerManifest.exe_sfx and use for watchdog executable.

50c426e

This should be useful for any file copy command that differs between platforms only in whether or not the executable ends with ".exe". Also introduce lib_sfx for similar reasons.

Fix a couple comments in llappviewer.cpp.

8418fe4

nat-goodspeed requested a review from vir-linden February 3, 2025 20:14

github-actions bot added c/cpp cmake python llcommon labels Feb 3, 2025

nat-goodspeed mentioned this pull request Feb 3, 2025

Add viewer watchdog process to notice crashes and report. #3496

Closed

nat-goodspeed added 10 commits February 4, 2025 07:59

Fix copy error bringing post-bugsplat-windows/action.yaml inline.

307004c

Get diagnostic output to solve failure to post Windows symbols

bac0d88

Count Mac watchdog executable with apps, not dylibs.

971890a

It needs the "hardened runtime" option.

Finish first pass at adapting Windows machine ID logic.

aae7243

Fix trailing whitespace.

f38c941

Try still harder to fix post-windows-symbols.

fcf919d

nat-goodspeed added 6 commits February 5, 2025 14:06

Restore original post-windows-symbols logic.

7a218b5

Don't overquote the powershell -Command argument.

983c52b

As previously expressed, powershell perplexingly echoed its -Command argument to stdout, instead of executing it.

Always put static_debug_info.log in main logs directory.

68c0d2e

For a non-BugSplat build, the static_debug_info.log was being written to a dump-UUID subdirectory of the main logs directory. Keep that particular file in the main logs directory so the watchdog process can find it.

Change watchdog process to a rotating log handler.

d70268a

We don't want watchdog.log to grow without bound.

Send Duration as integer seconds.

3c2d1de

bennettgoble closed this Feb 18, 2025

github-actions bot locked and limited conversation to collaborators Feb 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add viewer watchdog process to notice crashes and report. #3498

Add viewer watchdog process to notice crashes and report. #3498

nat-goodspeed commented Feb 3, 2025

Add viewer watchdog process to notice crashes and report. #3498

Add viewer watchdog process to notice crashes and report. #3498

Conversation

nat-goodspeed commented Feb 3, 2025