Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add viewer watchdog process to notice crashes and report. #3498

Closed
wants to merge 30 commits into from

Conversation

nat-goodspeed
Copy link
Contributor

fixes #3426

replaces #3496 : we need the branch to be named "project/watchdog" so the PR will build.

`LLProcess`, being based on APR process management, depends on polling. Until
now, `LLProcess` has listened for per-frame events on the "mainloop" `LLEventPump`
to poll child processes.

The problem with that is that `LLEventPumps` is cleaned up long before
ultimate viewer shutdown. To remain in touch with the watchdog child process
until quite close to viewer termination, we need a lower-level callback
facility: `LLCallbackList`.

Defend `LLCallbackList` against re-entrant `addFunction()` or `deleteFunction()`
method calls while `callFunctions()` is traversing the list.

Also make `LLProcess` more robust wrt reading final output from child-process
pipes when the child terminates.

llprocess_test.cpp and llleap_test.cpp must now pump `LLCallbackList` instead
of the "mainloop" `LLEventPump`.
even when gIdleCallbacks isn't being called regularly.

Make LLProcessListener::tick() forward control to every BasePipe::tick()
method, instead of registering each individually on gIdleCallbacks. This
allows LLProcess::tick() to service all related machinery without reference to
gIdleCallbacks. That could be important if gIdleCallbacks still contains
callbacks at shutdown time that are no longer viable.
That causes trouble if gIdleCallbacks is destroyed before LLProcessListener.
This version only notices viewer termination and distinguishes normal
termination from a crash. It also opens a watchdog.log in the viewer's logs
directory so we can tell. So far it logs successful termination as well as
detected crashes, but we'll eventually remove success-case logging.

Add an `LLProcessPtr mWatchdog` to `LLAppViewer`. Early in `LLAppViewer()`,
launch the watchdog process, passing it pipes for its stdin, stdout, stderr.
Of the three, the one that matters most is stdin: it's essential that the
watchdog receive a pipe from the viewer. That pipe will be closed by the OS on
viewer termination, one way or another.

Then in `~LLAppViewer()`, if the watchdog has been started and is still
running, write "OK" to it to reassure it that the viewer is terminating
normally.

Make viewer_manifest.py copy watchdog.py into the built viewer image.
Introduce Python's `pathlib.Path` into viewer_manifest.py, and defend
llmanifest.py against `Path` instances where it expects strings.
Clean up some redundancy in viewer_manifest.py.
This deals uniformly with string and Path values.
Also, if we see stderr from watchdog.py, trim trailing newline.
Include new indra/watchdog/CMakeLists.txt, that builds `watchdog` executable,
as part of CMake configuration. Make secondlife-bin depend on `watchdog_exe`.

Instead of copying `watchdog.py`, make viewer_manifest.py copy the `watchdog`
executable. Similarly, instead of running `python watchdog.py`, make
`LLAppViewer` directly run the `watchdog` executable.

Install PyInstaller as part of the Python dependencies on the GH runner.
This should be useful for any file copy command that differs between platforms
only in whether or not the executable ends with ".exe".

Also introduce lib_sfx for similar reasons.
Invoke the experimental project/watchdog branch of viewer-build-util, each of
whose build-pkg-{windows,mac}/action.yaml scripts now accepts parameters
specifying the list of files to sign. Add the watchdog executable for each
platform.
The copied logic replaced:
```
cd _artifacts
tar -xJf ...
```
with the one-liner:
```
tar -xJf ... -C _artifacts
```
The trouble is that `tar`'s `-C` switch is order-sensitive: it affects the
options that *follow* it on the command line.
It needs the "hardened runtime" option.
Somehow the viewer_channel string 'Second Life Project Watchdog' is getting
munged between the build job and the post-windows-symbols job to
'Second Life Project watchdog'. Trying to use that viewer_channel to untar the
downloaded 'Second Life Project Watchdog.sym.tar.xz' fails.
This is to disambiguate reports sent multiple times for the same viewer run,
for crashes without a session ID.

Also add `ClientInfo/Channel` key to static_debug_info.log. This is identical
to `ClientInfo/Name`, but it might not be obvious to consumers of this data
packet that `Name` here means viewer channel. Nonetheless we leave `Name` in
place because it's likely that some existing analysis _does_ consult `Name`.

Also add `ClientInfo/Version` key. `ClientInfo` already contains
`MajorVersion`, `MinorVersion`, `PatchVersion` and `BuildVersion`, but why
make the consumer concatenate?
Add to the viewer's logs directory a crashes.json file containing a JSON array
of previous crash reports. If that file exists when the watchdog starts up,
try to send its contents right away.

Stop logging successful viewer termination.

On detecting viewer crash, try to read static_debug_info.log as the basis of
the new crash report blob. If we can find SecondLife.log, read its creation
time, which we set as "Date", the start of the viewer run. Subtract from current
time to get "Duration".

Adapt logic from viewer-manager aka SLVersionChecker to get a hash of some
sort for "MachineID", to group crash reports from the same computer.

If we had trouble with any of the above, also record those "Errors".

The new crash blob consists of an array of previous crashes plus the current
report. Update crashes.json before attempting to send. If the send succeeded,
delete crashes.json again.

So far the "send" operation merely logs the JSON array of crashes.
As previously expressed, powershell perplexingly echoed its -Command argument
to stdout, instead of executing it.
For a non-BugSplat build, the static_debug_info.log was being written to a
dump-UUID subdirectory of the main logs directory. Keep that particular file
in the main logs directory so the watchdog process can find it.
Turns out that on Windows, using st_birthtime or st_ctime gets you a
timestamp going back to the first viewer run on that filesystem. Use st_mtime,
which is current. But use st_mtime of SecondLife.start_marker instead of
SecondLife.log: the latter's st_mtime will be at the end of the run, not the
beginning. SecondLife.start_marker exists to record the beginning of a run.
We don't want watchdog.log to grow without bound.
@github-actions github-actions bot locked and limited conversation to collaborators Feb 18, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants