-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add viewer watchdog process to notice crashes and report. #3498
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
`LLProcess`, being based on APR process management, depends on polling. Until now, `LLProcess` has listened for per-frame events on the "mainloop" `LLEventPump` to poll child processes. The problem with that is that `LLEventPumps` is cleaned up long before ultimate viewer shutdown. To remain in touch with the watchdog child process until quite close to viewer termination, we need a lower-level callback facility: `LLCallbackList`. Defend `LLCallbackList` against re-entrant `addFunction()` or `deleteFunction()` method calls while `callFunctions()` is traversing the list. Also make `LLProcess` more robust wrt reading final output from child-process pipes when the child terminates. llprocess_test.cpp and llleap_test.cpp must now pump `LLCallbackList` instead of the "mainloop" `LLEventPump`.
even when gIdleCallbacks isn't being called regularly. Make LLProcessListener::tick() forward control to every BasePipe::tick() method, instead of registering each individually on gIdleCallbacks. This allows LLProcess::tick() to service all related machinery without reference to gIdleCallbacks. That could be important if gIdleCallbacks still contains callbacks at shutdown time that are no longer viable.
That causes trouble if gIdleCallbacks is destroyed before LLProcessListener.
This version only notices viewer termination and distinguishes normal termination from a crash. It also opens a watchdog.log in the viewer's logs directory so we can tell. So far it logs successful termination as well as detected crashes, but we'll eventually remove success-case logging. Add an `LLProcessPtr mWatchdog` to `LLAppViewer`. Early in `LLAppViewer()`, launch the watchdog process, passing it pipes for its stdin, stdout, stderr. Of the three, the one that matters most is stdin: it's essential that the watchdog receive a pipe from the viewer. That pipe will be closed by the OS on viewer termination, one way or another. Then in `~LLAppViewer()`, if the watchdog has been started and is still running, write "OK" to it to reassure it that the viewer is terminating normally. Make viewer_manifest.py copy watchdog.py into the built viewer image. Introduce Python's `pathlib.Path` into viewer_manifest.py, and defend llmanifest.py against `Path` instances where it expects strings. Clean up some redundancy in viewer_manifest.py.
This deals uniformly with string and Path values.
Also, if we see stderr from watchdog.py, trim trailing newline.
Include new indra/watchdog/CMakeLists.txt, that builds `watchdog` executable, as part of CMake configuration. Make secondlife-bin depend on `watchdog_exe`. Instead of copying `watchdog.py`, make viewer_manifest.py copy the `watchdog` executable. Similarly, instead of running `python watchdog.py`, make `LLAppViewer` directly run the `watchdog` executable. Install PyInstaller as part of the Python dependencies on the GH runner.
This should be useful for any file copy command that differs between platforms only in whether or not the executable ends with ".exe". Also introduce lib_sfx for similar reasons.
Invoke the experimental project/watchdog branch of viewer-build-util, each of whose build-pkg-{windows,mac}/action.yaml scripts now accepts parameters specifying the list of files to sign. Add the watchdog executable for each platform.
The copied logic replaced: ``` cd _artifacts tar -xJf ... ``` with the one-liner: ``` tar -xJf ... -C _artifacts ``` The trouble is that `tar`'s `-C` switch is order-sensitive: it affects the options that *follow* it on the command line.
It needs the "hardened runtime" option.
Somehow the viewer_channel string 'Second Life Project Watchdog' is getting munged between the build job and the post-windows-symbols job to 'Second Life Project watchdog'. Trying to use that viewer_channel to untar the downloaded 'Second Life Project Watchdog.sym.tar.xz' fails.
This is to disambiguate reports sent multiple times for the same viewer run, for crashes without a session ID. Also add `ClientInfo/Channel` key to static_debug_info.log. This is identical to `ClientInfo/Name`, but it might not be obvious to consumers of this data packet that `Name` here means viewer channel. Nonetheless we leave `Name` in place because it's likely that some existing analysis _does_ consult `Name`. Also add `ClientInfo/Version` key. `ClientInfo` already contains `MajorVersion`, `MinorVersion`, `PatchVersion` and `BuildVersion`, but why make the consumer concatenate?
Add to the viewer's logs directory a crashes.json file containing a JSON array of previous crash reports. If that file exists when the watchdog starts up, try to send its contents right away. Stop logging successful viewer termination. On detecting viewer crash, try to read static_debug_info.log as the basis of the new crash report blob. If we can find SecondLife.log, read its creation time, which we set as "Date", the start of the viewer run. Subtract from current time to get "Duration". Adapt logic from viewer-manager aka SLVersionChecker to get a hash of some sort for "MachineID", to group crash reports from the same computer. If we had trouble with any of the above, also record those "Errors". The new crash blob consists of an array of previous crashes plus the current report. Update crashes.json before attempting to send. If the send succeeded, delete crashes.json again. So far the "send" operation merely logs the JSON array of crashes.
As previously expressed, powershell perplexingly echoed its -Command argument to stdout, instead of executing it.
For a non-BugSplat build, the static_debug_info.log was being written to a dump-UUID subdirectory of the main logs directory. Keep that particular file in the main logs directory so the watchdog process can find it.
Turns out that on Windows, using st_birthtime or st_ctime gets you a timestamp going back to the first viewer run on that filesystem. Use st_mtime, which is current. But use st_mtime of SecondLife.start_marker instead of SecondLife.log: the latter's st_mtime will be at the end of the run, not the beginning. SecondLife.start_marker exists to record the beginning of a run.
We don't want watchdog.log to grow without bound.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
fixes #3426
replaces #3496 : we need the branch to be named "project/watchdog" so the PR will build.