Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent garbage collection of WebView during testbed testing #2658

Merged
merged 1 commit into from
Jun 16, 2024

Conversation

rmartin16
Copy link
Member

@rmartin16 rmartin16 commented Jun 15, 2024

Changes

  • As outlined in the references below, allowing WebKit-based WebViews to be garbage collected on GTK while the tests are running risks crashing the tests from WebKit raising a SIGABRT
  • This mitigates that crash occurring by maintaining a reference to the Toga WebView until testing has completed, thereby meaning the testing thread has joined, and subsequent garbage collection can only run in the main Python thread

References

PR Checklist:

  • All new features have been tested
  • All new features have been documented
  • I have read the CONTRIBUTING.md file
  • I will abide by the code of conduct

@rmartin16 rmartin16 force-pushed the mitigate-ci-webkit2-crash branch from cf400d2 to a99d686 Compare June 15, 2024 17:04
@rmartin16 rmartin16 marked this pull request as ready for review June 15, 2024 18:54
@rmartin16
Copy link
Member Author

So, this certainly isn't the "remove threading from testbed testing" approach postulated in #2655....but it should at least mitigate this specific crash with Gtk.

Copy link
Member

@freakboy3742 freakboy3742 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm I'm understanding this correctly: the fix here is essentially to make every instance of a WebView persistent for the lifespan of the test suite. Whenever a Toga WebView is created, a reference to that WebView is retained, so it won't be garbage collected (at least, not until the test suite finishes and the App is destroyed - which will be on the main GUI thread, so the crash issue won't exist). Have I understood this correctly?

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is? Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

@rmartin16
Copy link
Member Author

Have I understood this correctly?

Yes; this is intended to prevent garbage collection of the WebViews until after the test thread joins; thus, mitigating the risk that garbage collection for them happens in the test thread.

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is?

I ran briefcase build -ur --test between the measurements and ran the app directly using the command BRIEFCASE_MAIN_MODULE=tests.testbed build/testbed/ubuntu/jammy/testbed-0.0.1/usr/bin/testbed.

Memory usage from top is virtual mem, resident mem, and then shared mem. Graphs courtesy of valgrind and massif.

Memory consumption using 1060842 (main)

Memory usage for the testbed process from top after test suite finishes: 87.5GB 295256KB 133136KB

image

Memory consumption using a99d686 (this PR)

Memory usage for the testbed process from top after test suite finishes: 88.0GB 316548KB 149968KB

image

Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)

In theory, this approach should help mitigate the crash and also help avoid the increased memory usage.

However, this is basically what the previous mitigation was trying to do:

if toga.platform.current_platform == "linux":
# On Gtk, ensure that the WebView is garbage collection before the next test
# case. This prevents a segfault at GC time likely coming from the test suite
# running in a thread and Gtk WebViews sharing resources between instances.
del widget
gc.collect()

In the context of the app thread, it was trying to delete the object and have garbage collection dispose of it there and then. But these objects were obviously leaking out of this and every once in a while the garbage collector would run in the test thread and dispose of them there causing this crash.

So, if we implement an app background task for this, I expect the occasional crash to still occur for the same reason.

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

I'm not sure I'm following; this is only fixing a crash in CI...do we want to tell users about it? And I guess fwiw, users are still completely exposed to the same issue happening to them...although, unless their testing is as robust as ours, they may be unlikely to ever see it...

@freakboy3742
Copy link
Member

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is?

I ran briefcase build -ur --test between the measurements and ran the app directly using the command BRIEFCASE_MAIN_MODULE=tests.testbed build/testbed/ubuntu/jammy/testbed-0.0.1/usr/bin/testbed.

Ok, that looks like it's observable, but not unmanageable. Obviously "no leak" would be preferable... I'd also like a pony :-) Under the circumstances (stopping an unpredictable hard crash during testing), this seems like a good compromise.

Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)
...
So, if we implement an app background task for this, I expect the occasional crash to still occur for the same reason.

I see what you're driving at, but I'm not 100% sure that would be true. In the existing code, we're trying to force cleanup in the GUI thread, but we're not explicitly validating that the instance being deleted is the last reference. It's possible there are some stale links somewhere else in the system (in widget lookup tabled etc) that might not be cleaned up at the point the cleanup fixture runs; in which case, a subsequent GC run would cleanup up the stale reference and do GC that actually kills the object - and we have no guarantees that this GC pass is in the GUI thread.

That said - given the memory trace you've provided, I don't think this is enough of a problem to be worth spending the resources trying to find a perfect solution; I'm happy to chalk this up to "GTK annoyance" and leave it at that.

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

I'm not sure I'm following; this is only fixing a crash in CI...do we want to tell users about it? And I guess fwiw, users are still completely exposed to the same issue happening to them...although, unless their testing is as robust as ours, they may be unlikely to ever see it...

I guess that's fair. I mostly wanted to recognise some really deep debugging work where the failure mode was explosive.

So - consider yourself recognized :-)

@freakboy3742 freakboy3742 merged commit b4c0c91 into beeware:main Jun 16, 2024
31 of 34 checks passed
@rmartin16 rmartin16 deleted the mitigate-ci-webkit2-crash branch June 17, 2024 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intermittent testbed failure in CI on Linux
2 participants