Prevent garbage collection of WebView during testbed testing #2658

rmartin16 · 2024-06-15T17:02:13Z

Changes

As outlined in the references below, allowing WebKit-based WebViews to be garbage collected on GTK while the tests are running risks crashing the tests from WebKit raising a SIGABRT
This mitigates that crash occurring by maintaining a reference to the Toga WebView until testing has completed, thereby meaning the testing thread has joined, and subsequent garbage collection can only run in the main Python thread

References

PR Checklist:

All new features have been tested
All new features have been documented
I have read the CONTRIBUTING.md file
I will abide by the code of conduct

rmartin16 · 2024-06-15T19:06:48Z

So, this certainly isn't the "remove threading from testbed testing" approach postulated in #2655....but it should at least mitigate this specific crash with Gtk.

freakboy3742

To confirm I'm understanding this correctly: the fix here is essentially to make every instance of a WebView persistent for the lifespan of the test suite. Whenever a Toga WebView is created, a reference to that WebView is retained, so it won't be garbage collected (at least, not until the test suite finishes and the App is destroyed - which will be on the main GUI thread, so the crash issue won't exist). Have I understood this correctly?

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is? Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

rmartin16 · 2024-06-16T15:31:57Z

Have I understood this correctly?

Yes; this is intended to prevent garbage collection of the WebViews until after the test thread joins; thus, mitigating the risk that garbage collection for them happens in the test thread.

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is?

I ran briefcase build -ur --test between the measurements and ran the app directly using the command BRIEFCASE_MAIN_MODULE=tests.testbed build/testbed/ubuntu/jammy/testbed-0.0.1/usr/bin/testbed.

Memory usage from top is virtual mem, resident mem, and then shared mem. Graphs courtesy of valgrind and massif.

Memory consumption using `1060842` (`main`)

Memory usage for the testbed process from top after test suite finishes: 87.5GB 295256KB 133136KB

Memory consumption using `a99d686` (this PR)

Memory usage for the testbed process from top after test suite finishes: 88.0GB 316548KB 149968KB

Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)

In theory, this approach should help mitigate the crash and also help avoid the increased memory usage.

However, this is basically what the previous mitigation was trying to do:

toga/testbed/tests/widgets/test_webview.py

Lines 116 to 121 in 1060842

    
           if toga.platform.current_platform == "linux": 
        
               # On Gtk, ensure that the WebView is garbage collection before the next test 
        
               # case. This prevents a segfault at GC time likely coming from the test suite 
        
               # running in a thread and Gtk WebViews sharing resources between instances. 
        
               del widget 
        
               gc.collect()

In the context of the app thread, it was trying to delete the object and have garbage collection dispose of it there and then. But these objects were obviously leaking out of this and every once in a while the garbage collector would run in the test thread and dispose of them there causing this crash.

So, if we implement an app background task for this, I expect the occasional crash to still occur for the same reason.

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

I'm not sure I'm following; this is only fixing a crash in CI...do we want to tell users about it? And I guess fwiw, users are still completely exposed to the same issue happening to them...although, unless their testing is as robust as ours, they may be unlikely to ever see it...

freakboy3742 · 2024-06-16T23:24:43Z

If so, this will be a slow memory leak - every WebView instance will result in the generation of some unrecoverable memory. Have you got any metrics on how signficant this leak is?

I ran briefcase build -ur --test between the measurements and ran the app directly using the command BRIEFCASE_MAIN_MODULE=tests.testbed build/testbed/ubuntu/jammy/testbed-0.0.1/usr/bin/testbed.

Ok, that looks like it's observable, but not unmanageable. Obviously "no leak" would be preferable... I'd also like a pony :-) Under the circumstances (stopping an unpredictable hard crash during testing), this seems like a good compromise.

Would there be an for a background task in the main GUI thread that periodically polls _gc_protector, looking for instances that have a GC reference count of 1, and when they exist, drop that reference from the _gc_protector list, and run GC? That should guarantee that any GC occurs in the main thread (I think?)
...
So, if we implement an app background task for this, I expect the occasional crash to still occur for the same reason.

I see what you're driving at, but I'm not 100% sure that would be true. In the existing code, we're trying to force cleanup in the GUI thread, but we're not explicitly validating that the instance being deleted is the last reference. It's possible there are some stale links somewhere else in the system (in widget lookup tabled etc) that might not be cleaned up at the point the cleanup fixture runs; in which case, a subsequent GC run would cleanup up the stale reference and do GC that actually kills the object - and we have no guarantees that this GC pass is in the GUI thread.

That said - given the memory trace you've provided, I don't think this is enough of a problem to be worth spending the resources trying to find a perfect solution; I'm happy to chalk this up to "GTK annoyance" and leave it at that.

Regardless - given the nature of this crash, I'd be OK upgrading this to a bugfix, rather than a misc changenote.

I'm not sure I'm following; this is only fixing a crash in CI...do we want to tell users about it? And I guess fwiw, users are still completely exposed to the same issue happening to them...although, unless their testing is as robust as ours, they may be unlikely to ever see it...

I guess that's fair. I mostly wanted to recognise some really deep debugging work where the failure mode was explosive.

So - consider yourself recognized :-)

Prevent garbage collection of WebView during testbed testing

a99d686

rmartin16 force-pushed the mitigate-ci-webkit2-crash branch from cf400d2 to a99d686 Compare June 15, 2024 17:04

rmartin16 marked this pull request as ready for review June 15, 2024 18:54

freakboy3742 reviewed Jun 16, 2024

View reviewed changes

freakboy3742 approved these changes Jun 16, 2024

View reviewed changes

freakboy3742 merged commit b4c0c91 into beeware:main Jun 16, 2024
31 of 34 checks passed

This was referenced Jun 16, 2024

CI linux testbed crash testing #2655

Closed

Add testbed support for web backend #2662

Open

rmartin16 deleted the mitigate-ci-webkit2-crash branch June 17, 2024 13:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent garbage collection of WebView during testbed testing #2658

Prevent garbage collection of WebView during testbed testing #2658

rmartin16 commented Jun 15, 2024 •

edited

Loading

rmartin16 commented Jun 15, 2024

freakboy3742 left a comment

rmartin16 commented Jun 16, 2024

freakboy3742 commented Jun 16, 2024

Prevent garbage collection of WebView during testbed testing #2658

Prevent garbage collection of WebView during testbed testing #2658

Conversation

rmartin16 commented Jun 15, 2024 • edited Loading

Changes

References

PR Checklist:

rmartin16 commented Jun 15, 2024

freakboy3742 left a comment

Choose a reason for hiding this comment

rmartin16 commented Jun 16, 2024

Memory consumption using 1060842 (main)

Memory consumption using a99d686 (this PR)

freakboy3742 commented Jun 16, 2024

rmartin16 commented Jun 15, 2024 •

edited

Loading

Memory consumption using `1060842` (`main`)

Memory consumption using `a99d686` (this PR)