
Local dispatcher improvements fixes #17

Closed · wants to merge 38 commits

Conversation
Conversation

johnhaddon (Owner)

Generally describe what this PR will do, and why it is needed

  • List specific new features and changes to project components

Related issues

  • List any Issues this PR addresses or solves

Dependencies

  • List any other unmerged PRs that this PR depends on

Breaking changes

  • List any breaking API/ABI changes (and apply the pr-majorVersion label)

Checklist

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have tested my change(s) in the test suite, and added new test cases where necessary.
  • My code follows the Gaffer project's prevailing coding style and conventions.

This is basically the window that the "Execute/View Local Jobs" menu item used to show, but refactored into a dockable editor panel. It still leaves a lot to be desired.
This removes a fair bit of duplicate code. There were quite a few weird little differences in the reporting logic in particular, which I could see no reason for, so they are gone. The two distinctions between foreground and background dispatches that remain are :

- Background execution runs tasks in a subprocess, and polls them while checking for cancellation (a sketch of this polling loop follows below).
- Background execution suppresses exceptions (while still reporting them via the job log), while foreground execution propagates them.

- This appears to be completely unused.
- A `jobStatusChangedSignal()` would be more generally useful.
- But signals about status changing would make more sense on the Jobs themselves. We'll deal with this in a later commit.
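
A minimal sketch of the polling loop mentioned in the first point above, not Gaffer's actual implementation; the `executeBatch` name, the poll interval and the `canceller` argument are illustrative :

```
import subprocess
import time

import IECore

def executeBatch( command, canceller ) :

	# Run the batch in a subprocess, polling it so that we can
	# respond to cancellation requests while it runs.
	process = subprocess.Popen( command )
	try :
		while process.poll() is None :
			# Raises `IECore.Cancelled` if cancellation was requested.
			IECore.Canceller.check( canceller )
			time.sleep( 0.1 )
	except IECore.Cancelled :
		process.terminate()
		raise

	if process.returncode != 0 :
		raise RuntimeError( "Batch failed with exit status {}".format( process.returncode ) )
```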
We were storing a status per batch, but in fact the only status we ever
reported to the outside world (via `failed()` and `killed()`) was the
status of the root batch. And to get the root batch into a failed state
we were going via a convoluted route whereby `__reportFailed()` would
mark the current batch as failed, and call `JobPool._failed()` which would
in turn call back to `Job._fail()` to mark the root batch as failed. Much
easier to manage a single `self.__status` directly, and limit the per-batch
data to a simple flag that tells us if we've executed it yet or not.

At the same time, we explicitly store the current batch instead of having
to do a search for it every time. Note that we're assuming that attribute assignment (for `self.__currentBatch`) is atomic (it is for CPython) rather than using a mutex to synchronize access between the background and UI threads.
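
A toy illustration of that threading assumption (class and method bodies simplified, not the real code) :

```
class Job :

	def __init__( self ) :
		self.__currentBatch = None

	# Runs on the background thread during execution.
	def __executeWalk( self, batch ) :
		# In CPython, attribute assignment is a single `STORE_ATTR`
		# bytecode executed under the GIL, so readers on other threads
		# always see either the old value or the new one. No mutex needed.
		self.__currentBatch = batch
		# ... execute the batch ...
		self.__currentBatch = None

	# May be called from the UI thread at any time.
	def currentBatch( self ) :
		return self.__currentBatch
```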
Instead, just keep failed jobs in the main jobs list. This prevents them
from jumping around in the LocalJobs UI, and simplifies the API too.
Also move icons into their own section, and spruce up the other icons a little bit.

Otherwise you get no opportunity to see the job logs in the LocalJobs
panel, and for quick jobs it's not even clear if the job happened. Also
add a `Job.status()` method since there was no public way of determining
running/complete status before (it was assumed that anything in the list
that wasn't killed or failed was running).

Remove `Job.killed()` and `Job.failed()` since they are superseded by
`Job.status()`.
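
The new accessor might look something like this; the exact `Status` values are an assumption based on the states discussed in these commits :

```
import enum

class Job :

	class Status( enum.Enum ) :
		Running = "running"
		Complete = "complete"
		Failed = "failed"
		Killed = "killed"

	def __init__( self ) :
		self.__status = Job.Status.Running

	def status( self ) :
		# Single public accessor, replacing the old `killed()` and
		# `failed()` methods, and finally distinguishing running
		# jobs from completed ones.
		return self.__status
```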
This gives us a more natural cancellation mechanism, and means we're
using a standard TBB worker thread and therefore are not creating unnecessary thread locals.

Since cancellation operates via exceptions, it's more natural to also allow error exceptions to propagate out of `__executeWalk()` rather than using a `False` return value. This also means that all management of `self.__status` can be consolidated into one place in `__executeInternal()`.
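
A hypothetical sketch of that consolidation, reusing the `Status` enum from the previous sketch; `__executeWalk()`, `__rootBatch` and the message context are assumptions :

```
import IECore

class Job :

	def __executeInternal( self ) :
		# The only place where `self.__status` is modified : errors and
		# kills both arrive here as exceptions from `__executeWalk()`.
		try :
			self.__executeWalk( self.__rootBatch )
		except IECore.Cancelled :
			self.__status = Job.Status.Killed
		except Exception as e :
			IECore.msg( IECore.Msg.Level.Error, "LocalDispatcher", str( e ) )
			self.__status = Job.Status.Failed
		else :
			self.__status = Job.Status.Complete
```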
This should not have been part of the public API - it only exists so that `doDispatch()` can initiate execution _after_ adding the job to the pool. Nobody needs to call it again after that, and definitely not client code.
And make `removeJob()` public since it is called by the UI, and add public `addJob()` method for symmetry.

The Log tab is more useful.

We'll be updating using our signal handlers anyway.
And refactor the other updates to make it clearer what is being updated and why.

Before they were only updated when selecting a job, so you didn't get any streaming feedback as to what a job was doing.

- Prefix batch messages with the node name, not the job name
- Include execution time in batch completion message
- Output stack traces line-by-line to work better with `MessagesWidget.Role.Log` (see the sketch after this list).
- Add debug message containing subprocess command lines.
- Omit batch messages for root batch
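
A rough sketch of the line-by-line trace reporting; the free function and its `context` argument are hypothetical, while `traceback.format_exc()` and `IECore.msg()` are real APIs :

```
import traceback

import IECore

# Intended to be called from within an `except` block.
def reportException( context ) :
	# Emit each line of the stack trace as an individual message, so
	# that MessagesWidget.Role.Log can lay out and filter the trace
	# line by line.
	for line in traceback.format_exc().splitlines() :
		IECore.msg( IECore.Msg.Level.Error, context, line )
```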
The LocalJobs Properties tab now contains only static properties of the job, so it is no longer a bug that we're not updating it dynamically when the current batch changes.

- Move just below the job listing, since that's what they operate on.
- Remove the draggable splitter handle separating the buttons from the widget above.
- Reduce size.
- Update appropriately when job status changes.

Without this, the BackgroundTask and `outputHandler` threads would continue running while Python was torn down around them on the main thread. This led to crashes when trying to release the GIL from the BackgroundTask.

The Changes.md entry doesn't refer to crashes, because the behaviour in 1.3 was different due to us using a Python thread instead of a BackgroundTask. In 1.3 Python would wait patiently for the Job thread to finish, even if that meant waiting a very very very long time. _Unless_ the user hit `Ctrl+C` again, in which case it would forcibly quit, leaving behind zombie task processes.

I was in two minds as to whether or not to apply the exit handler to all JobPools or just the default one. Since we don't have any non-default pools outside of the unit tests I don't really have any information to base a decision on, so went with only the default pool for now.
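
A sketch of such an exit handler; `defaultJobPool()`, `jobs()` and `kill()` are assumed accessor names rather than confirmed API :

```
import atexit

import GafferDispatch

def __shutdownDefaultJobPool() :
	# Kill any jobs still running in the default pool, so that their
	# BackgroundTask and output-handling threads are stopped before
	# Python tears down the state they depend on.
	for job in GafferDispatch.LocalDispatcher.defaultJobPool().jobs() :
		job.kill()

atexit.register( __shutdownDefaultJobPool )
```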
Strictly speaking, no such View should ever exist, because the `in` plug is added in the View constructor. But `ImageView::insertConverter()` currently replaces it with another plug, meaning that it is missing when the replacement is added, which is when `cancelAffectedTasks()` is called. This was causing a crash in `GafferImageUITest.ImageViewTest.testDeriving`. The problem had gone unnoticed until now because previously there were no BackgroundTasks to cancel, so we never got as far as the bad check. But now that the LocalDispatcher uses a BackgroundTask, some completed tasks lie dormant waiting for garbage collection after LocalDispatcherTest has run, and we hit the null dereference.

johnhaddon and others added 8 commits November 28, 2023 12:03
This is probably neither here nor there in actual usage in the UI, but it is
important for running the unit tests when the UI module has been imported. That's
because `GafferUI.EventLoop` unconditionally installs a UIThreadCallHandler that
just buffers the calls until an event loop is started. That prolonged Job lifetimes
way outside of their associated tests, eventually causing this spooky action at a
distance :

```
Traceback (most recent call last):
File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1138, in testComputeNodeCastDoesntRequirePython
    node = CastChecker()
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1129, in __init__
    self["out"] = Gaffer.IntPlug( direction = Gaffer.Plug.Direction.Out )
IECore.Exception: Traceback (most recent call last):
  File "/__w/gaffer/gaffer/build/python/GafferTest/BoxTest.py", line 1133, in isInstanceOf
    raise Exception( "Cast to ComputeNode should not require Python" )
Exception: Cast to ComputeNode should not require Python
```

How did that happen? Well, a Job contains a BackgroundTask, and `self["out"]` is
a graph edit, and graph edits cancel background tasks, and the cancellation code does a `runTimeCast()`
on the subject of the edit. So if a BackgroundTask exists thanks to EventLoop, then
the cancellation code runs and does a `runTimeCast()` on `CastChecker`, which throws.

It really would be better if EventLoop scoped its UIThreadCallHandler between `start()` and `stop()`.
We were already doing this in a few specific tests but are now doing it for all. Our intention is still that our Python code should never make circular references between Python objects (and we have checks for that in `GafferUITest.TestCase.tearDown()` among other places), so we're not calling `gc.collect()`. But we can't do anything about the unfortunate circular references created by RefCountedBinding itself, and it's preferable to have those broken predictably at the end of the test rather than unpredictably when a later test runs.

The immediate motivation for doing this is to destroy the `LocalDispatcher` instances and their associated `JobPool` in `LocalDispatcherTest`, before we run any other tests. If we don't, then the completed BackgroundTasks in the jobs cause a failure in `BoxTest.testComputeNodeCastDoesntRequirePython`. See previous commit for more details.
If we use the default pool, then the internal jobs show up confusingly in the LocalJobs editor. This wasn't a problem before as completed jobs were being removed immediately.
This pollutes other tests with old Job objects hanging around in the pool.
The only exception is `LocalDispatcherTest.testShutdownDuringBackgroundDispatch()`
which is testing functionality of the default pool, and does so in a subprocess.
The `ps` subprocess was erroring about unknown arguments, which then caused us to return an empty dict.
This was causing `LocalDispatcherTest.testShutdownDuringBackgroundDispatch()`
to hang forever, waiting to receive the PID.
@johnhaddon johnhaddon force-pushed the localDispatcherImprovementsFixes branch from 8147cbb to 3486005 on November 28, 2023 18:37
@johnhaddon johnhaddon closed this Dec 17, 2023
@johnhaddon johnhaddon deleted the localDispatcherImprovementsFixes branch March 15, 2024 14:10