Fixing Worker Pool race condition for assignments #875
Overview
An awkward race condition existed that allowed Mephisto to attempt to launch an `Assignment` before the `WorkerPool` had actually registered all of the `Agent`s for that `Assignment` into its local `self.agents` dictionary. This could occur if the last `_assign_unit_to_agent` thread for an `Assignment` ran during the `await`-wrapped `get_init_data_for_agent` call of the second-to-last `_assign_unit_to_agent` for the same `Assignment`. Then the second-to-last thread would resume during the last call's `get_init_data_for_agent`. The outcome is that `loop.run_in_executor(None, assignment.get_agents)` would return all the `Agent`s, but one wouldn't yet be added to the `WorkerPool`'s tracking.
Fix Details
Rather than just using the `None` check for the `Assignment.get_agents()` call, we do a `get` check for the `WorkerPool`'s agents as well, then bail early if any haven't been set.
Testing:
Automated testing
It would be nice if there were an `asyncio` testing library that intentionally took unusual routes (for example, by adding `asyncio.sleep` calls of random durations to all `await` calls) to catch this kind of thing sooner...
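As a rough illustration of that idea, here is a minimal, self-contained sketch of how randomized yielding can surface this kind of interleaving, and how a `get`-style check on local tracking avoids it. Everything in it is hypothetical: `ToyPool`, `jitter`, `run_once`, and `bad_seeds` are toy stand-ins written for this example, not Mephisto's actual `WorkerPool` code. For reproducibility it yields to the event loop a seeded random number of times via `asyncio.sleep(0)` rather than sleeping for random durations.

```python
import asyncio
import random


async def jitter(rng, max_yields=4):
    """Yield to the event loop a random number of times (0..max_yields)."""
    for _ in range(rng.randint(0, max_yields)):
        await asyncio.sleep(0)


class ToyPool:
    """Hypothetical stand-in for a worker pool; NOT Mephisto's real class."""

    def __init__(self, expected, rng, guarded):
        self.assigned = []      # what a toy "assignment.get_agents()" would report
        self.agents = {}        # local tracking, filled only after init data arrives
        self.expected = expected
        self.rng = rng
        self.guarded = guarded
        self.launches = []      # snapshot of locally tracked agents at each launch

    async def assign(self, agent_id):
        await jitter(self.rng)             # agents arrive at different times
        self.assigned.append(agent_id)     # visible to the assignment right away
        await jitter(self.rng)             # stands in for the awaited init-data call
        self.agents[agent_id] = object()   # only now tracked locally

        if len(self.assigned) < self.expected:
            return  # not every agent has joined the assignment yet
        if self.guarded and any(self.agents.get(a) is None for a in self.assigned):
            return  # bail early: some agent is not registered locally yet
        self.launches.append(set(self.agents))


async def run_once(seed, guarded):
    pool = ToyPool(expected=2, rng=random.Random(seed), guarded=guarded)
    await asyncio.gather(pool.assign("agent-1"), pool.assign("agent-2"))
    return pool.launches


def bad_seeds(guarded, seeds=range(200)):
    """Return the seeds for which a launch happened with incomplete tracking."""
    bad = []
    for seed in seeds:
        launches = asyncio.run(run_once(seed, guarded))
        if any(len(tracked) < 2 for tracked in launches):
            bad.append(seed)
    return bad


if __name__ == "__main__":
    print("unguarded seeds with a premature launch:", len(bad_seeds(guarded=False)))
    print("guarded seeds with a premature launch:", len(bad_seeds(guarded=True)))
```

Because each run is driven by a seeded `random.Random`, any seed that triggers a premature launch in the unguarded variant can be replayed exactly, which is the main appeal of this style of test over waiting for the interleaving to occur in production.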