fix: resolve get_worker_id race by waiting for worker.json to get written (aws-deadline#133)

Problem:
 When starting a worker using PosixInstanceWorker it can sometimes be
the case that we query for the Worker's id before the worker.json file
has been written to disk. If this happens then the test will fail.

Solution:
 Repeatedly check for the worker.json file in a loop with increasing
delays, giving up after 11 checks spread over about a minute (55 seconds
of sleep in total).

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
ddneilson authored Aug 2, 2024
1 parent 0783917 commit 1f27578
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion src/deadline_test_fixtures/deadline/worker.py
@@ -717,7 +717,17 @@ def stop_worker_service(self):
         assert cmd_result.exit_code == 0, f"Failed to start Worker Agent service: {cmd_result}"

     def get_worker_id(self) -> str:
-        cmd_result = self.send_command("cat /var/lib/deadline/worker.json | jq -r '.worker_id'")
+        # There can be a race condition, so we may need to wait a little bit for the status file to be written.
+
+        worker_state_filename = "/var/lib/deadline/worker.json"
+        cmd_result = self.send_command(
+            " && ".join(
+                [
+                    f"t=0 && while [ $t -le 10 ] && ! (test -f {worker_state_filename}); do sleep $t; t=$[$t+1]; done",
+                    f"cat {worker_state_filename} | jq -r '.worker_id'",
+                ]
+            )
+        )
         assert cmd_result.exit_code == 0, f"Failed to get Worker ID: {cmd_result}"

         worker_id = cmd_result.stdout.rstrip("\n\r")
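
The shell loop in the hunk above sleeps 0, 1, ..., 10 seconds between checks for the state file and then reads it whether or not it appeared. As a rough illustration of that schedule only, here is a minimal local Python sketch. The function name `wait_for_worker_id`, its parameters, and the injectable `sleep` argument are hypothetical and not part of the repository, which runs the loop remotely through `send_command`.

```python
import json
import time
from pathlib import Path
from typing import Callable


def wait_for_worker_id(
    state_file: Path,
    max_delay: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll for state_file with linearly growing delays, then read worker_id.

    Hypothetical helper for illustration; it mirrors the shell loop
    `t=0; while [ $t -le 10 ] && ! test -f FILE; do sleep $t; t=t+1; done`.
    """
    t = 0
    while t <= max_delay and not state_file.is_file():
        sleep(t)  # delays are 0, 1, ..., max_delay seconds
        t += 1

    # Like the shell version, attempt the read even if the loop timed out.
    # Here a missing file surfaces as FileNotFoundError instead of a
    # non-zero exit code from `cat`.
    with state_file.open() as f:
        # Unlike `jq -r`, which prints "null" for a missing key,
        # this raises KeyError if "worker_id" is absent.
        return json.load(f)["worker_id"]
```

With the default `max_delay=10`, the worst case is 11 existence checks and 55 seconds of sleeping before the final read is attempted. Passing a fake `sleep` makes the schedule easy to inspect without waiting.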