Skip to content

Commit

Permalink
Fix telemetry monitor listener registration timeline (#549)
Browse files Browse the repository at this point in the history
Reorder experiment startup to ensure telemetry monitor registers event
listeners prior to launching entities.

[ committed by @ankona ]
[ approved by @MattToast ]
  • Loading branch information
ankona authored Apr 16, 2024
1 parent ca38b8a commit b473413
Show file tree
Hide file tree
Showing 3 changed files with 7 additions and 16 deletions.
3 changes: 3 additions & 0 deletions doc/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ To be released at some future point in time

Description

- Fix race condition for telemetry monitor
- Update watchdog dependency
- Historical output files stored under .smartsim directory
- Add option to build Torch backend without the Intel Math Kernel Library
Expand All @@ -39,6 +40,7 @@ Description

Detailed Notes

- Ensure the telemetry monitor is started prior to launching entities (SmartSim-PR549_)
- Update watchdog dependency from 3.x to 4.x, fix new type issues (SmartSim-PR540_)
- The dashboard needs to display historical logs, so log files are written
out under the .smartsim directory and files under the experiment
Expand Down Expand Up @@ -95,6 +97,7 @@ Detailed Notes
handler. SmartSim will now attempt to kill any launched jobs before calling
the previously registered signal handler. (SmartSim-PR535_)

.. _SmartSim-PR549: https://github.com/CrayLabs/SmartSim/pull/549
.. _SmartSim-PR540: https://github.com/CrayLabs/SmartSim/pull/540
.. _SmartSim-PR532: https://github.com/CrayLabs/SmartSim/pull/532
.. _SmartSim-PR538: https://github.com/CrayLabs/SmartSim/pull/538
Expand Down
8 changes: 4 additions & 4 deletions smartsim/_core/control/controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,10 @@ def start(
The controller will start the job-manager thread upon
execution of all jobs.
"""
# launch a telemetry monitor to track job progress
if CONFIG.telemetry_enabled:
self._start_telemetry_monitor(exp_path)

if isinstance(self._launcher, DragonLauncher):
self._launcher.connect_to_dragon(exp_path)

Expand All @@ -143,10 +147,6 @@ def start(
launched.map(_look_up_launched_data(self._launcher))
)

# launch a telemetry monitor to track job progress
if CONFIG.telemetry_enabled:
self._start_telemetry_monitor(exp_path)

# block until all non-database jobs are complete
if block:
# poll handles its own keyboard interrupt as
Expand Down
12 changes: 0 additions & 12 deletions tests/test_telemetry_monitor.py
Original file line number Diff line number Diff line change
Expand Up @@ -888,18 +888,6 @@ def test_telemetry_db_and_model(fileutils, test_dir, wlmutils, monkeypatch, conf
try:
exp.start(orc)

# TODO: This sleep is necessary as there is race condition between
# SmartSim and the TM when launching and adding a managed
# task into their respective JMs for tracking. Essentially,
# the TM does not have enough time register file listeners
# before the manifest is updated with the start of
# `smartsim_model` when launching through a Launcher that
# does devolve into a simple system call from the driver
# script process (e.g. the dragon launcher)
# FIXME: THIS NEEDS TO BE REMOVED AND THIS TEST NEEDS TO PASS
# CONSISTENTLY BEFORE WE CAN SHIP A DRAGON LAUNCHER.
time.sleep(1)

# create run settings
app_settings = exp.create_run_settings(sys.executable, test_script)
app_settings.set_nodes(1)
Expand Down

0 comments on commit b473413

Please sign in to comment.