Dragon launcher #580

al-rigazzi · 2024-05-10T22:41:40Z

This PR adds a Dragon-based launcher to SmartSim.

@al-rigazzi

This is the first prototype of the new Dragon-based launcher. The batch launch is still not available for Dragon.

[ committed by @al-rigazzi @ankona @MattToast ]
[ reviewed by @MattToast @ankona @al-rigazzi ]

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Christopher McBride <christopher.mcbride@gmail.com>

@ankona

1. ZMQ authenticators appear to have clashing inproc addresses when using the `zmq.Context.instance()` factory method. Replaced as needed.
2. Updated the underlying `Dragon` library version, which included a breaking change that caused the swap from `TemplateProcess` to `ProcessTemplate`.
3. Fixed an incomplete permission set on curve key files.

[ committed by @ankona ]
[ reviewed by @MattToast @al-rigazzi ]
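To illustrate the third item, here is a minimal sketch of restricting a key file to its owner. The helper `restrict_key_file` and the file name `server.key_secret` are hypothetical and not taken from the SmartSim code. The sketch only assumes that private key material should be readable and writable by its owner alone; the exact permission set fixed in this change is not stated above.

```python
import os
import stat
import tempfile
from pathlib import Path


def restrict_key_file(path: Path) -> int:
    """Limit a key file to owner read/write (0o600) and return the new mode.

    Hypothetical helper for illustration; not SmartSim's implementation.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(path.stat().st_mode)


with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "server.key_secret"  # hypothetical file name
    key_file.write_text("placeholder, not a real key")
    key_mode = restrict_key_file(key_file)
```

On a POSIX file system, `key_mode` ends up with only the owner read and write bits set.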

@ankona

Fix a defect in retrieving status updates for the dragon launcher.

Pre-dragon launchers used the task/step name to retrieve updates while the dragon launcher uses the `task_id`. This fix ensures that the name for dragon tasks is mapped appropriately.

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
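The name-to-id mapping described here can be sketched as follows. This is an illustration under assumptions, not the code from the fix: `NameToTaskId`, `get_status`, the step names, and the status strings are all hypothetical.

```python
from typing import Dict, Optional


class NameToTaskId:
    """Hypothetical lookup table from step name to launcher task id."""

    def __init__(self) -> None:
        self._task_ids: Dict[str, str] = {}

    def add(self, step_name: str, task_id: str) -> None:
        self._task_ids[step_name] = task_id

    def task_id_for(self, step_name: str) -> Optional[str]:
        return self._task_ids.get(step_name)


def get_status(step_name: str, mapping: NameToTaskId, statuses: Dict[str, str]) -> str:
    """Resolve the step name to a task id, then look the status up by id."""
    task_id = mapping.task_id_for(step_name)
    if task_id is None:
        return "UNKNOWN"
    return statuses.get(task_id, "UNKNOWN")


mapping = NameToTaskId()
mapping.add("model-0", "dragon-task-42")
backend_statuses = {"dragon-task-42": "RUNNING"}  # keyed by task id, not by name

status = get_status("model-0", mapping, backend_statuses)
missing = get_status("model-1", mapping, backend_statuses)
```

Callers keep asking for updates by name, and the translation to `task_id` happens in one place.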

@ankona

Reorder experiment startup to ensure telemetry monitor registers event listeners prior to launching entities. [ committed by @ankona ] [ approved by @MattToast ]
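A toy model of why the ordering matters, assuming an event source that does not replay events to listeners registered later. `EventBus` and `start` are hypothetical names; the real telemetry monitor is not modeled here.

```python
from typing import Callable, List


class EventBus:
    """Toy event bus: events emitted with no listener registered are lost."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def emit(self, event: str) -> None:
        for listener in self._listeners:
            listener(event)


def start(bus: EventBus, entities: List[str], register_first: bool) -> List[str]:
    """Launch entities and return the events the listener observed."""
    seen: List[str] = []
    if register_first:
        bus.register(seen.append)
    for name in entities:
        bus.emit(f"start:{name}")  # stands in for launching an entity
    if not register_first:
        bus.register(seen.append)  # too late: start events are already gone
    return seen


late = start(EventBus(), ["a", "b"], register_first=False)
early = start(EventBus(), ["a", "b"], register_first=True)
```

With late registration the listener sees nothing; registering first captures both start events.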

@ankona

[ committed by @ankona ] [ reviewed by @al-rigazzi ]

@ankona

Update the dragon entrypoint to ensure that the log file is removed when the environment is shut down.

Additional updates:
- minor refactor to enable testing entrypoint features
- add tests for entrypoint functions
- update incorrect license clause

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
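One way to guarantee this kind of cleanup is a context manager with a `finally` clause. The sketch below is an assumption about the general pattern, not the entrypoint's actual code; `transient_log` and the file name `dragon_entrypoint.log` are hypothetical.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def transient_log(path: Path) -> Iterator[Path]:
    """Create a log file and remove it on exit, even if the body raises."""
    path.touch()
    try:
        yield path
    finally:
        path.unlink(missing_ok=True)  # requires Python 3.8+


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "dragon_entrypoint.log"  # hypothetical file name
    with transient_log(log_path) as log:
        log.write_text("environment running\n")
        existed_during_run = log.exists()
    exists_after_shutdown = log_path.exists()
```

Keeping the cleanup in a small function like this also makes it easy to test in isolation.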

@al-rigazzi

Add build option to `smart` CLI for installation of Dragon runtime.

### Additional Changes

- minor extract-method refactor to avoid `too-many-statements` linter issue

### Expected Output

```bash
(ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
[SmartSim] INFO Running SmartSim build process...
[SmartSim] INFO Checking requested versions...
[SmartSim] INFO Checking for build tools...
[SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
[SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
[SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Dragon installation complete
[SmartSim] INFO Redis build complete!

ML Backends Requested
╒════════════╤════════╤═══════╕
│ PyTorch    │ 2.0.1  │ True  │
│ TensorFlow │ 2.13.1 │ True  │
│ ONNX       │ 1.14.1 │ False │
╘════════════╧════════╧═══════╛

Building for GPU support: False

[SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
[SmartSim] INFO ML Backends and RedisAI build complete!
[SmartSim] INFO Tensorflow, Torch backend(s) built
[SmartSim] INFO SmartSim build complete!
```

---------

Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
Co-authored-by: amandarichardsonn <30413257+amandarichardsonn@users.noreply.github.com>
Co-authored-by: Matt Drozt <drozt@hpe.com>

[ reviewed by @al-rigazzi @MattToast ]
[ committed by @ankona ]

This PR actually adds several things:
- stdout and stderr redirect of Dragon-launched processes
- `DragonBatchStep` with logic to keep track of batch jobs run through SLURM and PBS
- some more env variables were added to `CONFIG` to help with launching dragon with options
- some mitigation of Authenticator's locking behavior is put in place
- a cooldown period was added to the `DragonBackend` to make sure telemetry monitor can get updates before it shuts down
- the `DragonBackend` status is now a string representation of two tables, one for hosts (indicating Free/Busy status) and one for ProcessGroups (similar to standard WLM output)
- documentation was added for Dragon.

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Amanda Richardson <amanda.richardson@hpe.com>
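As a rough picture of the two-table status string, here is a self-contained sketch. It is not the `DragonBackend` implementation: the functions `render_table` and `status_message`, the column names, and the sample host and step values are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def render_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a plain-text table with left-aligned, padded columns."""
    columns = list(zip(headers, *rows))
    widths = [max(len(cell) for cell in column) for column in columns]

    def fmt(row: Sequence[str]) -> str:
        return " | ".join(cell.ljust(width) for cell, width in zip(row, widths))

    separator = "-+-".join("-" * width for width in widths)
    return "\n".join([fmt(headers), separator, *(fmt(row) for row in rows)])


def status_message(hosts: Dict[str, str], groups: List[Tuple[str, str, str]]) -> str:
    """Join a host table and a process-group table, separated by a blank line."""
    host_table = render_table(("Host", "Status"), sorted(hosts.items()))
    group_table = render_table(("Step", "Status", "Hosts"), groups)
    return f"{host_table}\n\n{group_table}"


message = status_message(
    {"node2": "Free", "node1": "Busy"},  # hypothetical hosts
    [("step-0", "Running", "node1")],  # hypothetical process group
)
```

Returning a single string keeps the status easy to log or print, at the cost of callers having to parse it if they need structured data.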

mellis13

LGTM (after the tests finish passing). Lots of incredible work in this PR!

al-rigazzi and others added 15 commits April 4, 2024 15:21

Merge branch 'develop' into dragon_launcher

547e20a

Fix telemetry monitor listener registration timeline (#549)

b473413

Reorder experiment startup to ensure telemetry monitor registers event listeners prior to launching entities. [ committed by @ankona ] [ approved by @MattToast ]

set correct value for curve server public key (#553)

5ef4af5

[ committed by @ankona ] [ reviewed by @al-rigazzi ]

Merge branch 'develop' into dragon_launcher

9312176

formatting

a550528

fix merge conflict fail

4cc9431

fix bad merge conflict resolution

c19651e

Merge branch 'develop' into dragon_launcher

390f9cf

Merge branch 'develop' into dragon_launcher

0096a25

al-rigazzi requested review from mellis13 and ankona May 11, 2024 09:41

al-rigazzi added 2 commits May 12, 2024 09:49

Update changelog

e6dd26c

Restore changelog checker

6428013

al-rigazzi requested a review from ashao May 12, 2024 07:53

al-rigazzi added 2 commits May 12, 2024 10:01

Fix conftest for non-dragon launchers

3fa00e9

Rollback wrong if in conftest

f62f2f3

mellis13 approved these changes May 13, 2024


al-rigazzi added labels May 13, 2024: type: feature (Issues that include feature request or feature idea), area: launcher (Issues related to any of the launchers within SmartSim), area: api (Issues related to API changes), area: Dragon

al-rigazzi merged commit 8606e8e into develop May 13, 2024
34 checks passed

al-rigazzi deleted the dragon_launcher branch September 11, 2024 22:49
