Dragon launcher #580

Merged
merged 19 commits into develop from dragon_launcher
May 13, 2024

al-rigazzi
Collaborator

This PR adds a Dragon-based launcher to SmartSim.

al-rigazzi and others added 15 commits April 4, 2024 15:21
This is the first prototype of the new Dragon-based launcher. Batch launch is not yet available for Dragon.

[ committed by @al-rigazzi @ankona @MattToast ]
[ reviewed by @MattToast @ankona @al-rigazzi ]

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Christopher McBride <christopher.mcbride@gmail.com>
1. ZMQ authenticators appear to have clashing inproc addresses when
using the `zmq.Context.instance()` factory method. Replaced as needed.
2. Updated the underlying `Dragon` library version, which included a
breaking change that required swapping `TemplateProcess` for
`ProcessTemplate`.
3. Fixed an incomplete permission set on curve key files.
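
As an illustration of item 3, here is a minimal sketch of creating a secret key file that only its owner can read or write. It is not the code from this PR: the helper names `write_private_key` and `is_owner_only`, the file name, and the choice of mode `0o600` are assumptions made for the example, and the permission check is only meaningful on POSIX systems.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_private_key(path: Path, contents: bytes) -> None:
    """Create `path` readable and writable by the owner only (0o600)."""
    # O_EXCL refuses to reuse an existing file, so a file that was created
    # earlier with looser permissions cannot be silently inherited.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as key_file:
        key_file.write(contents)
    # The mode given to os.open is filtered by the process umask, so set
    # the final mode explicitly as well.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: Path) -> bool:
    """Return True when group and other have no permission bits set."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


with tempfile.TemporaryDirectory() as key_dir:
    # Hypothetical file name; the contents are a placeholder, not a key.
    key_path = Path(key_dir) / "server.key_secret"
    write_private_key(key_path, b"placeholder")
    print(is_owner_only(key_path))
```

Checking the mode after writing, rather than trusting the creation flags, is what catches a partially applied permission set.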

[ committed by @ankona ]
[ reviewed by @MattToast @al-rigazzi ]
## Fix a defect in retrieving status updates for the dragon launcher

Pre-dragon launchers used the task/step name to retrieve updates while
the dragon launcher uses the `task_id`. This fix ensures that the name
for dragon tasks is mapped appropriately.
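
The general idea can be sketched as a small lookup table that translates a step name into the launcher-assigned ID before querying statuses. This is only an illustration under assumptions: `StepRegistry`, `lookup_status`, and the sample names and IDs are hypothetical and do not come from SmartSim.

```python
from typing import Dict, Optional


class StepRegistry:
    """Hypothetical mapping from user-facing step names to task IDs."""

    def __init__(self) -> None:
        self._task_ids: Dict[str, str] = {}

    def add(self, step_name: str, task_id: str) -> None:
        # Record the ID at launch time, when both values are known.
        self._task_ids[step_name] = task_id

    def task_id_for(self, step_name: str) -> Optional[str]:
        return self._task_ids.get(step_name)


def lookup_status(
    registry: StepRegistry,
    statuses_by_task_id: Dict[str, str],
    step_name: str,
) -> Optional[str]:
    """Resolve a step name to its task ID, then fetch that task's status."""
    task_id = registry.task_id_for(step_name)
    if task_id is None:
        return None  # the step was never launched through this registry
    return statuses_by_task_id.get(task_id)


registry = StepRegistry()
registry.add("model_0", "task-123")
print(lookup_status(registry, {"task-123": "RUNNING"}, "model_0"))
```

Callers keep asking by name, as they do for the other launchers, while the ID stays an internal detail of the lookup.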

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Reorder experiment startup to ensure telemetry monitor registers event
listeners prior to launching entities.

[ committed by @ankona ]
[ approved by @MattToast ]
Update the dragon entrypoint to ensure that the log file is removed when
the environment is shut down.
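
One common way to guarantee this kind of cleanup is a context manager whose `finally` clause removes the file however the body exits. The sketch below is an assumption about the general technique, not the entrypoint code from this PR; `removed_on_exit` and the `dragon.log` file name are hypothetical.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def removed_on_exit(log_path: Path) -> Iterator[Path]:
    """Create `log_path` and delete it when the block exits for any reason."""
    log_path.touch()
    try:
        yield log_path
    finally:
        # missing_ok (Python 3.8+) tolerates the file having been removed
        # already, for example by a manual cleanup.
        log_path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as work_dir:
    path = Path(work_dir) / "dragon.log"  # hypothetical file name
    with removed_on_exit(path) as log_file:
        log_file.write_text("environment running\n")
    print(path.exists())
```

Because the removal sits in `finally`, it also runs when the body raises, which makes the behavior straightforward to cover in unit tests.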

Additional updates:
- minor refactor to enable testing entrypoint features
- add tests for entrypoint functions
- update incorrect license clause

[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Add build option to `smart` CLI for installation of Dragon runtime.

### Additional Changes
- minor extract-method refactor to avoid `too-many-statements` linter
issue

### Expected Output

```bash
(ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
[SmartSim] INFO Running SmartSim build process...
[SmartSim] INFO Checking requested versions...
[SmartSim] INFO Checking for build tools...
[SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
[SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
[SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Dragon installation complete
[SmartSim] INFO Redis build complete!

ML Backends Requested
╒════════════╤════════╤═══════╕
│ PyTorch    │ 2.0.1  │ True  │
│ TensorFlow │ 2.13.1 │ True  │
│ ONNX       │ 1.14.1 │ False │
╘════════════╧════════╧═══════╛

Building for GPU support: False

[SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
[SmartSim] INFO ML Backends and RedisAI build complete!
[SmartSim] INFO Tensorflow, Torch backend(s) built
[SmartSim] INFO SmartSim build complete!
```

---------

Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
Co-authored-by: amandarichardsonn <30413257+amandarichardsonn@users.noreply.github.com>
Co-authored-by: Matt Drozt <drozt@hpe.com>

[ reviewed by @al-rigazzi @MattToast ]
[ committed by @ankona ]
This PR adds several things:
- stdout and stderr redirection for Dragon-launched processes
- `DragonBatchStep`, with logic to keep track of batch jobs run through
SLURM and PBS
- additional environment variables in `CONFIG` to help with launching
Dragon with options
- some mitigation of the Authenticator's locking behavior
- a cooldown period in the `DragonBackend` to make sure the telemetry
monitor can get updates before it shuts down
- the `DragonBackend` status is now a string representation of two
tables, one for hosts (indicating Free/Busy status) and one for
ProcessGroups (similar to standard WLM output)
- documentation for Dragon
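
To make the two-table status string concrete, here is a rough sketch that renders a host table and a process group table as plain text. The column headers, the sample hosts and groups, and the helpers `format_table` and `status_message` are all assumptions for illustration; the actual `DragonBackend` output format is not shown in this PR description and may differ.

```python
from typing import Dict, Sequence


def format_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render rows as a plain-text table with left-aligned columns."""
    # zip(headers, *rows) walks the table column by column.
    widths = [max(len(cell) for cell in column) for column in zip(headers, *rows)]

    def render(cells: Sequence[str]) -> str:
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return " | ".join(padded).rstrip()

    separator = "-+-".join("-" * width for width in widths)
    return "\n".join([render(headers), separator, *map(render, rows)])


def status_message(host_busy: Dict[str, bool], groups: Dict[str, str]) -> str:
    """Join a host table and a process group table into one string."""
    host_rows = [(host, "Busy" if busy else "Free") for host, busy in host_busy.items()]
    group_rows = list(groups.items())
    return "\n\n".join(
        [
            format_table(("Host", "Status"), host_rows),
            format_table(("ProcessGroup", "Status"), group_rows),
        ]
    )


# Hypothetical hosts and process groups.
print(status_message({"node1": True, "node2": False}, {"group-0": "Running"}))
```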

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Amanda Richardson <amanda.richardson@hpe.com>
@al-rigazzi al-rigazzi requested a review from ashao May 12, 2024 07:53
Contributor

@mellis13 mellis13 left a comment

LGTM (after the tests finish passing). Lots of incredible work in this PR!

@al-rigazzi al-rigazzi added the type: feature, area: launcher, area: api, and area: Dragon labels May 13, 2024
@al-rigazzi al-rigazzi merged commit 8606e8e into develop May 13, 2024
34 checks passed
@al-rigazzi al-rigazzi deleted the dragon_launcher branch September 11, 2024 22:49
3 participants